A new study evaluates seven large language models, including GPT-4, GPT-4o, and Gemini 1.5 Pro, on three core social media analytics tasks using Twitter data: authorship verification, post generation, and user attribute inference.
| Released by | Not yet disclosed |
|---|---|
| Release date | Not yet disclosed |
| What it is | Comprehensive evaluation study of LLMs on social media analytics tasks |
| Who it is for | AI researchers and social media analysts |
| Where to get it | arXiv preprint |
| Price | Free |
- Seven LLMs tested: GPT-4, GPT-4o, GPT-3.5-Turbo, Gemini 1.5 Pro, DeepSeek-V3, Llama 3.2, and BERT
- Three evaluation tasks: social media authorship verification, post generation, and user attribute inference
- Study uses Twitter dataset with tweets from January 2024 onward to prevent data contamination
- User study measures real users’ perceptions of LLM-generated posts conditioned on their writing style
- Occupations and interests annotated using standardized taxonomies from IAB Tech Lab 2023 and U.S. SOC 2018
- This represents the first comprehensive evaluation of modern LLMs across multiple social media analytics tasks
- The study addresses data contamination by using fresh Twitter data from January 2024 onward
- Seven major language models are benchmarked against existing baselines using standardized evaluation metrics
- User perception studies bridge the gap between automated generation and human acceptance of AI content
- Reproducible benchmarks are established for future LLM-driven social media analytics research
What is LLM Social Media Analytics
LLM social media analytics applies large language models to understand, generate, and analyze social media content at scale. [1] Large language models are deep learning systems trained on massive text datasets, capable of understanding and generating natural language across a wide range of tasks. [2]
The field encompasses three primary applications: verifying content authorship, generating authentic user-like posts, and inferring user attributes from social media activity. These capabilities enable automated content moderation, personalized marketing, and user behavior analysis across platforms like Twitter, Facebook, and Instagram.
What is New vs Previous Studies
This study introduces the first unified evaluation framework comparing multiple state-of-the-art LLMs on social media tasks simultaneously.
| Previous Approaches | This Study |
|---|---|
| Single-task evaluations | Multi-task evaluation across three core areas |
| Limited model comparison | Seven major LLMs including GPT-4, GPT-4o, Gemini 1.5 Pro |
| Potential data contamination | Fresh Twitter data from January 2024 onward |
| Automated metrics only | Human perception studies for generated content |
| Ad-hoc evaluation frameworks | Standardized taxonomies (IAB Tech Lab 2023, U.S. SOC 2018) |
How Does the Evaluation Work
The evaluation framework systematically tests LLMs across three interconnected social media analytics tasks using standardized methodologies.
- Social Media Authorship Verification: Models determine whether specific users wrote given posts using diverse sampling strategies across user types and post characteristics
- Social Media Post Generation: LLMs create authentic user-like content evaluated through comprehensive metrics measuring authenticity, coherence, and style consistency
- User Attribute Inference: Models predict user occupations and interests from social media activity using IAB Tech Lab 2023 and U.S. SOC 2018 taxonomies
- Human Perception Study: Real users evaluate LLM-generated posts conditioned on their own writing styles to measure acceptance and authenticity
- Baseline Comparison: All models are benchmarked against existing traditional analytics methods using identical datasets and evaluation criteria
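The authorship-verification task above can be sketched as a simple prompt-and-parse loop. This is an illustrative assumption, not the study's actual prompt: `build_verification_prompt` and `parse_verdict` are hypothetical helpers.

```python
# Hypothetical sketch of an authorship-verification prompt, assuming a
# chat-style LLM API; the study's exact prompts are not disclosed.

def build_verification_prompt(known_posts: list[str], candidate_post: str) -> str:
    """Assemble a yes/no authorship-verification prompt from a user's
    known posts and one candidate post."""
    examples = "\n".join(f"- {p}" for p in known_posts)
    return (
        "Here are posts written by one user:\n"
        f"{examples}\n\n"
        "Did the same user write the following post? Answer YES or NO.\n"
        f"Post: {candidate_post}"
    )

def parse_verdict(model_reply: str) -> bool:
    """Map the model's free-text reply to a boolean verdict."""
    return model_reply.strip().upper().startswith("YES")

prompt = build_verification_prompt(
    ["Coffee first, code later.", "Shipping the new build tonight!"],
    "Debugging before my first coffee. Never again.",
)
```

In practice `prompt` would be sent to each evaluated model, and `parse_verdict` applied to the reply; the diverse sampling strategies mentioned above would govern which user/post pairs are fed in.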
Benchmarks and Evidence
The study establishes reproducible benchmarks across multiple evaluation dimensions for seven major language models.
| Model | Task Coverage | Evaluation Metrics | Source |
|---|---|---|---|
| GPT-4 | All three tasks | Accuracy, authenticity, user perception | [Study] |
| GPT-4o | All three tasks | Accuracy, authenticity, user perception | [Study] |
| GPT-3.5-Turbo | All three tasks | Accuracy, authenticity, user perception | [Study] |
| Gemini 1.5 Pro | All three tasks | Accuracy, authenticity, user perception | [Study] |
| DeepSeek-V3 | All three tasks | Accuracy, authenticity, user perception | [Study] |
| Llama 3.2 | All three tasks | Accuracy, authenticity, user perception | [Study] |
| BERT | All three tasks | Accuracy, authenticity, user perception | [Study] |
Studies show that models like GPT-3.5 and GPT-4 can outperform human annotators on text classification tasks, including political content moderation. [1] However, the rapid improvement of LLMs regularly renders benchmarks obsolete as models approach or exceed human performance on specific tasks.
Who Should Care
Builders
AI developers building social media analytics tools gain standardized benchmarks for model selection and performance comparison. The evaluation framework provides reproducible methodologies for testing new models against established baselines across multiple tasks simultaneously.
Enterprise
Social media platforms and marketing agencies can leverage these findings to select optimal LLMs for content moderation, user profiling, and automated content generation. The study’s comprehensive evaluation helps enterprises make informed decisions about model deployment costs and capabilities.
End Users
Social media users benefit from improved content authenticity detection and more sophisticated automated moderation systems. The human perception studies ensure that AI-generated content meets user expectations for authenticity and relevance.
Investors
Investment decisions in AI companies focusing on social media analytics can be informed by the comparative performance data across seven major models. The study reveals market opportunities in LLM-powered social media tools and platforms.
How to Use LLMs for Social Media Today
Developers can implement LLM-based social media analytics using existing APIs and frameworks following the study’s methodologies.
- Access Model APIs: Obtain API keys for GPT-4, Gemini 1.5 Pro, or other evaluated models through their respective platforms
- Prepare Data: Collect and preprocess social media data following the study’s sampling framework for diverse user and post selection
- Implement Tasks: Deploy authorship verification, content generation, or attribute inference using the study’s prompt engineering approaches
- Evaluate Performance: Apply the standardized metrics and taxonomies (IAB Tech Lab 2023, U.S. SOC 2018) for consistent evaluation
- Validate Results: Conduct human perception studies to ensure generated content meets user authenticity expectations
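As a concrete starting point for the first three steps, here is a minimal sketch of building a chat-completion payload for the post-generation task. The model name, system prompt, and temperature below are placeholder assumptions, not the study's settings.

```python
# Sketch of wiring the post-generation task to a chat-completion API.
# Model name, prompt wording, and temperature are illustrative assumptions.

def make_generation_request(style_posts: list[str], topic: str) -> dict:
    """Build a chat-completion payload asking the model to write a post
    in the style of the supplied example posts."""
    style_block = "\n".join(f"- {p}" for p in style_posts)
    return {
        "model": "gpt-4o",   # any evaluated model exposing a chat API
        "temperature": 0.7,
        "messages": [
            {"role": "system",
             "content": "You write short social media posts that match "
                        "the user's demonstrated style."},
            {"role": "user",
             "content": f"Example posts:\n{style_block}\n\n"
                        f"Write one new post about: {topic}"},
        ],
    }

req = make_generation_request(
    ["Rainy run this morning, 10k done!", "Legs sore, spirit high."],
    "signing up for a half marathon",
)
# With the OpenAI Python SDK this payload would be sent as:
#   client.chat.completions.create(**req)
```

The same payload shape works for authorship verification or attribute inference by swapping the prompt content; the evaluation step then scores replies against the standardized taxonomies.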
LLMs vs Traditional Analytics
Large language models often outperform traditional rule-based and statistical social media analytics methods, though the margin varies by task and domain.
| Capability | LLMs | Traditional ML | Rule-based Systems |
|---|---|---|---|
| Authorship Verification | Context-aware, nuanced analysis | Feature-based classification | Keyword matching only |
| Content Generation | Human-like, contextual posts | Template-based generation | Static rule application |
| Attribute Inference | Multi-modal understanding | Statistical correlation analysis | Explicit mention detection |
| Scalability | High with API access | Moderate with training overhead | High but limited accuracy |
| Accuracy | Can match or exceed human annotators on some tasks | Good with sufficient data | Limited by rule coverage |
Risks, Limits, and Myths
- Data Contamination: Models may have seen training data similar to evaluation datasets, inflating performance metrics artificially
- Shortcut Learning: LLMs can exploit statistical correlations in test questions without genuine understanding of content [1]
- Bias Amplification: Training data biases can lead to discriminatory outcomes in user attribute inference and content generation
- Privacy Concerns: User profiling capabilities raise ethical questions about consent and data protection in social media analytics
- Evaluation Fragmentation: Current assessment landscapes show narrow benchmarks and inconsistent metrics across studies [5]
- Temporal Degradation: Model performance may decline on newer social media content due to evolving language patterns and platform changes
- Cost Scalability: API costs for large-scale social media analytics may become prohibitive for smaller organizations
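The cost-scalability risk can be made tangible with a back-of-envelope estimate. The per-token price below is a placeholder assumption, not a quoted rate; check your provider's current pricing before budgeting.

```python
# Back-of-envelope input-token cost estimate for large-scale analytics.
# The price per 1M tokens is a placeholder assumption.

def estimate_cost(n_posts: int, tokens_per_post: int,
                  usd_per_1m_input_tokens: float) -> float:
    """Rough input-token cost (USD) of running n_posts posts through an API."""
    total_tokens = n_posts * tokens_per_post
    return total_tokens / 1_000_000 * usd_per_1m_input_tokens

# E.g. 10M posts at ~120 tokens each, at an assumed $2.50 per 1M input tokens:
print(f"${estimate_cost(10_000_000, 120, 2.50):,.2f}")
```

At these assumed numbers, a single pass over 10M posts already runs into thousands of dollars before output tokens or retries, which is the concern behind the bullet above.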
FAQ
- Which LLM performs best for social media authorship verification?
- Not yet disclosed – the study evaluates seven models but specific performance rankings are not provided in the available abstract.
- How does GPT-4 compare to Gemini 1.5 Pro for social media analytics?
- Both models are evaluated across all three tasks, but comparative performance results are not yet disclosed in the study abstract.
- What makes this social media analytics evaluation different from previous studies?
- This study provides the first comprehensive evaluation across multiple tasks using fresh Twitter data from January 2024 onward to prevent data contamination.
- Can LLMs generate social media posts that users find authentic?
- The study includes human perception studies measuring real users’ acceptance of LLM-generated posts, but specific results are not yet disclosed.
- What taxonomies are used for user attribute inference evaluation?
- The study uses the IAB Tech Lab 2023 and U.S. SOC 2018 (Standard Occupational Classification) taxonomies for annotating occupations and interests.
- How do researchers prevent data contamination in LLM evaluation?
- The study uses newly collected tweets from January 2024 onward and implements systematic sampling frameworks to mitigate “seen-data” bias.
- What social media analytics tasks do LLMs perform best at?
- The study evaluates three core tasks: authorship verification, post generation, and user attribute inference, but performance rankings are not yet disclosed.
- Are the evaluation benchmarks and code publicly available?
- Yes, the code and data are provided in supplementary material and will be made publicly available upon publication.
- How accurate are LLMs at inferring user occupations from social media posts?
- Specific accuracy metrics for occupation inference are not yet disclosed in the study abstract.
- What evaluation metrics measure social media post generation quality?
- The study uses comprehensive evaluation metrics assessing authenticity and user-like content quality, but specific metrics are not detailed in the abstract.
- Can smaller organizations use these LLM social media analytics methods?
- The study establishes reproducible benchmarks, but implementation costs and accessibility for smaller organizations are not yet disclosed.
- How do traditional social media analytics tools compare to LLMs?
- LLMs are benchmarked against existing baselines, but specific comparative performance results are not provided in the available abstract.
Glossary
- Authorship Verification
- The task of determining whether a specific user wrote a given social media post based on writing style and content patterns.
- IAB Tech Lab 2023
- Interactive Advertising Bureau’s standardized taxonomy for categorizing digital content and user interests, updated in 2023.
- Large Language Model (LLM)
- A deep learning model trained on massive text datasets to understand and generate human-like language for a wide range of tasks.
- Post Generation
- The automated creation of social media content that mimics authentic user writing styles and preferences.
- Shortcut Learning
- AI systems exploiting statistical patterns in test data to achieve high scores without genuine understanding of the underlying concepts.
- Social Media Analytics
- The practice of collecting and analyzing social media data to understand user behavior, content performance, and platform trends.
- U.S. SOC 2018
- United States Standard Occupational Classification system from 2018, used for categorizing job types and professional roles.
- User Attribute Inference
- The process of predicting user characteristics like occupation, interests, or demographics from their social media activity and content.
Sources
1. Large language model – Wikipedia. https://en.wikipedia.org/wiki/Large_language_model
2. What Are Large Language Models (LLMs)? IBM. https://www.ibm.com/think/topics/large-language-models
3. Gemma 4 model card. Google AI for Developers. https://ai.google.dev/gemma/docs/core/model_card_4
4. Computer Science. arXiv. https://arxiv.org/list/cs/new
5. Large Language Models for Cybersecurity Intelligence: A Systematic Review. ScienceDirect. https://www.sciencedirect.com/org/science/article/pii/S1546221826003565
6. Assessing Capabilities of Large Language Models in Social Media Analytics: A Multi-task Quest. arXiv:2604.18955v1. https://arxiv.org/abs/2604.18955