A new study evaluates seven major large language models including GPT-4, GPT-4o, Gemini 1.5 Pro, and DeepSeek-V3 across three core social media analytics tasks using Twitter data. The research introduces systematic benchmarks for authorship verification, post generation, and user attribute inference.
| Released by | Not yet disclosed |
|---|---|
| Release date | |
| What it is | Comprehensive evaluation of modern LLMs across three social media analytics tasks |
| Who it is for | AI researchers and social media analysts |
| Where to get it | arXiv preprint |
| Price | Free |
- Seven models tested: GPT-4, GPT-4o, GPT-3.5-Turbo, Gemini 1.5 Pro, DeepSeek-V3, Llama 3.2, and BERT (as a baseline)
- Three evaluation tasks: authorship verification, post generation, and user attribute inference
- Uses a Twitter dataset of tweets from January 2024 onward to prevent data contamination
- Includes user study measuring real users’ perceptions of AI-generated posts
- Establishes reproducible benchmarks with standardized taxonomies for occupation and interest classification
- This represents the first comprehensive multi-task evaluation of modern LLMs for social media analytics
- The study addresses data contamination bias by using fresh Twitter data from January 2024 onward
- User attribute inference uses standardized taxonomies from IAB Tech Lab 2023 and 2018 U.S. SOC classifications
- Real user studies validate the authenticity of LLM-generated social media content
- Code and data will be publicly available upon publication for reproducible research
What is LLM Social Media Analytics
LLM social media analytics applies large language models to understand, generate, and analyze social media content automatically. [1] Large language models are deep learning systems trained on massive datasets that can understand and generate natural language for various tasks. [2]
The field encompasses three primary capabilities: verifying who authored specific posts, generating authentic-looking social media content, and inferring user characteristics from their posting patterns. These applications leverage LLMs’ ability to process text at scale and identify subtle patterns in writing style and content preferences.
What is New vs Previous Studies
This study introduces the first unified evaluation framework testing multiple state-of-the-art LLMs across three interconnected social media tasks simultaneously.
| Aspect | Previous Studies | This Study |
|---|---|---|
| Model Coverage | Single or few models | Seven major LLMs including GPT-4, GPT-4o, Gemini 1.5 Pro |
| Task Scope | Individual tasks | Three interconnected tasks with unified evaluation |
| Data Freshness | Potential contamination | Fresh tweets from January 2024 onward |
| User Validation | Limited human evaluation | Real user studies on generated content authenticity |
| Standardization | Custom taxonomies | IAB Tech Lab 2023 and U.S. SOC 2018 classifications |
How Does the Evaluation Work
The evaluation framework systematically tests LLMs across three core tasks using standardized methodologies and fresh data.
- Authorship Verification: Models determine whether specific users wrote given posts using diverse sampling strategies across different user types and post characteristics.
- Post Generation: LLMs create authentic-looking social media content that matches individual user writing styles and preferences.
- User Attribute Inference: Models predict user occupations and interests from posting patterns using standardized IAB Tech Lab 2023 and U.S. SOC 2018 taxonomies.
- Cross-Task Validation: Generated posts from task two are evaluated in task one to measure consistency across capabilities.
- Human Evaluation: Real users assess the authenticity of AI-generated posts conditioned on their own writing styles.
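The authorship verification step above can be sketched as a simple scoring loop; this is a minimal illustration, not the paper's exact protocol. The `ask_llm` function is a hypothetical stand-in for a real model API call, and the prompt wording is ours.

```python
# Hedged sketch of an authorship-verification evaluation loop.
# `ask_llm` is a placeholder for a real model API; its keyword-based
# behavior here is purely illustrative.

def build_prompt(user_history, candidate_post):
    """Format a yes/no authorship-verification query."""
    history = "\n".join(f"- {p}" for p in user_history)
    return (
        "Posts known to be written by one user:\n"
        f"{history}\n"
        "Did the same user write this post? Answer YES or NO:\n"
        f"{candidate_post}"
    )

def ask_llm(prompt):
    # Stand-in for a real model call: answers YES when the candidate post
    # (the prompt's last line) contains a fixed keyword. Illustrative only.
    candidate = prompt.rsplit("\n", 1)[-1]
    return "YES" if "coffee" in candidate else "NO"

def verification_accuracy(examples):
    """examples: iterable of (user_history, post, true_label) tuples."""
    correct = 0
    for history, post, label in examples:
        pred = ask_llm(build_prompt(history, post)).strip() == "YES"
        correct += int(pred == label)
    return correct / len(examples)
```

A real evaluation would swap `ask_llm` for an API client and draw examples via the study's sampling strategies across user types and post characteristics.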
Benchmarks and Evidence
The study establishes comprehensive benchmarks across multiple dimensions of social media analytics performance.
| Evaluation Metric | Task Application | Source |
|---|---|---|
| Generalization on fresh data | Authorship verification | Study methodology [Study] |
| Content authenticity scores | Post generation | Comprehensive evaluation metrics [Study] |
| IAB Tech Lab 2023 taxonomy | Interest classification | Standardized taxonomy [Study] |
| U.S. SOC 2018 classification | Occupation inference | Standardized taxonomy [Study] |
| User perception ratings | Generated content validation | Human evaluation study [Study] |
Who Should Care
Builders
AI developers building social media tools gain standardized benchmarks for evaluating model performance across multiple tasks. The reproducible evaluation framework enables systematic comparison of different LLM architectures and training approaches.
Enterprise
Social media platforms and marketing companies can assess which LLMs best suit their content moderation, user analysis, and content generation needs. The benchmarks provide evidence-based guidance for model selection and deployment strategies.
End Users
Social media users benefit from improved content authenticity detection and more sophisticated platform features powered by better-evaluated AI systems. The study’s focus on user perception ensures AI-generated content meets human expectations.
Investors
Investment decisions in AI companies can leverage these benchmarks to evaluate technical capabilities and market positioning. The comprehensive evaluation reveals which models excel at commercially valuable social media analytics tasks.
How to Use These Findings Today
Researchers and practitioners can immediately apply these evaluation methodologies to their own social media analytics projects.
- Download the dataset: Access the Twitter dataset and evaluation code from the supplementary materials upon publication.
- Implement sampling framework: Use the systematic sampling strategies for authorship verification tasks in your own applications.
- Apply standardized taxonomies: Integrate IAB Tech Lab 2023 and U.S. SOC 2018 classifications for consistent user attribute inference.
- Conduct user studies: Follow the human evaluation methodology to validate AI-generated content authenticity.
- Benchmark your models: Compare your LLM performance against the established baselines across all three tasks.
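One practical detail of the "apply standardized taxonomies" step is that model output is free text, so predictions must be mapped onto the fixed category list. The sketch below shows one plausible way to snap a free-text label to the closest taxonomy entry; the category list is a tiny illustrative subset, not the real IAB taxonomy, and the fuzzy-matching approach is our assumption, not the paper's method.

```python
# Hedged sketch: constraining free-text model output to a fixed taxonomy,
# as one might when mapping interest predictions onto IAB-style categories.
import difflib

# Illustrative subset only -- not the actual IAB Tech Lab 2023 taxonomy.
TAXONOMY = ["Technology & Computing", "Sports", "Food & Drink", "Travel"]

def normalize_to_taxonomy(raw_label, taxonomy=TAXONOMY):
    """Snap a model's free-text label to the closest taxonomy entry."""
    lowered = {t.lower(): t for t in taxonomy}
    match = difflib.get_close_matches(
        raw_label.lower(), list(lowered), n=1, cutoff=0.0
    )
    return lowered[match[0]]
```

The same pattern applies to occupation inference against SOC 2018 categories: fix the label space up front so results are comparable across models.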
LLM vs Traditional Methods
Large language models generally outperform traditional machine learning approaches on social media analytics tasks.
| Approach | Authorship Detection | Content Generation | Attribute Inference |
|---|---|---|---|
| Traditional ML | Rule-based features | Template systems | Manual feature engineering |
| BERT (baseline LLM) | Contextual embeddings | Limited generation | Pre-trained representations |
| Modern LLMs | Multi-task learning | Human-like generation | Zero-shot inference |
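To make the table's "manual feature engineering" row concrete, here is the kind of hand-built stylometric feature vector traditional ML pipelines relied on before LLMs. The specific features are illustrative examples of the approach, not ones taken from the study.

```python
# Hedged sketch: rule-based stylometric features of the kind hand-engineered
# for traditional authorship and attribute models. Feature choices are
# illustrative.

def stylometric_features(post):
    """Extract simple surface-level features from one post."""
    words = post.split()
    return {
        "char_count": len(post),
        "word_count": len(words),
        "avg_word_len": sum(len(w) for w in words) / max(len(words), 1),
        "exclamations": post.count("!"),
        "hashtags": sum(w.startswith("#") for w in words),
    }
```

Modern LLMs skip this step entirely: instead of engineering features and training a classifier per task, a single model is prompted zero-shot, which is the contrast the table above draws.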
Risks, Limits, and Myths
- Data contamination risk: Even with fresh data from January 2024 onward, some models may have seen similar patterns during training
- Platform specificity: Results focus on Twitter/X data and may not generalize to other social media platforms with different user behaviors
- Evaluation bias: Human evaluators may have unconscious preferences that affect authenticity ratings of generated content
- Temporal drift: Social media language evolves rapidly, potentially making benchmarks obsolete as user communication patterns change
- Privacy concerns: User attribute inference capabilities raise ethical questions about privacy and consent in social media analysis
- Myth of perfect accuracy: No LLM achieves 100% accuracy across all tasks, and performance varies significantly by use case
FAQ
Which LLMs were tested in the social media analytics study?
The study evaluated seven models: GPT-4, GPT-4o, GPT-3.5-Turbo, Gemini 1.5 Pro, DeepSeek-V3, Llama 3.2, and BERT across three social media analytics tasks.
What are the three main tasks evaluated in the study?
The three core tasks are social media authorship verification, social media post generation, and user attribute inference, all evaluated on a fresh Twitter dataset.
How does the study prevent data contamination bias?
Researchers used fresh Twitter data from January 2024 onward to minimize the risk of models having seen the evaluation data during training.
What standardized taxonomies are used for user attribute classification?
The study uses IAB Tech Lab 2023 taxonomy for interest classification and 2018 U.S. SOC (Standard Occupational Classification) for occupation inference.
How do researchers validate the authenticity of AI-generated posts?
The study includes human evaluation where real users assess the authenticity of LLM-generated posts conditioned on their own writing styles.
When will the code and data be publicly available?
The researchers state that code and data are provided in supplementary materials and will be made publicly available upon publication.
What makes this evaluation framework different from previous studies?
This represents the first comprehensive multi-task evaluation of modern LLMs for social media analytics, using fresh data and standardized taxonomies.
Can these benchmarks be applied to other social media platforms?
While the study focuses on Twitter/X data, the evaluation methodology and frameworks can potentially be adapted for other social media platforms.
What are the main applications of LLM social media analytics?
Key applications include content moderation, user behavior analysis, automated content generation, and demographic inference for marketing and research purposes.
How do LLMs compare to traditional methods in social media analytics?
LLMs generally outperform traditional machine learning approaches by leveraging contextual understanding and multi-task learning capabilities across social media analytics tasks.
Glossary
- Authorship Verification
- The task of determining whether a specific user wrote a given social media post based on writing style and content patterns
- Data Contamination
- When evaluation data appears in training datasets, leading to artificially inflated performance scores that don’t reflect real-world capabilities
- IAB Tech Lab Taxonomy
- Standardized classification system for digital content categories and user interests developed by the Interactive Advertising Bureau
- Large Language Model (LLM)
- Deep learning models trained on massive text datasets to understand and generate human-like language across various tasks
- Social Media Analytics
- The practice of collecting and analyzing social media data to understand user behavior, content patterns, and platform dynamics
- U.S. SOC Classification
- Standard Occupational Classification system used by federal statistical agencies to classify workers into occupational categories
- User Attribute Inference
- The process of predicting user characteristics like demographics, interests, or occupations from their social media activity patterns
- Zero-shot Inference
- AI model’s ability to perform tasks without specific training examples, using only general language understanding capabilities
Sources
1. Large language model – Wikipedia. https://en.wikipedia.org/wiki/Large_language_model
2. What Are Large Language Models (LLMs)? IBM. https://www.ibm.com/think/topics/large-language-models
3. Assessing Capabilities of Large Language Models in Social Media Analytics: A Multi-task Quest. arXiv:2604.18955v1. https://arxiv.org/abs/2604.18955