Researchers have conducted the first comprehensive evaluation of modern large language models, including GPT-4, GPT-4o, and Gemini 1.5 Pro, across three core social media analytics tasks on Twitter data, establishing new benchmarks for authorship verification, content generation, and user attribute inference.
| Released by | Not yet disclosed |
|---|---|
| Release date | |
| What it is | Comprehensive evaluation of LLMs on social media analytics tasks |
| Who it is for | Researchers and developers working on social media AI |
| Where to get it | arXiv preprint |
| Price | Free |
- This is the first comprehensive multi-task evaluation of modern LLMs on social media analytics using standardized benchmarks
- Seven models were tested: GPT-4, GPT-4o, GPT-3.5-Turbo, Gemini 1.5 Pro, DeepSeek-V3, Llama 3.2, and a BERT baseline
- The evaluation covers three core tasks on Twitter data: authorship verification, content generation, and user attribute inference
- A systematic sampling framework mitigates “seen-data” bias by evaluating on newly collected tweets
- User studies measured real users’ perceptions of LLM-generated posts conditioned on their own writing
- User attributes were annotated with standardized taxonomies: IAB Tech Lab 2023 and the 2018 U.S. SOC classifications
- Code and data will be made publicly available upon publication, establishing reproducible benchmarks
What is LLM Social Media Analytics
LLM social media analytics involves using large language models to understand, generate, and analyze social media content for tasks like authorship verification and user profiling. [1] Large language models are deep learning systems trained on immense amounts of data, making them capable of understanding and generating natural language content. [2]
Social media analytics with LLMs encompasses three primary capabilities. First, authorship verification determines whether specific users wrote particular posts by analyzing writing patterns and linguistic features. Second, content generation creates authentic, user-like posts that match individual writing styles and preferences. Third, user attribute inference extracts demographic and behavioral information from social media posts to build user profiles.
The technology leverages LLMs’ natural language understanding to process unstructured social media text at scale. Models can identify subtle linguistic patterns, sentiment, and contextual cues that traditional rule-based systems miss. This enables more sophisticated analysis of user behavior, content authenticity, and social trends across platforms like Twitter, Facebook, and Instagram.
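To make the first of these capabilities concrete, the sketch below shows one way an LLM could be prompted for authorship verification. It is a minimal illustration, not the study's method: the prompt wording and the `call_llm` client are assumptions standing in for any chat-completion API.

```python
# Minimal sketch of an authorship-verification query to an LLM.
# `call_llm` is a hypothetical stand-in for any chat-completion client;
# the study's actual prompts are not disclosed here.

def build_verification_prompt(known_posts: list[str], candidate: str) -> str:
    """Assemble a yes/no prompt from a user's known posts and a candidate post."""
    examples = "\n".join(f"- {p}" for p in known_posts)
    return (
        "The following posts were written by a single Twitter user:\n"
        f"{examples}\n\n"
        "Did the same user write the post below? Answer YES or NO.\n"
        f"{candidate}"
    )

def verify_authorship(call_llm, known_posts: list[str], candidate: str) -> bool:
    """Return True if the model judges the candidate post to be by the same author."""
    answer = call_llm(build_verification_prompt(known_posts, candidate))
    return answer.strip().upper().startswith("YES")
```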
What is New vs Previous Evaluations
This study introduces the first comprehensive multi-task evaluation framework specifically designed for social media analytics using modern LLMs. Previous evaluations typically focused on single tasks or used older model architectures without systematic bias mitigation.
| Aspect | Previous Evaluations | This Study |
|---|---|---|
| Model Coverage | Limited to 1-3 models | Seven major LLMs including GPT-4, Gemini 1.5 Pro |
| Task Scope | Single-task focus | Three integrated social media analytics tasks |
| Bias Mitigation | No systematic approach | Sampling framework using newly collected tweets |
| User Studies | Rare or absent | Real user perception measurements included |
| Standardization | Custom taxonomies | IAB Tech Lab 2023 and U.S. SOC classifications |
| Reproducibility | Limited code/data sharing | Public code and data release planned |
The evaluation addresses “seen-data” bias by testing on newly collected tweets gathered after the models’ training data cutoffs, ensuring models haven’t encountered the test content during training. This temporal separation provides more reliable performance measurements than evaluations that reuse older, potentially seen content.
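As a rough illustration of that temporal separation, the snippet below keeps only tweets created on or after a cutoff date. The cutoff value is a placeholder, since the exact collection date is not stated above, and timestamps are assumed to be plain ISO-8601 strings.

```python
from datetime import datetime

# Placeholder cutoff; the study's actual collection date is not stated above.
CUTOFF = datetime(2024, 1, 1)

def is_unseen(tweet: dict) -> bool:
    """Keep a tweet only if it was created on or after the cutoff date."""
    # Assumes naive ISO-8601 timestamps, e.g. "2024-03-05T12:00:00";
    # real Twitter API timestamps would need timezone handling.
    return datetime.fromisoformat(tweet["created_at"]) >= CUTOFF

def build_test_set(tweets: list[dict]) -> list[dict]:
    """Filter a tweet collection down to post-cutoff items for evaluation."""
    return [t for t in tweets if is_unseen(t)]
```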
How Does the Evaluation Work
The evaluation framework operates through three integrated tasks that comprehensively assess LLM capabilities in social media contexts.
- Social Media Authorship Verification: Models determine whether specific users authored given posts using systematic sampling across diverse user and post selection strategies
- Social Media Post Generation: LLMs generate authentic, user-like content evaluated using comprehensive metrics measuring authenticity and style consistency
- User Attribute Inference: Models extract occupations and interests from posts, benchmarked against existing baselines using standardized taxonomies
The methodology includes user studies bridging generation and verification tasks. Real users evaluate LLM-generated posts conditioned on their own writing styles, providing human perception data alongside automated metrics. This dual evaluation approach captures both technical performance and user acceptance.
Sampling frameworks ensure robust evaluation across different user types and content categories. The study uses diverse selection strategies to avoid bias toward specific demographics or writing styles, creating more generalizable performance measurements.
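The paper's exact selection strategies are not detailed above, but a stratified sampler in this spirit might bucket users by posting volume so that no single activity level dominates the evaluation pool. The bucket edges and per-bucket quota below are assumptions, not the study's parameters.

```python
import random
from collections import defaultdict

def sample_users(users: dict[str, list[str]], per_bucket: int, seed: int = 0) -> list[str]:
    """Draw an equal number of users from low/medium/high activity buckets.

    `users` maps user_id -> that user's posts. Bucket thresholds are
    illustrative, not taken from the study.
    """
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    buckets: dict[str, list[str]] = defaultdict(list)
    for uid, posts in users.items():
        if len(posts) < 10:
            buckets["low"].append(uid)
        elif len(posts) < 100:
            buckets["medium"].append(uid)
        else:
            buckets["high"].append(uid)
    sample: list[str] = []
    for uids in buckets.values():
        sample.extend(rng.sample(uids, min(per_bucket, len(uids))))
    return sample
```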
Benchmarks and Evidence
The study establishes new benchmarks across three social media analytics tasks using standardized evaluation metrics and human validation.
| Task | Evaluation Method | Key Metrics | Source |
|---|---|---|---|
| Authorship Verification | Systematic sampling framework | Accuracy on newly collected tweets | Study methodology |
| Content Generation | Comprehensive evaluation metrics | Authenticity and user-likeness scores | Study methodology |
| User Attribute Inference | Standardized taxonomy annotation | Performance vs existing baselines | Study methodology |
| Human Perception | User study evaluation | Real user acceptance ratings | Study methodology |
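The authenticity and user-likeness metrics listed above are not specified in the material here. As a simple stand-in, one could score a generated post by its TF-IDF cosine similarity to the user's real posts, as sketched below (scikit-learn is assumed to be installed; this is an illustrative proxy, not the study's metric).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def user_likeness(real_posts: list[str], generated: str) -> float:
    """Crude user-likeness proxy: TF-IDF cosine similarity between a
    generated post and the concatenation of the user's real posts."""
    vectorizer = TfidfVectorizer().fit(real_posts + [generated])
    real_vec = vectorizer.transform([" ".join(real_posts)])
    gen_vec = vectorizer.transform([generated])
    return float(cosine_similarity(real_vec, gen_vec)[0, 0])
```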
Studies show that models like GPT-3.5 and GPT-4 can outperform crowd workers on text annotation tasks such as content moderation and classification. [1] However, the rapid pace of LLM improvement regularly renders such benchmarks obsolete as models surpass human annotator performance. [1]
The evaluation uses IAB Tech Lab 2023 and 2018 U.S. Standard Occupational Classification taxonomies for consistent attribute annotation. This standardization enables reproducible comparisons across different studies and model versions.
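For occupation inference, a standardized taxonomy lets free-text model outputs be scored consistently. The sketch below maps predictions onto 2018 SOC major groups; the two codes shown are real SOC major groups, but the keyword table itself is a hypothetical simplification.

```python
# Hypothetical normalization of free-text occupation predictions onto
# 2018 U.S. SOC major groups. The codes are real; the mapping is a toy.
SOC_MAJOR_GROUPS = {
    "software engineer": ("15-0000", "Computer and Mathematical Occupations"),
    "data scientist": ("15-0000", "Computer and Mathematical Occupations"),
    "nurse": ("29-0000", "Healthcare Practitioners and Technical Occupations"),
}

def to_soc(prediction: str) -> tuple[str, str] | None:
    """Map an LLM's free-text occupation guess to a SOC code, if recognized."""
    return SOC_MAJOR_GROUPS.get(prediction.strip().lower())
```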
Who Should Care
Builders
Developers creating social media analysis tools gain standardized benchmarks for evaluating LLM performance across key tasks. The public code and data release enables rapid prototyping and comparison testing for new applications.
Enterprise
Companies using social media analytics for marketing, customer service, or brand monitoring can assess which LLMs best suit their specific use cases. The multi-task evaluation helps inform model selection decisions for production deployments.
End Users
Social media users benefit from improved content authenticity detection and more sophisticated user profiling systems. The human perception studies ensure generated content meets user expectations for quality and authenticity.
Investors
Investment decisions in social media AI companies can leverage these benchmarks to evaluate technical capabilities and market positioning. The comprehensive evaluation framework provides objective performance comparisons across major model providers.
How to Access Today
The research is currently available as an arXiv preprint with full methodology and initial results.
- Access the paper at arXiv:2604.18955v1 for complete methodology and evaluation framework details
- Review the systematic sampling approaches for authorship verification and bias mitigation strategies
- Examine the comprehensive evaluation metrics used for content generation assessment
- Study the standardized taxonomy implementations for user attribute inference
- Await public release of code and data materials upon formal publication
The supplementary materials include implementation details and dataset specifications. Researchers can replicate the evaluation framework using the provided methodological descriptions before full code release.
Comparison vs Competitors
| Model | Authorship Verification | Content Generation | Attribute Inference | Overall Performance |
|---|---|---|---|---|
| GPT-4 | Not yet disclosed | Not yet disclosed | Not yet disclosed | Not yet disclosed |
| GPT-4o | Not yet disclosed | Not yet disclosed | Not yet disclosed | Not yet disclosed |
| Gemini 1.5 Pro | Not yet disclosed | Not yet disclosed | Not yet disclosed | Not yet disclosed |
| DeepSeek-V3 | Not yet disclosed | Not yet disclosed | Not yet disclosed | Not yet disclosed |
| Llama 3.2 | Not yet disclosed | Not yet disclosed | Not yet disclosed | Not yet disclosed |
The evaluation includes BERT as a baseline model representing earlier transformer architectures. GPT-3.5-Turbo provides comparison with previous-generation OpenAI models, while the inclusion of open-source options like Llama 3.2 enables cost-benefit analysis.
Model performance varies across tasks, with some excelling at content generation while others perform better at analytical tasks like attribute inference. The comprehensive evaluation reveals these task-specific strengths and weaknesses.
Risks, Limits, and Myths
- Shortcut Learning Risk: LLMs may “cheat” on evaluation tasks by using statistical correlations in question wording rather than genuine understanding [1]
- Training Data Bias: Model performance depends heavily on training data quality and diversity, with biases or gaps leading to response limitations [3]
- Temporal Generalization: Models may struggle with social media content that differs significantly from their training data time periods
- Platform Specificity: Evaluation focuses on Twitter data, which may not generalize to other social media platforms with different user behaviors
- Privacy Concerns: Social media analytics raise ethical questions about user consent and data usage for profiling purposes
- Evaluation Scope Limits: The study covers three tasks but social media analytics encompasses many additional use cases not evaluated
FAQ
Which large language models were evaluated in the social media analytics study?
The study evaluated seven models: GPT-4, GPT-4o, GPT-3.5-Turbo, Gemini 1.5 Pro, DeepSeek-V3, and Llama 3.2, plus BERT as a baseline, across three social media analytics tasks.
What are the three main social media analytics tasks tested?
The three core tasks are Social Media Authorship Verification, Social Media Post Generation, and User Attribute Inference, all evaluated on Twitter data.
How does the study address seen-data bias in LLM evaluation?
Researchers used systematic sampling frameworks and evaluated generalization on newly collected tweets that postdate the models’ training data, mitigating seen-data bias.
What standardized taxonomies were used for user attribute annotation?
The study used IAB Tech Lab 2023 and 2018 U.S. Standard Occupational Classification (SOC) taxonomies for consistent occupation and interest annotation.
When will the code and data be publicly available?
The code and data are provided in supplementary material and will be made publicly available upon formal publication of the research.
How were user perceptions of LLM-generated content measured?
Researchers conducted user studies measuring real users’ perceptions of LLM-generated posts conditioned on their own writing styles and preferences.
What makes this evaluation different from previous LLM studies?
This represents the first comprehensive multi-task evaluation of modern LLMs specifically for social media analytics with systematic bias mitigation and standardized taxonomies.
Which social media platform data was used for the evaluation?
The evaluation used Twitter (X) data across all three social media analytics tasks, keeping the assessment consistent within a single platform.
How does authorship verification work in the study?
Authorship verification uses systematic sampling over diverse user and post selection strategies to determine whether specific users wrote particular social media posts.
What are the practical applications of this research?
Applications include content authenticity detection, automated user profiling, social media monitoring, and improving AI-generated content quality for marketing purposes.
Glossary
- Authorship Verification
- The process of determining whether a specific user wrote a particular social media post by analyzing writing patterns and linguistic features
- Content Generation
- The creation of authentic, user-like social media posts that match individual writing styles and preferences using AI models
- IAB Tech Lab
- Interactive Advertising Bureau Technology Laboratory, which provides standardized taxonomies for digital advertising and content classification
- Large Language Model (LLM)
- Deep learning models trained on vast amounts of text data to understand and generate human-like language across various tasks
- Seen-Data Bias
- Evaluation bias that occurs when AI models have encountered test data during training, leading to artificially inflated performance scores
- Systematic Sampling
- A structured approach to selecting representative data points that ensures diverse coverage across different user types and content categories
- U.S. SOC
- United States Standard Occupational Classification system used by federal agencies to classify workers into occupational categories
- User Attribute Inference
- The extraction of demographic, behavioral, and preference information from social media posts to build comprehensive user profiles
Sources
1. Large language model. Wikipedia. Available at: https://en.wikipedia.org/wiki/Large_language_model
2. What Are Large Language Models (LLMs)? IBM. Available at: https://www.ibm.com/think/topics/large-language-models
3. Gemma 4 model card. Google AI for Developers. Available at: https://ai.google.dev/gemma/docs/core/model_card_4