A comprehensive evaluation study assessed seven major large language models, including GPT-4, GPT-4o, Gemini 1.5 Pro, and DeepSeek-V3, across three core social media analytics tasks on Twitter data: authorship verification, post generation, and user attribute inference.
| Released by | Not yet disclosed |
|---|---|
| Release date | Not yet disclosed |
| What it is | Comprehensive evaluation of LLMs on social media analytics tasks |
| Who it is for | Researchers and developers working with social media data |
| Where to get it | arXiv preprint |
| Price | Free |
- First comprehensive evaluation of modern LLMs across three core social media analytics tasks on Twitter data
- Seven models tested: GPT-4, GPT-4o, GPT-3.5-Turbo, Gemini 1.5 Pro, DeepSeek-V3, Llama 3.2, and BERT
- Systematic sampling framework introduced to reduce “seen-data” bias in authorship verification
- User study conducted to measure real users’ perceptions of LLM-generated posts
- Standardized taxonomies used for occupation and interest annotation, enabling reproducible benchmarking across models
- Code and data will be made publicly available upon publication
What is LLM Social Media Analytics
LLM social media analytics applies large language models to understand, generate, and analyze social media content. Large language models are deep learning models trained on immense amounts of data, making them capable of understanding and generating natural language [2]. These models can perform tasks like identifying post authors, generating authentic-looking social media content, and inferring user attributes from their posts.
The field combines natural language processing capabilities with social media data analysis. Models process text from platforms like Twitter to extract insights about user behavior, content authenticity, and demographic characteristics. This application leverages LLMs’ ability to understand context, writing style, and linguistic patterns specific to social media communication.
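To make this concrete, here is a minimal sketch of how one such task, authorship verification, might be posed to an LLM. It assumes the OpenAI Python SDK and an API key in the environment; the example posts and prompt wording are illustrative, since the study’s actual prompts have not been released.

```python
# Minimal sketch of an authorship-verification query to an LLM.
# Assumes the OpenAI Python SDK (`pip install openai`) and an API key in
# the OPENAI_API_KEY environment variable. Prompt wording and example
# posts are illustrative, not the study's actual materials.
from openai import OpenAI

client = OpenAI()

known_posts = [
    "just shipped the new build, coffee number four, send help",
    "hot take: unit tests are love letters to your future self",
]
candidate_post = "deploy friday they said. it will be fine they said."

prompt = (
    "Here are posts known to be written by one user:\n"
    + "\n".join(f"- {p}" for p in known_posts)
    + "\n\nWas the following post written by the same user? Answer Yes or No.\n"
    + f"Post: {candidate_post}"
)

response = client.chat.completions.create(
    model="gpt-4o",  # any of the evaluated chat models could be swapped in
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```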
What is New vs Previous Studies
This study introduces the first comprehensive evaluation framework across multiple social media analytics tasks simultaneously. Previous research typically focused on individual tasks or limited model comparisons.
| Aspect | Previous Studies | This Study |
|---|---|---|
| Model Coverage | Limited to 1-3 models | Seven major LLMs tested |
| Task Scope | Single task focus | Three core tasks evaluated |
| Bias Mitigation | Limited temporal validation | Systematic sampling framework |
| User Validation | Automated metrics only | Real user perception study |
| Standardization | Custom annotation schemes | IAB Tech Lab and U.S. SOC taxonomies |
How Does the Evaluation Work
The evaluation framework operates through three distinct analytical tasks applied to Twitter data. Each task tests different aspects of LLM capabilities in social media contexts.
- Social Media Authorship Verification: Models determine whether specific users wrote given posts using systematic sampling across diverse user and post selection strategies
- Social Media Post Generation: LLMs generate authentic, user-like content evaluated through comprehensive metrics and real user perception studies
- User Attribute Inference: Models predict user occupations and interests using standardized IAB Tech Lab 2023 and 2018 U.S. SOC taxonomies
The framework includes temporal validation using newly collected tweets from January 2024 onward to mitigate “seen-data” bias, where models might have encountered evaluation examples during training.
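As an illustration of the temporal-validation step, the sketch below filters a tweet collection to posts created after a model’s assumed training cutoff, so the model cannot have seen them in training. The cutoff date and record layout are assumptions, not details from the paper.

```python
# Sketch of temporal validation: keep only tweets created after an
# assumed training cutoff. The cutoff and record layout are illustrative.
from datetime import datetime, timezone

ASSUMED_CUTOFF = datetime(2024, 1, 1, tzinfo=timezone.utc)

tweets = [
    {"id": "1", "text": "old tweet", "created_at": "2023-06-15T12:00:00+00:00"},
    {"id": "2", "text": "new tweet", "created_at": "2024-03-02T09:30:00+00:00"},
]

def is_unseen(tweet: dict) -> bool:
    """True if the tweet postdates the assumed training cutoff."""
    return datetime.fromisoformat(tweet["created_at"]) > ASSUMED_CUTOFF

evaluation_set = [t for t in tweets if is_unseen(t)]
print([t["id"] for t in evaluation_set])  # -> ['2']
```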
Benchmarks and Evidence
The study establishes reproducible benchmarks across seven major language models on Twitter data. Studies find that models such as GPT-3.5 and GPT-4 can outperform crowd workers on text-annotation tasks including content classification [1].
| Model | Tasks Evaluated | Evaluation Period | Source |
|---|---|---|---|
| GPT-4 | Authorship, Generation, Inference | January 2024 onward | [Study] |
| GPT-4o | Authorship, Generation, Inference | January 2024 onward | [Study] |
| Gemini 1.5 Pro | Authorship, Generation, Inference | January 2024 onward | [Study] |
| DeepSeek-V3 | Authorship, Generation, Inference | January 2024 onward | [Study] |
| Llama 3.2 | Authorship, Generation, Inference | January 2024 onward | [Study] |
The evaluation uses standardized taxonomies for consistent measurement across models and tasks.
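For a sense of what taxonomy-based scoring looks like, the sketch below maps free-text occupation predictions onto a few real 2018 U.S. SOC major groups and scores them against gold labels. The keyword-normalization table and scoring logic are assumptions, since the paper’s annotation pipeline is not public.

```python
# Sketch of taxonomy-normalized scoring for attribute inference.
# The SOC codes below are real 2018 U.S. SOC major groups; the keyword
# mapping and scoring logic are illustrative assumptions.
SOC_MAJOR_GROUPS = {
    "11-0000": "Management Occupations",
    "15-0000": "Computer and Mathematical Occupations",
    "25-0000": "Educational Instruction and Library Occupations",
    "29-0000": "Healthcare Practitioners and Technical Occupations",
}

KEYWORD_TO_SOC = {  # hypothetical normalization table
    "software engineer": "15-0000",
    "data scientist": "15-0000",
    "nurse": "29-0000",
    "teacher": "25-0000",
    "manager": "11-0000",
}

def normalize(prediction: str) -> str | None:
    """Map a free-text occupation prediction to a SOC major group."""
    prediction = prediction.lower().strip()
    for keyword, code in KEYWORD_TO_SOC.items():
        if keyword in prediction:
            return code
    return None

predictions = ["Software Engineer at a startup", "ICU nurse"]
gold = ["15-0000", "29-0000"]
accuracy = sum(normalize(p) == g for p, g in zip(predictions, gold)) / len(gold)
print(f"taxonomy-level accuracy: {accuracy:.2f}")  # -> 1.00
```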
Who Should Care
Builders
Developers creating social media analysis tools gain standardized benchmarks for model selection and performance comparison. The systematic evaluation framework provides guidance for implementing LLM-based social media analytics systems.
Enterprise
Companies using social media monitoring and brand analysis benefit from understanding which models perform best for specific tasks. The evaluation helps inform procurement decisions for social media intelligence platforms.
End Users
Social media platform users gain insights into how AI systems might analyze their content for authorship verification and attribute inference. The user perception study reveals how authentic LLM-generated content appears to real users.
Investors
Investment decisions in AI companies focused on social media analytics can leverage these benchmarks to assess competitive positioning and technical capabilities across different model providers.
How to Use Today
The study’s code and data will be made publicly available to researchers and developers upon publication. For now, access is limited to the arXiv preprint describing the methodology.
- Access the preprint at arXiv:2604.18955v1 for detailed methodology
- Review the systematic sampling framework for authorship verification tasks
- Examine the standardized taxonomies used for user attribute annotation
- Wait for public code release to implement the evaluation framework
- Apply the benchmarking methodology to your own social media datasets
The evaluation framework can be adapted for other social media platforms beyond Twitter once the implementation becomes available.
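Once the code is released, a harness along the following lines could reproduce the comparison on your own data. Everything here is a placeholder sketch, not the study’s implementation: the data format is assumed, and the toy lexical-overlap baseline stands in for real LLM calls.

```python
# Placeholder harness for benchmarking authorship verification on your
# own data. The (known_posts, candidate, label) format and the naive
# lexical-overlap baseline are illustrative assumptions; a real run
# would swap in LLM calls for each evaluated model.

def overlap_baseline(known_posts: list[str], candidate: str) -> bool:
    """Toy stand-in for an LLM: predict 'same author' when the candidate
    shares at least 20% of its words with the user's known posts."""
    known_vocab = {w for p in known_posts for w in p.lower().split()}
    cand_words = candidate.lower().split()
    shared = sum(w in known_vocab for w in cand_words)
    return shared / max(len(cand_words), 1) >= 0.2

def evaluate(predict, pairs) -> float:
    """Accuracy over (known_posts, candidate, label) triples."""
    return sum(predict(k, c) == y for k, c, y in pairs) / len(pairs)

pairs = [
    (["love my espresso machine", "espresso is life"], "another espresso day", True),
    (["love my espresso machine"], "quarterly earnings beat estimates", False),
]
print(f"baseline accuracy: {evaluate(overlap_baseline, pairs):.2f}")
```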
LLM vs Competitors
The study compares modern LLMs against existing baselines for social media analytics tasks. LLMs’ rapid improvement regularly renders benchmarks obsolete, with models exceeding human annotator performance [1].
| Approach | Authorship Verification | Content Generation | Attribute Inference |
|---|---|---|---|
| Traditional ML | Limited context understanding | Template-based output | Feature engineering required |
| BERT-based Models | Good classification accuracy | Limited generation capability | Strong attribute prediction |
| Modern LLMs | Contextual understanding | Human-like content creation | Zero-shot inference capability |
Risks, Limits, and Myths
- Seen-data bias: Models may perform artificially well on content encountered during training
- Platform specificity: Results may not generalize beyond Twitter to other social media platforms
- Temporal drift: Model performance may degrade as social media language evolves
- Privacy concerns: User attribute inference capabilities raise data protection questions
- Evaluation gaming: Models can exploit shortcut learning, relying on statistical correlations in benchmark data rather than genuinely reasoning about each question [1]
- Limited diversity: Twitter data may not represent broader social media user populations
- Annotation quality: Standardized taxonomies may not capture nuanced user attributes accurately
FAQ
Which LLMs were tested in the social media analytics study?
The study evaluated seven models: GPT-4, GPT-4o, GPT-3.5-Turbo, Gemini 1.5 Pro, DeepSeek-V3, Llama 3.2, and BERT across three social media analytics tasks.
What are the three core social media analytics tasks evaluated?
The tasks are Social Media Authorship Verification, Social Media Post Generation, and User Attribute Inference, all tested on Twitter data.
How does the study address seen-data bias in LLM evaluation?
The researchers introduced a systematic sampling framework and used newly collected tweets from January 2024 onward to mitigate bias from previously seen training data.
What standardized taxonomies were used for user attribute annotation?
The study used IAB Tech Lab 2023 and 2018 U.S. SOC taxonomies for annotating user occupations and interests in the attribute inference task.
When will the code and data be publicly available?
The code and data are provided in supplementary material and will be made publicly available upon publication, though specific timing is not yet disclosed.
How was LLM-generated content authenticity measured?
Authenticity was assessed with comprehensive automated evaluation metrics and a user study in which real users judged LLM-generated posts conditioned on their own writing styles.
What makes this evaluation framework different from previous studies?
This is the first comprehensive evaluation across multiple social media analytics tasks simultaneously, testing seven major LLMs with systematic bias mitigation and standardized taxonomies.
Can the evaluation framework be applied to other social media platforms?
While the current study focuses on Twitter data, the systematic evaluation framework could potentially be adapted for other social media platforms once implementation becomes available.
What are the main limitations of the study?
Key limitations include platform specificity to Twitter, potential temporal drift in model performance, and privacy concerns related to user attribute inference capabilities.
How do modern LLMs compare to traditional methods for social media analytics?
Modern LLMs demonstrate superior contextual understanding for authorship verification, human-like content generation, and zero-shot inference capability compared to traditional machine learning approaches.
Glossary
- Authorship Verification
- The task of determining whether a specific user wrote a given social media post based on writing style and content patterns
- Seen-data Bias
- Performance inflation that occurs when evaluation data was encountered during model training, leading to artificially high scores
- User Attribute Inference
- The process of predicting user characteristics like occupation or interests from their social media posts and behavior
- Systematic Sampling Framework
- A structured approach to selecting evaluation data that reduces bias through diverse user and post selection strategies
- IAB Tech Lab Taxonomy
- A standardized classification system for digital advertising content categories, used here for interest annotation
- U.S. SOC Taxonomy
- The Standard Occupational Classification system used by U.S. federal statistical agencies for categorizing occupations
- Temporal Validation
- Testing model performance on data collected after the model’s training cutoff to assess real-world generalization
Sources
1. Large language model – Wikipedia. https://en.wikipedia.org/wiki/Large_language_model
2. What Are Large Language Models (LLMs)? IBM. https://www.ibm.com/think/topics/large-language-models
3. Gemma 4 model card. Google AI for Developers. https://ai.google.dev/gemma/docs/core/model_card_4
4. Computer Science. arXiv. https://arxiv.org/list/cs/new
5. Large Language Models for Cybersecurity Intelligence: A Systematic Review. ScienceDirect. https://www.sciencedirect.com/org/science/article/pii/S1546221826003565
6. The 11 Best Social Media Analytics + Reporting Tools in 2026. Buffer. https://buffer.com/resources/best-social-media-analytics-tools/
7. Large Language Models for Business Process Management: Opportunities and Challenges. DBLP. https://dblp.org/rec/journals/corr/abs-2304-04309.html
8. AI-Driven Real-Time Data Quality Validation in Healthcare ETL Pipelines. ResearchGate. https://www.researchgate.net/publication/403917903_AI-Driven_Real-Time_Data_Quality_Validation_in_Healthcare_ETL_Pipelines