Researchers have conducted the first comprehensive evaluation of modern large language models, including GPT-4, GPT-4o, and Gemini 1.5 Pro, across three core social media analytics tasks on Twitter data, establishing new benchmarks for authorship verification, content generation, and user attribute inference.
| Released by | Not yet disclosed |
|---|---|
| Release date | |
| What it is | Comprehensive evaluation of LLMs on social media analytics tasks |
| Who it is for | Researchers and developers working on social media AI |
| Where to get it | arXiv preprint |
| Price | Free |
- This is the first comprehensive multi-task evaluation of modern LLMs on social media analytics using standardized benchmarks
- Seven models were tested: GPT-4, GPT-4o, GPT-3.5-Turbo, Gemini 1.5 Pro, DeepSeek-V3, Llama 3.2, and a BERT baseline
- The evaluation covers three core tasks on Twitter data: authorship verification, content generation, and user attribute inference
- A systematic sampling framework mitigates “seen-data” bias by evaluating on newly collected tweets
- User studies measured real users’ perceptions of LLM-generated posts conditioned on their own writing
- User attributes were annotated with standardized taxonomies: IAB Tech Lab 2023 and the 2018 U.S. SOC classifications
- Code and data will be made publicly available upon publication, establishing reproducible benchmarks
What is LLM Social Media Analytics
LLM social media analytics involves using large language models to understand, generate, and analyze social media content for tasks like authorship verification and user profiling. [1] Large language models are deep learning systems trained on immense amounts of data, making them capable of understanding and generating natural language content. [2]
Social media analytics with LLMs encompasses three primary capabilities. First, authorship verification determines whether specific users wrote particular posts by analyzing writing patterns and linguistic features. Second, content generation creates authentic, user-like posts that match individual writing styles and preferences. Third, user attribute inference extracts demographic and behavioral information from social media posts to build user profiles.
The technology leverages LLMs’ natural language understanding to process unstructured social media text at scale. Models can identify subtle linguistic patterns, sentiment, and contextual cues that traditional rule-based systems miss. This enables more sophisticated analysis of user behavior, content authenticity, and social trends across platforms like Twitter, Facebook, and Instagram.
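To make the first of these capabilities concrete, the sketch below shows one way an LLM could be prompted for authorship verification. It is a minimal illustration, not the study's method: the prompt wording and the `call_llm` client are assumptions standing in for any chat-completion API.

```python
# Minimal sketch of an authorship-verification query to an LLM.
# `call_llm` is a hypothetical stand-in for any chat-completion client;
# the study's actual prompts are not disclosed here.

def build_verification_prompt(known_posts: list[str], candidate: str) -> str:
    """Assemble a yes/no prompt from a user's known posts and a candidate post."""
    examples = "\n".join(f"- {p}" for p in known_posts)
    return (
        "The following posts were written by a single Twitter user:\n"
        f"{examples}\n\n"
        "Did the same user write the post below? Answer YES or NO.\n"
        f"{candidate}"
    )

def verify_authorship(call_llm, known_posts: list[str], candidate: str) -> bool:
    """Return True if the model judges the candidate post to be by the same author."""
    answer = call_llm(build_verification_prompt(known_posts, candidate))
    return answer.strip().upper().startswith("YES")
```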
What is New vs Previous Evaluations
This study introduces the first comprehensive multi-task evaluation framework specifically designed for social media analytics using modern LLMs. Previous evaluations typically focused on single tasks or used older model architectures without systematic bias mitigation.
| Aspect | Previous Evaluations | This Study |
|---|---|---|
| Model Coverage | Limited to 1-3 models | Seven major LLMs including GPT-4, Gemini 1.5 Pro |
| Task Scope | Single-task focus | Three integrated social media analytics tasks |
| Bias Mitigation | No systematic approach | Sampling framework using newly collected tweets |
| User Studies | Rare or absent | Real user perception measurements included |
| Standardization | Custom taxonomies | IAB Tech Lab 2023 and U.S. SOC classifications |
| Reproducibility | Limited code/data sharing | Public code and data release planned |
The evaluation addresses “seen-data” bias by testing on newly collected tweets gathered after the models’ training data cutoffs, ensuring models haven’t encountered the test content during training. This temporal separation provides more reliable performance measurements than evaluations that reuse older, potentially seen content.
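As a rough illustration of that temporal separation, the snippet below keeps only tweets created on or after a cutoff date. The cutoff value is a placeholder, since the exact collection date is not stated above, and timestamps are assumed to be plain ISO-8601 strings.

```python
from datetime import datetime

# Placeholder cutoff; the study's actual collection date is not stated above.
CUTOFF = datetime(2024, 1, 1)

def is_unseen(tweet: dict) -> bool:
    """Keep a tweet only if it was created on or after the cutoff date."""
    # Assumes naive ISO-8601 timestamps, e.g. "2024-03-05T12:00:00";
    # real Twitter API timestamps would need timezone handling.
    return datetime.fromisoformat(tweet["created_at"]) >= CUTOFF

def build_test_set(tweets: list[dict]) -> list[dict]:
    """Filter a tweet collection down to post-cutoff items for evaluation."""
    return [t for t in tweets if is_unseen(t)]
```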
How Does the Evaluation Work
The evaluation framework operates through three integrated tasks that comprehensively assess LLM capabilities in social media contexts.
- Social Media Authorship Verification: Models determine whether specific users authored given posts using systematic sampling across diverse user and post selection strategies
- Social Media Post Generation: LLMs generate authentic, user-like content evaluated using comprehensive metrics measuring authenticity and style consistency
- User Attribute Inference: Models extract occupations and interests from posts, benchmarked against existing baselines using standardized taxonomies
The methodology includes user studies bridging generation and verification tasks. Real users evaluate LLM-generated posts conditioned on their own writing styles, providing human perception data alongside automated metrics. This dual evaluation approach captures both technical performance and user acceptance.
Sampling frameworks ensure robust evaluation across different user types and content categories. The study uses diverse selection strategies to avoid bias toward specific demographics or writing styles, creating more generalizable performance measurements.
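The paper's exact selection strategies are not detailed above, but a stratified sampler in this spirit might bucket users by posting volume so that no single activity level dominates the evaluation pool. The bucket edges and per-bucket quota below are assumptions, not the study's parameters.

```python
import random
from collections import defaultdict

def sample_users(users: dict[str, list[str]], per_bucket: int, seed: int = 0) -> list[str]:
    """Draw an equal number of users from low/medium/high activity buckets.

    `users` maps user_id -> that user's posts. Bucket thresholds are
    illustrative, not taken from the study.
    """
    rng = random.Random(seed)  # fixed seed for reproducible sampling
    buckets: dict[str, list[str]] = defaultdict(list)
    for uid, posts in users.items():
        if len(posts) < 10:
            buckets["low"].append(uid)
        elif len(posts) < 100:
            buckets["medium"].append(uid)
        else:
            buckets["high"].append(uid)
    sample: list[str] = []
    for uids in buckets.values():
        sample.extend(rng.sample(uids, min(per_bucket, len(uids))))
    return sample
```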
Benchmarks and Evidence
The study establishes new benchmarks across three social media analytics tasks using standardized evaluation metrics and human validation.
| Task | Evaluation Method | Key Metrics | Source |
|---|---|---|---|
| Authorship Verification | Systematic sampling framework | Accuracy on newly collected tweets | Study methodology |
| Content Generation | Comprehensive evaluation metrics | Authenticity and user-likeness scores | Study methodology |
| User Attribute Inference | Standardized taxonomy annotation | Performance vs existing baselines | Study methodology |
| Human Perception | User study evaluation | Real user acceptance ratings | Study methodology |
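The authenticity and user-likeness metrics listed above are not specified in the material here. As a simple stand-in, one could score a generated post by its TF-IDF cosine similarity to the user's real posts, as sketched below (scikit-learn is assumed to be installed; this is an illustrative proxy, not the study's metric).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def user_likeness(real_posts: list[str], generated: str) -> float:
    """Crude user-likeness proxy: TF-IDF cosine similarity between a
    generated post and the concatenation of the user's real posts."""
    vectorizer = TfidfVectorizer().fit(real_posts + [generated])
    real_vec = vectorizer.transform([" ".join(real_posts)])
    gen_vec = vectorizer.transform([generated])
    return float(cosine_similarity(real_vec, gen_vec)[0, 0])
```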
Studies show that models like GPT-3.5 and GPT-4 can outperform crowd workers on text annotation tasks such as content moderation and classification. [1] However, the rapid pace of LLM improvement regularly renders such benchmarks obsolete as models surpass human annotator performance. [1]
The evaluation uses IAB Tech Lab 2023 and 2018 U.S. Standard Occupational Classification taxonomies for consistent attribute annotation. This standardization enables reproducible comparisons across different studies and model versions.
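For occupation inference, a standardized taxonomy lets free-text model outputs be scored consistently. The sketch below maps predictions onto 2018 SOC major groups; the two codes shown are real SOC major groups, but the keyword table itself is a hypothetical simplification.

```python
# Hypothetical normalization of free-text occupation predictions onto
# 2018 U.S. SOC major groups. The codes are real; the mapping is a toy.
SOC_MAJOR_GROUPS = {
    "software engineer": ("15-0000", "Computer and Mathematical Occupations"),
    "data scientist": ("15-0000", "Computer and Mathematical Occupations"),
    "nurse": ("29-0000", "Healthcare Practitioners and Technical Occupations"),
}

def to_soc(prediction: str) -> tuple[str, str] | None:
    """Map an LLM's free-text occupation guess to a SOC code, if recognized."""
    return SOC_MAJOR_GROUPS.get(prediction.strip().lower())
```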
Who Should Care
Builders
Developers creating social media analysis tools gain standardized benchmarks for evaluating LLM performance across key tasks. The public code and data release enables rapid prototyping and comparison testing for new applications.
Enterprise
Companies using social media analytics for marketing, customer service, or brand monitoring can assess which LLMs best suit their specific use cases. The multi-task evaluation helps inform model selection decisions for production deployments.
End Users
Social media users benefit from improved content authenticity detection and more sophisticated user profiling systems. The human perception studies ensure generated content meets user expectations for quality and authenticity.
Investors
Investment decisions in social media AI companies can leverage these benchmarks to evaluate technical capabilities and market positioning. The comprehensive evaluation framework provides objective performance comparisons across major model providers.
How to Access Today
The research is currently available as an arXiv preprint with full methodology and initial results.
- Access the paper at arXiv:2604.18955v1 for complete methodology and evaluation framework details
- Review the systematic sampling approaches for authorship verification and bias mitigation strategies
- Examine the comprehensive evaluation metrics used for content generation assessment
- Study the standardized taxonomy implementations for user attribute inference
- Await public release of code and data materials upon formal publication
The supplementary materials include implementation details and dataset specifications. Researchers can replicate the evaluation framework using the provided methodological descriptions before full code release.
Comparison vs Competitors
| Model | Authorship Verification | Content Generation | Attribute Inference | Overall Performance |
|---|---|---|---|---|
| GPT-4 | Not yet disclosed | Not yet disclosed | Not yet disclosed | Not yet disclosed |
| GPT-4o | Not yet disclosed | Not yet disclosed | Not yet disclosed | Not yet disclosed |
| Gemini 1.5 Pro | Not yet disclosed | Not yet disclosed | Not yet disclosed | Not yet disclosed |
| DeepSeek-V3 | Not yet disclosed | Not yet disclosed | Not yet disclosed | Not yet disclosed |
| Llama 3.2 | Not yet disclosed | Not yet disclosed | Not yet disclosed | Not yet disclosed |
The evaluation includes BERT as a baseline model representing earlier transformer architectures. GPT-3.5-Turbo provides comparison with previous-generation OpenAI models, while the inclusion of open-source options like Llama 3.2 enables cost-benefit analysis.
Model performance varies across tasks, with some excelling at content generation while others perform better at analytical tasks like attribute inference. The comprehensive evaluation reveals these task-specific strengths and weaknesses.
Risks, Limits, and Myths
- Shortcut Learning Risk: LLMs may “cheat” on evaluation tasks by using statistical correlations in question wording rather than genuine understanding [1]
- Training Data Bias: Model performance depends heavily on training data quality and diversity, with biases or gaps leading to response limitations [3]
- Temporal Generalization: Models may struggle with social media content that differs significantly from their training data time periods
- Platform Specificity: Evaluation focuses on Twitter data, which may not generalize to other social media platforms with different user behaviors
- Privacy Concerns: Social media analytics raise ethical questions about user consent and data usage for profiling purposes
- Evaluation Scope Limits: The study covers three tasks but social media analytics encompasses many additional use cases not evaluated
FAQ
Which large language models were evaluated in the social media analytics study?
The study evaluated seven models: GPT-4, GPT-4o, GPT-3.5-Turbo, Gemini 1.5 Pro, DeepSeek-V3, and Llama 3.2, plus BERT as a baseline, across three social media analytics tasks.
What are the three main social media analytics tasks tested?
The three core tasks are Social Media Authorship Verification, Social Media Post Generation, and User Attribute Inference, all evaluated on Twitter data.
How does the study address seen-data bias in LLM evaluation?
Researchers used systematic sampling frameworks and evaluated generalization on newly collected tweets that postdate the models’ training data, mitigating seen-data bias.
What standardized taxonomies were used for user attribute annotation?
The study used IAB Tech Lab 2023 and 2018 U.S. Standard Occupational Classification (SOC) taxonomies for consistent occupation and interest annotation.
When will the code and data be publicly available?
The code and data are provided in supplementary material and will be made publicly available upon formal publication of the research.
How were user perceptions of LLM-generated content measured?
Researchers conducted user studies measuring real users’ perceptions of LLM-generated posts conditioned on their own writing styles and preferences.
What makes this evaluation different from previous LLM studies?
This represents the first comprehensive multi-task evaluation of modern LLMs specifically for social media analytics with systematic bias mitigation and standardized taxonomies.
Which social media platform data was used for the evaluation?
The evaluation used Twitter (X) data across all three social media analytics tasks, keeping the assessment consistent within a single platform.
How does authorship verification work in the study?
Authorship verification uses systematic sampling over diverse user and post selection strategies to determine whether specific users wrote particular social media posts.
What are the practical applications of this research?
Applications include content authenticity detection, automated user profiling, social media monitoring, and improving AI-generated content quality for marketing purposes.
Glossary
- Authorship Verification
- The process of determining whether a specific user wrote a particular social media post by analyzing writing patterns and linguistic features
- Content Generation
- The creation of authentic, user-like social media posts that match individual writing styles and preferences using AI models
- IAB Tech Lab
- Interactive Advertising Bureau Technology Laboratory, which provides standardized taxonomies for digital advertising and content classification
- Large Language Model (LLM)
- Deep learning models trained on vast amounts of text data to understand and generate human-like language across various tasks
- Seen-Data Bias
- Evaluation bias that occurs when AI models have encountered test data during training, leading to artificially inflated performance scores
- Systematic Sampling
- A structured approach to selecting representative data points that ensures diverse coverage across different user types and content categories
- U.S. SOC
- United States Standard Occupational Classification system used by federal agencies to classify workers into occupational categories
- User Attribute Inference
- The extraction of demographic, behavioral, and preference information from social media posts to build comprehensive user profiles
Sources
1. Large language model. Wikipedia. Available at: https://en.wikipedia.org/wiki/Large_language_model
2. What Are Large Language Models (LLMs)? IBM. Available at: https://www.ibm.com/think/topics/large-language-models
3. Gemma 4 model card. Google AI for Developers. Available at: https://ai.google.dev/gemma/docs/core/model_card_4