DialToM is a new Theory of Mind benchmark that tests whether AI models can predict dialogue trajectories from their understanding of mental states. The research finds that most LLMs excel at identifying mental states but fail to forecast social interactions; Gemini 3 Pro is the sole exception.
| Released by | Not yet disclosed |
|---|---|
| Release date | |
| What it is | Theory of Mind benchmark for testing AI dialogue trajectory forecasting |
| Who it is for | AI researchers evaluating language model reasoning capabilities |
| Where to get it | GitHub repository at https://github.com/Stealth-py/DialToM |
| Price | Free |
- DialToM benchmark tests AI models’ ability to predict dialogue trajectories using mental state understanding
- Research reveals significant reasoning asymmetry in current LLMs between state identification and trajectory forecasting
- Only Gemini 3 Pro successfully leverages mental state understanding to forecast social trajectories
- Benchmark includes both Literal ToM (mental state prediction) and Functional ToM (practical application) evaluation
- Dataset and evaluation code are publicly available for reproducible research
- Most LLMs can identify mental states but cannot use this knowledge for practical dialogue forecasting
- Gemini 3 Pro stands alone in demonstrating functional Theory of Mind capabilities
- Human and LLM-generated inferences show only weak semantic similarity
- The benchmark measures both literal mental state prediction and functional application abilities
- DialToM uses natural human dialogue data with human verification for authenticity
What is DialToM
DialToM is a human-verified benchmark that evaluates AI models’ Theory of Mind abilities through dialogue trajectory forecasting. The benchmark measures how predictable mental states are, including trust between speakers [2], and uses prospective diagnostic probing to evaluate whether agents can leverage these states to forecast state-consistent future dialogue trajectories.
The benchmark distinguishes between Literal ToM (mental state prediction) and Functional ToM (practical utility of these states). This dual approach reveals whether AI models possess robust reasoning capabilities or rely on spurious correlations when demonstrating Theory of Mind abilities.
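To make the Literal/Functional split concrete, a single benchmark item could be pictured as below. This is a hypothetical schema for illustration only; the field names and example content are assumptions, not the repository's actual data format.

```python
from dataclasses import dataclass

@dataclass
class DialToMItem:
    """Illustrative item schema (assumed names, not the real DialToM format)."""
    dialogue: list[str]           # turns of the natural human dialogue
    literal_question: str         # Literal ToM: probe a mental state
    literal_choices: list[str]    # multiple-choice mental-state options
    literal_answer: int           # index of the correct mental state
    trajectory_choices: list[str] # Functional ToM: candidate future trajectories
    trajectory_answer: int        # index of the state-consistent trajectory

item = DialToMItem(
    dialogue=[
        "A: Can you cover my shift Friday?",
        "B: Again? You said last time was the last.",
    ],
    literal_question="How does B feel about A's request?",
    literal_choices=["Grateful", "Distrustful and frustrated", "Indifferent"],
    literal_answer=1,
    trajectory_choices=[
        "B agrees enthusiastically without conditions.",
        "B declines, citing A's broken promise.",
    ],
    trajectory_answer=1,
)
```

The point of the dual format: a model can answer the literal question correctly yet still pick a trajectory that contradicts the very mental state it just identified.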
What is new vs the previous version
DialToM represents a novel approach to Theory of Mind evaluation rather than an update to existing benchmarks.
| Innovation | Description |
|---|---|
| Prospective Diagnostic Forecasting | Tests whether models can identify state-consistent dialogue trajectories from mental-state profiles |
| Natural Human Dialogue | Built from authentic human conversations with human verification |
| Dual ToM Assessment | Evaluates both literal mental state prediction and functional application |
| Multiple-Choice Framework | Structured evaluation format for consistent measurement |
| Trust Factor Integration | Includes trust dynamics between speakers as measurable mental states |
How does DialToM work
DialToM operates through a systematic evaluation process that tests both mental state identification and practical application.
- Mental State Extraction: Models identify mental states from natural dialogue contexts including trust factors between speakers
- Literal ToM Testing: Benchmark evaluates accuracy of mental state predictions using multiple-choice questions
- Functional ToM Assessment: Models must forecast dialogue trajectories that remain consistent with identified mental states
- Prospective Diagnostic Probing: System tests whether models can select appropriate future dialogue paths based solely on mental-state profiles
- Semantic Similarity Analysis: Benchmark compares human and LLM-generated inferences to measure reasoning quality
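The multiple-choice scoring implied by the steps above can be sketched as follows. This is illustrative only: `ask_model` is a placeholder for any LLM call that returns a choice index, and the item fields are assumed names, not the repository's actual schema.

```python
# Illustrative scoring loop for the two ToM conditions.
def evaluate(items, ask_model):
    literal = functional = 0
    for it in items:
        # Literal ToM: identify the mental state behind the dialogue.
        if ask_model(it["dialogue"], it["literal_choices"]) == it["literal_answer"]:
            literal += 1
        # Functional ToM: pick the future trajectory consistent with that state.
        if ask_model(it["dialogue"], it["trajectory_choices"]) == it["trajectory_answer"]:
            functional += 1
    n = len(items)
    return {"literal_acc": literal / n, "functional_acc": functional / n}

# A toy model that always picks option 0: right about the mental state here,
# wrong about the trajectory — the asymmetry the research reports.
demo = [{"dialogue": ["A: Trust me."],
         "literal_choices": ["wary", "calm"], "literal_answer": 0,
         "trajectory_choices": ["B agrees", "B pushes back"], "trajectory_answer": 1}]
print(evaluate(demo, lambda dialogue, choices: 0))
# {'literal_acc': 1.0, 'functional_acc': 0.0}
```

Reporting the two accuracies separately is what exposes the reasoning asymmetry: a single combined score would hide it.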
Benchmarks and evidence
Research reveals significant performance gaps between mental state identification and trajectory forecasting across major LLMs.
| Finding | Evidence | Source |
|---|---|---|
| Reasoning Asymmetry | Most LLMs excel at identifying mental states but fail to forecast social trajectories | [1] |
| Gemini 3 Pro Exception | Only Gemini 3 Pro successfully leverages mental state understanding for trajectory forecasting | [1] |
| Weak Semantic Similarity | Only weak semantic similarities found between human and LLM-generated inferences | [1] |
| Trust Factor Measurement | Benchmark measures predictability of mental states including trust between speakers | [2] |
Who should care
Builders
AI developers building conversational systems need DialToM to evaluate Theory of Mind capabilities beyond surface-level performance. The benchmark reveals whether models can apply mental state understanding to practical dialogue scenarios.
Enterprise
Companies deploying AI for customer service, therapy, or social interaction applications require models with functional Theory of Mind abilities. DialToM identifies which models can navigate complex social dynamics effectively.
End users
Users interacting with AI assistants benefit from systems that understand and predict social trajectories. DialToM helps identify AI models capable of more nuanced, contextually appropriate responses.
Investors
Investment decisions in AI companies should consider Theory of Mind capabilities as a competitive differentiator. DialToM provides objective measurement of this increasingly important AI capability.
How to use DialToM today
Researchers can access DialToM through the publicly available GitHub repository for immediate evaluation.
- Access Repository: Visit https://github.com/Stealth-py/DialToM to download the dataset and evaluation code
- Install Dependencies: Follow repository instructions to set up the evaluation environment
- Load Dataset: Import the human-verified dialogue data with mental state annotations
- Run Evaluation: Execute benchmark tests on target LLMs using the provided multiple-choice framework
- Analyze Results: Compare Literal ToM and Functional ToM performance across different models
DialToM vs competitors
DialToM introduces unique capabilities not found in existing Theory of Mind benchmarks.
| Feature | DialToM | Traditional ToM Benchmarks | Dialogue Act Prediction |
|---|---|---|---|
| Natural Dialogue Data | Yes, human-verified | Often synthetic scenarios | Yes, corpus-derived |
| Trajectory Forecasting | Yes, state-consistent paths | No, static evaluation | Limited to next acts |
| Trust Factor Measurement | Yes, integrated | No | No |
| Functional ToM Testing | Yes, practical application | No, literal only | No |
| Prospective Probing | Yes, diagnostic forecasting | No | No |
Risks, limits, and myths
- Limited Model Coverage: Current evaluation focuses on major LLMs; smaller models may show different patterns
- Cultural Bias: Human dialogue data may reflect specific cultural contexts affecting generalizability
- Evaluation Complexity: Multiple-choice format may not capture full spectrum of Theory of Mind reasoning
- Myth – Surface Performance: High literal ToM scores do not guarantee functional Theory of Mind capabilities
- Myth – Universal Failure: Not all LLMs fail functional ToM testing; Gemini 3 Pro demonstrates success
- Dataset Size Constraints: Benchmark size limitations may affect statistical significance of results
FAQ
What is the DialToM benchmark and how does it work?
DialToM is a Theory of Mind benchmark that tests AI models’ ability to predict dialogue trajectories based on mental state understanding using natural human dialogue data.
Which AI models perform best on DialToM benchmark?
Gemini 3 Pro is the only model that successfully demonstrates functional Theory of Mind by leveraging mental state understanding to forecast social trajectories.
How is DialToM different from other Theory of Mind tests?
DialToM uniquely combines literal mental state prediction with functional application testing through prospective diagnostic forecasting of dialogue trajectories.
What does reasoning asymmetry mean in DialToM results?
Reasoning asymmetry refers to LLMs excelling at identifying mental states but failing to use this understanding for practical dialogue trajectory forecasting.
Can I access the DialToM dataset for my research?
Yes, the DialToM dataset and evaluation code are publicly available at https://github.com/Stealth-py/DialToM for reproducible research.
What is Functional ToM versus Literal ToM in DialToM?
Literal ToM measures mental state prediction accuracy while Functional ToM evaluates practical utility of these states for forecasting dialogue trajectories.
How does DialToM measure trust between dialogue participants?
DialToM measures the predictability of mental states including trust factors between speakers as part of its comprehensive Theory of Mind evaluation.
What are the limitations of the DialToM benchmark?
DialToM limitations include potential cultural bias in human dialogue data, multiple-choice format constraints, and limited coverage of smaller language models.
Why do most LLMs fail at functional Theory of Mind?
Most LLMs appear to rely on spurious correlations rather than robust reasoning: those correlations are enough to identify mental states, but not to apply that understanding when forecasting dialogue trajectories.
How can developers use DialToM to improve their AI systems?
Developers can use DialToM to identify Theory of Mind weaknesses in their models and focus training on functional application of mental state understanding.
Glossary
- Theory of Mind (ToM)
- The ability to understand that others have beliefs, desires, and intentions different from one’s own
- Literal ToM
- The capacity to identify and predict mental states in dialogue contexts
- Functional ToM
- The ability to practically apply mental state understanding to forecast future interactions
- Prospective Diagnostic Forecasting
- Testing method that evaluates whether models can identify state-consistent dialogue trajectories from mental-state profiles
- Reasoning Asymmetry
- Performance gap where models excel at one cognitive task but fail at related practical applications
- State-Consistent Trajectories
- Dialogue paths that remain logically aligned with identified mental states and trust factors