Multimodal large language models (MLLMs), including OpenAI’s GPT-4.1, demonstrate significantly lower diagnostic accuracy and triage reliability in real-world dermatology cases than on public benchmarks, according to a recent arXiv study. The research, which evaluated four open-weight MLLMs and GPT-4.1 across 5,811 hospital-based cases, found a substantial “benchmark-to-bedside” gap: top-3 diagnostic accuracy dropped from up to 42.25% on benchmarks to 24.65% for GPT-4.1 and as low as 1.50% for some open-weight models when using real-world images alone.
- MLLMs, including GPT-4.1, show a significant performance drop in real-world dermatology compared to public benchmarks.
- Top-3 diagnostic accuracy for GPT-4.1 fell from 42.25% on benchmarks to 24.65% with real-world images alone.
- Incorporating clinical context improved accuracy for all models, but outputs were highly sensitive to incomplete or erroneous information.
- Models achieved moderate sensitivity (above 60%) for severity-based triage, suggesting potential for screening but not clinical deployment.
- The study highlights that current MLLM benchmarks substantially overestimate clinical utility in dermatology.
What changed
The core finding of the arXiv study is a quantified “benchmark-to-bedside” gap in dermatology for MLLMs. While multimodal LLMs, which can process images alongside text, have been a significant development [1], their real-world clinical efficacy has remained largely unvalidated beyond curated datasets. This research directly addresses that gap by moving beyond public benchmarks to evaluate models against a retrospective, multi-site, hospital-based dermatology consultation cohort comprising 5,811 cases and 46,405 clinical images.
Previous evaluations of LLMs often rely on quantitative scores from benchmarks, which can simplify model comparison but may not reflect real-world complexity [4]. This study specifically tested four open-weight MLLMs (InternVL-Chat v1.5, LLaVA-Med v1.5, SkinGPT4, and MedGemma-4B-Instruct, which is based on Google’s Gemma model [2]) and one commercial MLLM (GPT-4.1) on two clinically relevant tasks: differential diagnosis generation and severity-based triage. The key change is the direct comparison of performance on public datasets versus a large, complex real-world dataset, revealing a substantial degradation in accuracy when confronted with the nuances of actual clinical data.
Why it matters for operators
For operators in healthcare AI, this study is a stark reminder that impressive benchmark numbers often represent an idealized environment, not the messy reality of clinical practice. The “benchmark-to-bedside” gap isn’t just a statistical anomaly; it’s a critical operational risk. Founders building diagnostic AI tools need to internalize that their product’s true value will be judged by its performance on diverse, noisy, and often incomplete real-world data, not just curated test sets. This means investing heavily in data acquisition strategies that mirror actual clinical workflows, including varied image quality, incomplete patient histories, and the full spectrum of disease presentation, rather than relying solely on publicly available datasets.
Engineers should focus on robust error handling and uncertainty quantification in their MLLM-powered systems. The finding that model outputs are “highly sensitive to incomplete or erroneous consultation context” underscores the need for systems that can gracefully degrade or explicitly flag uncertainty when presented with suboptimal input, rather than confidently generating incorrect diagnoses.

Traders and investors should view claims of high accuracy in medical AI with skepticism unless they are backed by rigorous, real-world validation studies that go beyond standard benchmarks. The current MLLM landscape in dermatology suggests that while these models have potential for screening or as decision support tools, they are far from autonomous diagnostic agents. The operational takeaway is clear: focus on augmented intelligence, where AI supports human experts, rather than attempting to replace them, especially in high-stakes domains like medical diagnosis.
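To make the “gracefully degrade” point concrete, here is a minimal sketch of an uncertainty gate in Python. Everything here (`gate_output`, `MIN_CONFIDENCE`, `REQUIRED_FIELDS`, the field names) is an illustrative assumption, not code or thresholds from the study:

```python
from dataclasses import dataclass

@dataclass
class ModelOutput:
    diagnoses: list[str]  # ranked differential diagnoses, most likely first
    confidence: float     # calibrated confidence in [0, 1]

# Illustrative values only; real thresholds and required fields would need
# to be tuned and validated on clinical data.
MIN_CONFIDENCE = 0.7
REQUIRED_FIELDS = ("lesion_site", "duration", "symptoms")

def gate_output(output: ModelOutput, context: dict) -> dict:
    """Surface the model's top-3 differential only when the consultation
    context is complete and confidence clears a threshold; otherwise
    route the case to a human reviewer instead of guessing."""
    missing = [f for f in REQUIRED_FIELDS if not context.get(f)]
    if missing:
        return {"status": "needs_review",
                "reason": "incomplete context: " + ", ".join(missing)}
    if output.confidence < MIN_CONFIDENCE:
        return {"status": "needs_review",
                "reason": f"low confidence ({output.confidence:.2f})"}
    return {"status": "ok", "diagnoses": output.diagnoses[:3]}
```

The design choice is the point: the system refuses to emit a differential on incomplete context, which is exactly the input condition the study found models most sensitive to.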
Benchmarks and evidence
The study provides a clear quantitative comparison between MLLM performance on public benchmarks and a real-world clinical cohort.
Top-3 Diagnostic Accuracy
The “top-3 diagnostic accuracy” metric refers to whether the correct diagnosis was among the top three differential diagnoses suggested by the model.
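As a quick illustration (our sketch, not code from the paper), this metric can be computed as follows, assuming each prediction is a ranked list of diagnosis strings:

```python
def top_k_accuracy(predictions: list[list[str]], labels: list[str], k: int = 3) -> float:
    """Fraction of cases whose true diagnosis appears in the model's
    top-k ranked differential."""
    hits = sum(label in preds[:k] for preds, label in zip(predictions, labels))
    return hits / len(labels)

# Two cases: the first is a top-3 hit, the second is a miss -> 0.5
preds = [["psoriasis", "eczema", "tinea corporis"],
         ["acne vulgaris", "rosacea", "perioral dermatitis"]]
print(top_k_accuracy(preds, ["eczema", "lupus"]))  # 0.5
```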
| Model | Public Benchmarks (Top-3) | Real-World Images Alone (Top-3) | Real-World with Context (Top-3) |
|---|---|---|---|
| Best Open-Weight Model | 26.55% | 1.50%–13.35% | Up to 28.75% |
| GPT-4.1 | 42.25% | 24.65% | 38.93% |
Source: arXiv cs.CV
These figures demonstrate a substantial drop in performance when models move from public benchmarks, which are likely cleaner and more curated, to real-world clinical images. For the best open-weight model, top-3 accuracy plummeted from 26.55% to a range of 1.50%–13.35% using images alone. GPT-4.1, while performing better overall, still saw its top-3 accuracy fall from 42.25% on benchmarks to 24.65% with real-world images. Incorporating clinical context significantly improved performance across all models, highlighting the importance of rich patient data beyond images alone.
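The paper’s exact prompting protocol is not reproduced here, but a hypothetical sketch of the two input conditions (images alone vs. images plus clinical context) might look like this; `build_prompt` and the context fields are assumptions for illustration:

```python
def build_prompt(image_ref: str, context: dict | None = None) -> str:
    """Assemble the two input conditions compared in the study:
    image alone vs. image plus clinical context."""
    prompt = (f"[IMAGE: {image_ref}]\n"
              "List the top-3 most likely diagnoses, ranked.")
    if context:  # "with context" condition
        details = "; ".join(f"{k}: {v}" for k, v in context.items())
        prompt = f"Clinical context: {details}\n" + prompt
    return prompt

# Images-alone vs. with-context conditions for the same case
print(build_prompt("case_0001.jpg"))
print(build_prompt("case_0001.jpg",
                   {"age": 54, "site": "forearm", "duration": "3 weeks"}))
```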
Severity-Based Triage
For severity-based triage, models achieved “moderate sensitivity (above 60%)”. This suggests potential for flagging cases that require urgent attention, but the study explicitly describes this as “insufficient reliability for clinical deployment,” most plausibly because specificity or overall balanced accuracy fell short, though the paper does not specify which metrics.
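To ground the sensitivity-vs-specificity distinction, here is a minimal sketch with made-up labels (illustrative data, not the study’s) showing why sensitivity alone does not establish deployability:

```python
def sensitivity_specificity(y_true: list[int], y_pred: list[int]) -> tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP),
    with 1 = urgent and 0 = non-urgent triage labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    return tp / (tp + fn), tn / (tn + fp)

# A screener can clear 60% sensitivity while still over-triaging most
# non-urgent cases (low specificity), which limits clinical deployment.
sens, spec = sensitivity_specificity([1, 1, 1, 0, 0, 0, 0, 0],
                                     [1, 1, 0, 1, 1, 1, 0, 0])
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")  # 0.67, 0.40
```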
Risks and open questions
- Generalizability beyond Dermatology: While this study focuses on dermatology, the “benchmark-to-bedside” gap likely extends to other medical specialties relying on multimodal AI for diagnosis (e.g., radiology, pathology). Operators should question the clinical readiness of MLLMs across all medical domains.
- Data Biases and Representativeness: The real-world cohort of 5,811 cases is substantial, but its representativeness across diverse demographics, skin types, and geographical regions is not fully detailed. Biases in training data could exacerbate real-world performance issues.
- Interpretability and Explainability: The study highlights sensitivity to incomplete or erroneous context. This raises questions about how MLLMs arrive at their conclusions and whether their reasoning can be audited or explained to clinicians, which is crucial for trust and adoption.
- Dynamic Clinical Context: Real clinical context is often dynamic and evolving. How well do MLLMs adapt to new information or changes in patient status over time? The study primarily focuses on a single snapshot of consultation context.
- Regulatory Pathways: Given the demonstrated gap, what are the appropriate regulatory pathways for deploying such models? Current benchmarks may be insufficient for regulatory approval, necessitating new standards for real-world validation.
Sources
1. Large language model – Wikipedia — https://en.wikipedia.org/wiki/Large_language_model
2. Google models | Generative AI on Vertex AI | Google Cloud Documentation — https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models
3. The Weekly Roundup: January 19-23 | Dermatology Times — https://www.dermatologytimes.com/view/the-weekly-roundup-january-19-23
4. What Are Large Language Models (LLMs)? | IBM — https://www.ibm.com/think/topics/large-language-models
5. Long-Term Safety of Roflumilast Cream for Children, With Lawrence Eichenfield, MD | HCPLive — https://www.hcplive.com/view/long-term-safety-roflumilast-cream-children-lawrence-eichenfield-md