New research reveals that FHIR data serialization strategy significantly impacts LLM medication reconciliation performance. Clinical Narrative format outperforms Raw JSON by up to 19 F1 points for models under 8B parameters, while larger 70B models perform best with Raw JSON format.
| Released by | arXiv researchers |
|---|---|
| Release date | |
| What it is | Systematic comparison of FHIR serialization strategies for LLM medication reconciliation |
| Who it is for | Healthcare AI developers and clinical informatics teams |
| Where to get it | arXiv preprint |
| Price | Open access |
- Researchers tested 5 open-weight LLMs across 4 FHIR serialization strategies on 200 synthetic patients
- Clinical Narrative format improves performance by up to 19 F1 points for smaller models
- Raw JSON performs best for 70B parameter models with 0.9956 mean F1 score
- All models show higher precision than recall, indicating omission as the dominant failure mode
- BioMistral-7B, which is domain-pretrained but not instruction-tuned, produces zero usable output
- FHIR serialization strategy choice can impact medication reconciliation F1 scores by up to 19 points
- Clinical Narrative format works best for models up to 8B parameters
- Raw JSON format achieves optimal performance for 70B+ parameter models
- Domain pretraining alone, without instruction tuning, produces unusable results
- Smaller models plateau at 7-10 concurrent medications, underserving polypharmacy patients
What is FHIR serialization for LLM medication reconciliation
FHIR serialization for LLM medication reconciliation involves converting structured patient health records into text formats that language models can process for identifying medication discrepancies. The research examines how different data presentation formats affect model accuracy in this critical healthcare task.
Medication reconciliation at clinical handoffs represents a high-stakes, error-prone process where healthcare providers must identify discrepancies between prescribed and actual medications [1]. Large language models increasingly assist with this task using FHIR-structured patient records, but the fundamental variable of data serialization remains largely unstudied [1].
The study evaluates four distinct serialization approaches: Raw JSON maintains the original structured format, Markdown Table presents data in tabular form, Clinical Narrative converts information to prose, and Chronological Timeline organizes events by time sequence [1].
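The four strategies can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the paper's actual serializers are not reproduced here, real FHIR MedicationRequest resources are far more deeply nested, and the record fields below are hypothetical simplifications.

```python
import json

# Hypothetical, flattened medication records (real FHIR resources are nested).
meds = [
    {"name": "Lisinopril", "dose": "10 mg daily", "status": "active", "start": "2023-01-04"},
    {"name": "Metformin", "dose": "500 mg twice daily", "status": "active", "start": "2022-06-17"},
]

def raw_json(meds):
    # Strategy 1: keep the original structured format.
    return json.dumps(meds, indent=2)

def markdown_table(meds):
    # Strategy 2: tabular presentation.
    rows = ["| Medication | Dose | Status | Start |", "|---|---|---|---|"]
    rows += [f"| {m['name']} | {m['dose']} | {m['status']} | {m['start']} |" for m in meds]
    return "\n".join(rows)

def clinical_narrative(meds):
    # Strategy 3: convert the list to prose.
    parts = [f"{m['name']} {m['dose']} ({m['status']} since {m['start']})" for m in meds]
    return "The patient is currently taking " + "; ".join(parts) + "."

def chronological_timeline(meds):
    # Strategy 4: order events by start date.
    ordered = sorted(meds, key=lambda m: m["start"])
    return "\n".join(f"{m['start']}: started {m['name']} {m['dose']}" for m in ordered)
```

Each function emits the same underlying facts in a different surface form, which is exactly the variable the study isolates.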
What is new vs previous research
This represents the first systematic comparison of FHIR serialization strategies for LLM medication reconciliation across multiple model sizes and architectures.
| Previous Research | This Study |
|---|---|
| Limited serialization format testing | Systematic comparison of 4 serialization strategies |
| Single model evaluations | 5 open-weight models tested (Phi-3.5-mini to Llama-3.3-70B) |
| Small-scale validation | 200 synthetic patients, 4,000 total inference runs |
| Unclear format recommendations | Evidence-based format guidelines by model size |
| Limited failure mode analysis | Detailed precision vs recall analysis showing omission patterns |
How does FHIR serialization impact LLM performance
FHIR serialization impacts LLM performance through differences in information density and structural clarity, and in how much parsing effort each presentation format demands of the model.
- Format Processing: Models receive identical patient data in Raw JSON, Markdown Table, Clinical Narrative, or Chronological Timeline formats
- Model Size Dependency: Smaller models (under 8B parameters) benefit from Clinical Narrative’s natural language structure
- Large Model Advantage: 70B parameter models leverage Raw JSON’s structured precision for optimal accuracy
- Failure Mode Patterns: All models show higher precision than recall, indicating systematic medication omission rather than fabrication
- Complexity Limitations: Models plateau at 7-10 concurrent medications regardless of serialization strategy
Benchmarks and evidence
The study provides comprehensive performance data across 20 model-strategy combinations with statistically significant results.
| Model | Best Format | F1 Score | Performance Gain | Source |
|---|---|---|---|---|
| Mistral-7B | Clinical Narrative | Not yet disclosed | +19 F1 points vs Raw JSON | [2] |
| Llama-3.3-70B | Raw JSON | 0.9956 | Best overall performance | [2] |
| BioMistral-7B | None | 0 | Zero usable output | [2] |
| All models | Varies | Not yet disclosed | Precision > Recall consistently | [2] |
The research demonstrates statistically significant effects with correlation coefficient r = 0.617 and p-value less than 10^-10 for serialization strategy impact [2]. The complete pipeline runs reproducibly on AWS g6e.xlarge instances with NVIDIA L40S GPUs and 48 GB VRAM [2].
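The precision-versus-recall pattern can be made concrete with set-based scoring, a common way to evaluate medication extraction (a sketch; the paper's exact evaluation code is not reproduced here). When a model omits a medication but fabricates none, precision stays perfect while recall, and therefore F1, drops:

```python
def reconciliation_scores(predicted, gold):
    """Set-based precision, recall, and F1 for an extracted medication list."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # medications correctly recovered
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Omission-dominant failure: the model drops one drug but invents none.
gold = {"lisinopril", "metformin", "atorvastatin", "aspirin"}
predicted = {"lisinopril", "metformin", "atorvastatin"}  # aspirin omitted
p, r, f1 = reconciliation_scores(predicted, gold)
# precision = 1.0, recall = 0.75: precision > recall, matching the paper's pattern
```

Hallucinating an extra drug would instead lower precision, so a consistent precision > recall gap is a signature of omissions rather than fabrications.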
Who should care
Builders
Healthcare AI developers should implement format-specific pipelines based on model size constraints. Clinical Narrative serialization provides optimal results for resource-constrained deployments using models under 8B parameters, while Raw JSON maximizes accuracy for large-scale implementations.
Enterprise
Healthcare organizations deploying LLM-assisted medication reconciliation systems need evidence-based serialization strategies to minimize patient safety risks. The research provides actionable guidelines for clinical informatics teams implementing FHIR-based AI workflows.
End users
Clinical pharmacists and healthcare providers benefit from understanding LLM limitations in polypharmacy cases. The study reveals systematic underperformance for patients with 7+ concurrent medications, requiring enhanced human oversight protocols.
Investors
Healthcare AI investment decisions should consider serialization optimization as a competitive differentiator. Companies implementing evidence-based format strategies may achieve significant performance advantages in clinical AI markets.
How to use these findings today
Healthcare AI teams can immediately implement optimized FHIR serialization strategies based on their model deployment constraints.
- Assess Model Size: Determine if your deployment uses models under or over 8B parameters
- Choose Format: Implement Clinical Narrative for smaller models, Raw JSON for 70B+ models
- Validate Performance: Test serialization impact on your specific patient population and use cases
- Monitor Omissions: Implement enhanced oversight for medication omission detection given precision-recall patterns
- Scale Considerations: Plan additional safeguards for polypharmacy patients with 7+ concurrent medications
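The first two steps above reduce to a few lines of selection logic. A sketch under stated assumptions: the 8B and 70B cutoffs mirror the model sizes the study actually tested, the range in between is untested, and the conservative fallback for that range is this author's assumption, not a finding.

```python
def pick_serialization(param_count_billions: float) -> str:
    """Choose a FHIR serialization strategy from model size.

    Based on the study's findings: Clinical Narrative for models up to 8B
    parameters, Raw JSON for 70B+. Sizes between 8B and 70B were not
    evaluated, so that branch is a conservative assumption; validate on
    your own patient population before relying on it.
    """
    if param_count_billions <= 8:
        return "clinical_narrative"
    if param_count_billions >= 70:
        return "raw_json"
    return "clinical_narrative"  # untested middle range: assumed default
```

For example, `pick_serialization(7)` returns `"clinical_narrative"` for a Mistral-7B-class deployment, while `pick_serialization(70)` returns `"raw_json"`.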
FHIR serialization strategies comparison
| Strategy | Best For | Advantages | Limitations |
|---|---|---|---|
| Clinical Narrative | Models ≤8B parameters | Natural language processing, up to +19 F1 points | Less structured, potential ambiguity |
| Raw JSON | Models ≥70B parameters | Structured precision, 0.9956 F1 score | Requires larger computational resources |
| Markdown Table | Moderate complexity cases | Visual organization, human readable | Limited performance gains shown |
| Chronological Timeline | Time-sensitive workflows | Temporal context preservation | May miss concurrent medication patterns |
Risks, limits, and myths
- Polypharmacy Limitation: All models plateau at 7-10 concurrent medications, systematically underserving high-risk patients
- Omission Bias: Models consistently miss active medications more than fabricating false ones, requiring targeted safety protocols
- Domain Pretraining Myth: BioMistral-7B produces zero usable output, showing that medical domain pretraining without instruction tuning is insufficient
- Synthetic Data Limits: Results based on 200 synthetic patients may not generalize to real clinical populations
- Hardware Requirements: Optimal performance requires significant computational resources (48 GB VRAM for 70B models)
- Format Dependency: Performance gains are model-size dependent, requiring careful deployment planning
FAQ
What FHIR serialization format works best for small LLMs in healthcare?
Clinical Narrative format performs best for models up to 8B parameters, providing up to 19 F1 points improvement over Raw JSON for medication reconciliation tasks [2].
Which LLM size performs best for medication reconciliation?
Llama-3.3-70B achieves the highest performance with 0.9956 mean F1 score using Raw JSON serialization format [2].
Do domain-pretrained medical LLMs work better for FHIR data?
BioMistral-7B produces zero usable output despite domain pretraining, showing that medical pretraining alone without instruction tuning is insufficient [2].
How many medications can LLMs handle simultaneously?
Smaller models plateau at roughly 7-10 concurrent active medications, leaving polypharmacy patients systematically underserved [2].
What hardware is needed to reproduce these FHIR LLM results?
The complete pipeline runs reproducibly on AWS g6e.xlarge instances with NVIDIA L40S GPUs and 48 GB VRAM [2].
Do LLMs miss medications or add fake ones more often?
All models show higher precision than recall, meaning omission is the dominant failure mode rather than fabrication [2].
How many patients were tested in this FHIR serialization study?
Researchers tested 5 open-weight LLMs across 4 serialization strategies on 200 synthetic patients, totaling 4,000 inference runs [3].
What statistical significance did the FHIR format comparison achieve?
The serialization strategy effect shows statistical significance with correlation coefficient r = 0.617 and p-value less than 10^-10 [2].
Can I run these FHIR LLM tests locally?
Yes, researchers ran 5 open-weight models locally via Ollama using Q4_K_M quantization on an L40S GPU [4].
Which models were compared in the FHIR serialization benchmark?
The study compared Phi-3.5-mini, Mistral-7B, BioMistral-7B, Llama-3.1-8B, and Llama-3.3-70B across four serialization strategies [3].
Glossary
- FHIR
- Fast Healthcare Interoperability Resources, a standard for exchanging healthcare information electronically
- Medication Reconciliation
- Process of comparing patient medication lists across care transitions to identify and resolve discrepancies
- Serialization
- Converting structured data into a format suitable for transmission or processing by computer systems
- F1 Score
- Harmonic mean of precision and recall, measuring model accuracy in classification tasks
- Polypharmacy
- Concurrent use of multiple medications by a patient, typically 5 or more drugs
- Clinical Handoff
- Transfer of patient care responsibility between healthcare providers or care settings
- Instruction Tuning
- Fine-tuning process that teaches language models to follow specific instructions and produce structured outputs
- Domain Pretraining
- Initial training of language models on specialized datasets from specific fields like medicine
Sources
1. Serialisation Strategy Matters: How FHIR Data Format Affects LLM Medication Reconciliation — https://arxiv.org/html/2604.21076
2. [2604.21076] Serialisation Strategy Matters: How FHIR Data Format Affects LLM Medication Reconciliation — https://arxiv.org/abs/2604.21076
3. Seeking arXiv cs.CL endorsement. FHIR medication reconciliation with LLMs – Research – Hugging Face Forums — https://discuss.huggingface.co/t/seeking-arxiv-cs-cl-endorsement-fhir-medication-reconciliation-with-llms/175459
4. Seeking arXiv cs.CL endorsement, local LLM clinical NLP benchmark (Ollama, 5 models) — https://community.deeplearning.ai/t/seeking-arxiv-cs-cl-endorsement-local-llm-clinical-nlp-benchmark-ollama-5-models/891721