New research reveals that FHIR data serialization strategy significantly impacts LLM medication reconciliation performance. Clinical Narrative format outperforms Raw JSON by up to 19 F1 points for models under 8B parameters, while larger 70B models perform best with Raw JSON format.
| Released by | arXiv researchers |
|---|---|
| Release date | |
| What it is | Systematic comparison of FHIR serialization strategies for LLM medication reconciliation |
| Who it is for | Healthcare AI developers and clinical informatics teams |
| Where to get it | arXiv preprint |
| Price | Open access |
- Researchers tested 5 open-weight LLMs across 4 FHIR serialization strategies on 200 synthetic patients
- Clinical Narrative format improves performance by up to 19 F1 points for smaller models
- Raw JSON performs best for 70B parameter models with 0.9956 mean F1 score
- All models show higher precision than recall, indicating omission as the dominant failure mode
- BioMistral-7B, which is domain-pretrained but not instruction-tuned, produces zero usable output
- FHIR serialization strategy choice can impact medication reconciliation F1 scores by up to 19 points
- Clinical Narrative format works best for models up to 8B parameters
- Raw JSON format achieves optimal performance for 70B+ parameter models
- Domain pretraining alone, without instruction tuning, produces unusable results
- Smaller models plateau at 7-10 concurrent medications, underserving polypharmacy patients
What is FHIR serialization for LLM medication reconciliation
FHIR serialization for LLM medication reconciliation involves converting structured patient health records into text formats that language models can process for identifying medication discrepancies. The research examines how different data presentation formats affect model accuracy in this critical healthcare task.
Medication reconciliation at clinical handoffs represents a high-stakes, error-prone process where healthcare providers must identify discrepancies between prescribed and actual medications [1]. Large language models increasingly assist with this task using FHIR-structured patient records, but the fundamental variable of data serialization remains largely unstudied [1].
The study evaluates four distinct serialization approaches: Raw JSON maintains the original structured format, Markdown Table presents data in tabular form, Clinical Narrative converts information to prose, and Chronological Timeline organizes events by time sequence [1].
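The four strategies can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the paper's actual serializers are not reproduced here, real FHIR MedicationRequest resources are far more deeply nested, and the record fields below are hypothetical simplifications.

```python
import json

# Hypothetical, flattened medication records (real FHIR resources are nested).
meds = [
    {"name": "Lisinopril", "dose": "10 mg daily", "status": "active", "start": "2023-01-04"},
    {"name": "Metformin", "dose": "500 mg twice daily", "status": "active", "start": "2022-06-17"},
]

def raw_json(meds):
    # Strategy 1: keep the original structured format.
    return json.dumps(meds, indent=2)

def markdown_table(meds):
    # Strategy 2: tabular presentation.
    rows = ["| Medication | Dose | Status | Start |", "|---|---|---|---|"]
    rows += [f"| {m['name']} | {m['dose']} | {m['status']} | {m['start']} |" for m in meds]
    return "\n".join(rows)

def clinical_narrative(meds):
    # Strategy 3: convert the list to prose.
    parts = [f"{m['name']} {m['dose']} ({m['status']} since {m['start']})" for m in meds]
    return "The patient is currently taking " + "; ".join(parts) + "."

def chronological_timeline(meds):
    # Strategy 4: order events by start date.
    ordered = sorted(meds, key=lambda m: m["start"])
    return "\n".join(f"{m['start']}: started {m['name']} {m['dose']}" for m in ordered)
```

Each function emits the same underlying facts in a different surface form, which is exactly the variable the study isolates.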
What is new vs previous research
This represents the first systematic comparison of FHIR serialization strategies for LLM medication reconciliation across multiple model sizes and architectures.
| Previous Research | This Study |
|---|---|
| Limited serialization format testing | Systematic comparison of 4 serialization strategies |
| Single model evaluations | 5 open-weight models tested (Phi-3.5-mini to Llama-3.3-70B) |
| Small-scale validation | 200 synthetic patients, 4,000 total inference runs |
| Unclear format recommendations | Evidence-based format guidelines by model size |
| Limited failure mode analysis | Detailed precision vs recall analysis showing omission patterns |
How does FHIR serialization impact LLM performance
FHIR serialization impacts LLM performance through differences in information density and structural clarity, and in how much parsing effort each presentation format demands of the model.
- Format Processing: Models receive identical patient data in Raw JSON, Markdown Table, Clinical Narrative, or Chronological Timeline formats
- Model Size Dependency: Smaller models (under 8B parameters) benefit from Clinical Narrative’s natural language structure
- Large Model Advantage: 70B parameter models leverage Raw JSON’s structured precision for optimal accuracy
- Failure Mode Patterns: All models show higher precision than recall, indicating systematic medication omission rather than fabrication
- Complexity Limitations: Models plateau at 7-10 concurrent medications regardless of serialization strategy
Benchmarks and evidence
The study provides comprehensive performance data across 20 model-strategy combinations with statistically significant results.
| Model | Best Format | F1 Score | Performance Gain | Source |
|---|---|---|---|---|
| Mistral-7B | Clinical Narrative | Not yet disclosed | +19 F1 points vs Raw JSON | [2] |
| Llama-3.3-70B | Raw JSON | 0.9956 | Best overall performance | [2] |
| BioMistral-7B | None | 0 | Zero usable output | [2] |
| All models | Varies | Not yet disclosed | Precision > Recall consistently | [2] |
The research demonstrates statistically significant effects with correlation coefficient r = 0.617 and p-value less than 10^-10 for serialization strategy impact [2]. The complete pipeline runs reproducibly on AWS g6e.xlarge instances with NVIDIA L40S GPUs and 48 GB VRAM [2].
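The precision-versus-recall pattern can be made concrete with set-based scoring, a common way to evaluate medication extraction (a sketch; the paper's exact evaluation code is not reproduced here). When a model omits a medication but fabricates none, precision stays perfect while recall, and therefore F1, drops:

```python
def reconciliation_scores(predicted, gold):
    """Set-based precision, recall, and F1 for an extracted medication list."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # medications correctly recovered
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Omission-dominant failure: the model drops one drug but invents none.
gold = {"lisinopril", "metformin", "atorvastatin", "aspirin"}
predicted = {"lisinopril", "metformin", "atorvastatin"}  # aspirin omitted
p, r, f1 = reconciliation_scores(predicted, gold)
# precision = 1.0, recall = 0.75: precision > recall, matching the paper's pattern
```

Hallucinating an extra drug would instead lower precision, so a consistent precision > recall gap is a signature of omissions rather than fabrications.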
Who should care
Builders
Healthcare AI developers should implement format-specific pipelines based on model size constraints. Clinical Narrative serialization provides optimal results for resource-constrained deployments using models under 8B parameters, while Raw JSON maximizes accuracy for large-scale implementations.
Enterprise
Healthcare organizations deploying LLM-assisted medication reconciliation systems need evidence-based serialization strategies to minimize patient safety risks. The research provides actionable guidelines for clinical informatics teams implementing FHIR-based AI workflows.
End users
Clinical pharmacists and healthcare providers benefit from understanding LLM limitations in polypharmacy cases. The study reveals systematic underperformance for patients with 7+ concurrent medications, requiring enhanced human oversight protocols.
Investors
Healthcare AI investment decisions should consider serialization optimization as a competitive differentiator. Companies implementing evidence-based format strategies may achieve significant performance advantages in clinical AI markets.
How to use these findings today
Healthcare AI teams can immediately implement optimized FHIR serialization strategies based on their model deployment constraints.
- Assess Model Size: Determine if your deployment uses models under or over 8B parameters
- Choose Format: Implement Clinical Narrative for smaller models, Raw JSON for 70B+ models
- Validate Performance: Test serialization impact on your specific patient population and use cases
- Monitor Omissions: Implement enhanced oversight for medication omission detection given precision-recall patterns
- Scale Considerations: Plan additional safeguards for polypharmacy patients with 7+ concurrent medications
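The first two steps above reduce to a few lines of selection logic. A sketch under stated assumptions: the 8B and 70B cutoffs mirror the model sizes the study actually tested, the range in between is untested, and the conservative fallback for that range is this author's assumption, not a finding.

```python
def pick_serialization(param_count_billions: float) -> str:
    """Choose a FHIR serialization strategy from model size.

    Based on the study's findings: Clinical Narrative for models up to 8B
    parameters, Raw JSON for 70B+. Sizes between 8B and 70B were not
    evaluated, so that branch is a conservative assumption; validate on
    your own patient population before relying on it.
    """
    if param_count_billions <= 8:
        return "clinical_narrative"
    if param_count_billions >= 70:
        return "raw_json"
    return "clinical_narrative"  # untested middle range: assumed default
```

For example, `pick_serialization(7)` returns `"clinical_narrative"` for a Mistral-7B-class deployment, while `pick_serialization(70)` returns `"raw_json"`.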
FHIR serialization strategies comparison
| Strategy | Best For | Advantages | Limitations |
|---|---|---|---|
| Clinical Narrative | Models ≤8B parameters | Natural language processing, up to +19 F1 points | Less structured, potential ambiguity |
| Raw JSON | Models ≥70B parameters | Structured precision, 0.9956 F1 score | Requires larger computational resources |
| Markdown Table | Moderate complexity cases | Visual organization, human readable | Limited performance gains shown |
| Chronological Timeline | Time-sensitive workflows | Temporal context preservation | May miss concurrent medication patterns |
Risks, limits, and myths
- Polypharmacy Limitation: All models plateau at 7-10 concurrent medications, systematically underserving high-risk patients
- Omission Bias: Models consistently miss active medications more than fabricating false ones, requiring targeted safety protocols
- Domain Pretraining Myth: BioMistral-7B produces zero usable output, showing that medical domain pretraining without instruction tuning is insufficient
- Synthetic Data Limits: Results based on 200 synthetic patients may not generalize to real clinical populations
- Hardware Requirements: Optimal performance requires significant computational resources (48 GB VRAM for 70B models)
- Format Dependency: Performance gains are model-size dependent, requiring careful deployment planning
FAQ
What FHIR serialization format works best for small LLMs in healthcare?
Clinical Narrative format performs best for models up to 8B parameters, providing up to 19 F1 points improvement over Raw JSON for medication reconciliation tasks [2].
Which LLM size performs best for medication reconciliation?
Llama-3.3-70B achieves the highest performance with 0.9956 mean F1 score using Raw JSON serialization format [2].
Do domain-pretrained medical LLMs work better for FHIR data?
BioMistral-7B produces zero usable output despite domain pretraining, showing that medical pretraining alone without instruction tuning is insufficient [2].
How many medications can LLMs handle simultaneously?
Smaller models plateau at roughly 7-10 concurrent active medications, leaving polypharmacy patients systematically underserved [2].
What hardware is needed to reproduce these FHIR LLM results?
The complete pipeline runs reproducibly on AWS g6e.xlarge instances with NVIDIA L40S GPUs and 48 GB VRAM [2].
Do LLMs miss medications or add fake ones more often?
All models show higher precision than recall, meaning omission is the dominant failure mode rather than fabrication [2].
How many patients were tested in this FHIR serialization study?
Researchers tested 5 open-weight LLMs across 4 serialization strategies on 200 synthetic patients, totaling 4,000 inference runs [3].
What statistical significance did the FHIR format comparison achieve?
The serialization strategy effect shows statistical significance with correlation coefficient r = 0.617 and p-value less than 10^-10 [2].
Can I run these FHIR LLM tests locally?
Yes, researchers ran 5 open-weight models locally via Ollama using Q4_K_M quantization on an L40S GPU [4].
Which models were compared in the FHIR serialization benchmark?
The study compared Phi-3.5-mini, Mistral-7B, BioMistral-7B, Llama-3.1-8B, and Llama-3.3-70B across four serialization strategies [3].
Glossary
- FHIR
- Fast Healthcare Interoperability Resources, a standard for exchanging healthcare information electronically
- Medication Reconciliation
- Process of comparing patient medication lists across care transitions to identify and resolve discrepancies
- Serialization
- Converting structured data into a format suitable for transmission or processing by computer systems
- F1 Score
- Harmonic mean of precision and recall, measuring model accuracy in classification tasks
- Polypharmacy
- Concurrent use of multiple medications by a patient, typically 5 or more drugs
- Clinical Handoff
- Transfer of patient care responsibility between healthcare providers or care settings
- Instruction Tuning
- Fine-tuning process that teaches language models to follow specific instructions and produce structured outputs
- Domain Pretraining
- Initial training of language models on specialized datasets from specific fields like medicine
Sources
1. Serialisation Strategy Matters: How FHIR Data Format Affects LLM Medication Reconciliation — https://arxiv.org/html/2604.21076
2. [2604.21076] Serialisation Strategy Matters: How FHIR Data Format Affects LLM Medication Reconciliation — https://arxiv.org/abs/2604.21076
3. Seeking arXiv cs.CL endorsement. FHIR medication reconciliation with LLMs – Research – Hugging Face Forums — https://discuss.huggingface.co/t/seeking-arxiv-cs-cl-endorsement-fhir-medication-reconciliation-with-llms/175459
4. Seeking arXiv cs.CL endorsement, local LLM clinical NLP benchmark (Ollama, 5 models) — https://community.deeplearning.ai/t/seeking-arxiv-cs-cl-endorsement-local-llm-clinical-nlp-benchmark-ollama-5-models/891721