What question did this study set out to answer?

This research aims to evaluate how retrieval quality impacts the effectiveness of a RAG system for diagnosing rare diseases.

April 15, 2026 | Open Access

Retrieval quality, not model capability, constrains RAG-assisted rare disease diagnosis: ablation and error analysis on 85 clinical vignettes

Key Points

Developed a RAG system integrating structured knowledge retrieval from Orphanet and PubMed.
Evaluated the system on 85 clinical vignettes, focusing on 70 ultra-rare disease cases.
Performed systematic ablation and error decomposition to identify key components affecting performance.
Achieved 54.3% top-1 diagnostic accuracy for ultra-rare diseases with RAG, compared to 38.6% with LLM-only (p=0.001).
HPO phenotype matching was the most valuable retrieval component (−5.0 pp when removed), while cross-encoder reranking trained on web data reduced accuracy (+5.0 pp when removed).
85% of diagnostic failures were attributed to retrieval issues rather than failures in the generation process.
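
The error decomposition in the last key point can be illustrated with a minimal sketch. The rule used here is an assumption, not the study's stated definition: a wrong prediction counts as a retrieval failure when the reference diagnosis never appeared in the retrieved context, and as a generation failure when it was retrieved but the LLM still chose something else. The `CaseResult` type, the function names, and the toy cases are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CaseResult:
    gold: str             # reference diagnosis for the vignette
    retrieved: list[str]  # disease names present in the retrieved context
    predicted: str        # top-1 diagnosis returned by the LLM


def classify(case: CaseResult) -> str:
    """Label one case as correct, retrieval_failure, or generation_failure."""
    if case.predicted == case.gold:
        return "correct"
    # Assumed rule: the model cannot be blamed for evidence it never saw.
    if case.gold not in case.retrieved:
        return "retrieval_failure"
    return "generation_failure"


def decompose(cases: list[CaseResult]) -> tuple[Counter, float]:
    """Count outcomes and return the share of failures due to retrieval."""
    counts = Counter(classify(c) for c in cases)
    failures = counts["retrieval_failure"] + counts["generation_failure"]
    share = counts["retrieval_failure"] / failures if failures else 0.0
    return counts, share


# Toy data with placeholder disease labels (illustrative only, not study data).
toy_cases = [
    CaseResult("disease_a", ["disease_a", "disease_b"], "disease_a"),
    CaseResult("disease_b", ["disease_a", "disease_c"], "disease_a"),
    CaseResult("disease_c", ["disease_c", "disease_d"], "disease_d"),
    CaseResult("disease_d", ["disease_a"], "disease_b"),
]

counts, retrieval_share = decompose(toy_cases)
print(counts, round(retrieval_share, 2))
```

Applied to per-case results, the same split shows whether effort is better spent on the retriever or on the generator.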

Abstract

Version 2.0: substantially revised from v1.0 (9 April 2026). Title reframed to emphasise ablation and error decomposition findings. Added 5 critical citations including RDguru (Yang 2025), Reese et al. (2026), Chimirri et al. (2025), Zhong et al. (2025), and DeepRare (Zhao et al. 2026, Nature). Reference numbering corrected. Comparison table added. CRediT author contributions added. Affiliation corrected to Independent Researcher.

This study develops and evaluates a retrieval-augmented generation (RAG) system for rare disease diagnosis, combining structured knowledge retrieval from Orphanet (4,293 diseases) and PubMed case reports (1,832 chunks) with LLM reasoning. The system is evaluated on 85 clinical vignettes including 70 ultra-rare disease cases. The primary contribution is a systematic ablation and error decomposition identifying which retrieval components drive performance and where failures occur.

Key Findings:
RAG achieved 54.3% top-1 diagnostic accuracy on ultra-rare diseases vs 38.6% for LLM-only (p=0.001), consistent with prior systems (RDguru, DeepRare).
HPO phenotype matching was the most valuable retrieval component (−5.0 pp when removed).
Cross-encoder reranking trained on web data was counterproductive (+5.0 pp when removed).
General-purpose embeddings outperformed biomedical-specific models (BioLORD-2023, BiomedBERT).
85% of diagnostic failures were retrieval failures, not generation failures.
LLM confidence calibration was poor: 93% of predictions were labelled high confidence regardless of correctness.

Implications: Retrieval architecture choices, particularly structured ontology matching, are the primary constraint on RAG-assisted rare disease diagnosis. Investment should prioritise retrieval quality over generation capability. These component-level findings complement existing end-to-end systems (RDguru, DeepRare) by providing design guidance for next-generation rare disease diagnostic tools.
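
The headline percentages can be sanity-checked with a short arithmetic sketch. It assumes only the 70-case ultra-rare set named in the abstract and searches for integer counts whose rounded percentage matches each reported figure. The helper names are hypothetical, and the sign convention for `delta_pp` (accuracy without a component minus accuracy with it) is an assumption chosen to match the "when removed" wording, with LLM-only treated as the system with retrieval removed entirely.

```python
def implied_counts(pct: float, n: int, decimals: int = 1) -> list[int]:
    """Return every integer k for which k/n, as a percentage, rounds to pct."""
    return [
        k for k in range(n + 1)
        if abs(round(100 * k / n, decimals) - pct) < 1e-9
    ]


def delta_pp(acc_without: float, acc_full: float) -> float:
    """Accuracy change in percentage points when a component is removed."""
    return round(acc_without - acc_full, 1)


N_ULTRA_RARE = 70  # ultra-rare cases reported in the abstract

rag_correct = implied_counts(54.3, N_ULTRA_RARE)  # RAG top-1 accuracy
llm_correct = implied_counts(38.6, N_ULTRA_RARE)  # LLM-only top-1 accuracy
gap = delta_pp(38.6, 54.3)                        # retrieval removed entirely

print(rag_correct, llm_correct, gap)
```

Under this convention a negative value means the removed component was helping, which is how the −5.0 pp for HPO matching and the +5.0 pp for cross-encoder reranking read.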
Contents: Main manuscript (PDF and DOCX), supplementary information (PDF with additional tables, per-case results, and prompt templates), and high-resolution main figures (5 PNG).

Changes from v1.0:
Title reframed around ablation/error decomposition (from generic "RAG improves...")
Abstract restructured to lead with ablation findings as primary contribution
5 critical citations added: RDguru, Reese et al., Chimirri et al., Zhong et al., DeepRare (Nature 2026)
Related Work section expanded with competitive landscape positioning
Comparison table (Table 1) added positioning this study vs prior systems
Discussion rewritten to acknowledge confirmatory nature of RAG-vs-LLM result
Reference numbering corrected to sequential first-appearance order (33 refs, up from 28)
Absolute case counts added to ablation results
Confidence calibration figure reconciled (93% on 70-case set; 90% on 40-case subset)
Affiliation corrected to Independent Researcher, Finley, NSW, Australia


Authors

Hayden Farquhar
