Version 2.0. Substantially revised from v1.0 (9 April 2026). Title reframed to emphasise ablation and error decomposition findings. Added 5 critical citations including RDguru (Yang 2025), Reese et al. (2026), Chimirri et al. (2025), Zhong et al. (2025), and DeepRare (Zhao et al. 2026, Nature). Reference numbering corrected. Comparison table added. CRediT author contributions. Affiliation corrected to Independent Researcher.

This study develops and evaluates a retrieval-augmented generation (RAG) system for rare disease diagnosis, combining structured knowledge retrieval from Orphanet (4,293 diseases) and PubMed case reports (1,832 chunks) with LLM reasoning. The system is evaluated on 85 clinical vignettes including 70 ultra-rare disease cases. The primary contribution is a systematic ablation and error decomposition identifying which retrieval components drive performance and where failures occur.

Key Findings:
- RAG achieved 54.3% top-1 diagnostic accuracy on ultra-rare diseases vs 38.6% for LLM-only (p=0.001), consistent with prior systems (RDguru, DeepRare)
- HPO phenotype matching was the most valuable retrieval component (−5.0 pp when removed)
- Cross-encoder reranking trained on web data was counterproductive (+5.0 pp when removed)
- General-purpose embeddings outperformed biomedical-specific models (BioLORD-2023, BiomedBERT)
- 85% of diagnostic failures were retrieval failures, not generation failures
- LLM confidence calibration was poor: 93% of predictions labelled high confidence regardless of correctness

Implications: Retrieval architecture choices, particularly structured ontology matching, are the primary constraint on RAG-assisted rare disease diagnosis. Investment should prioritise retrieval quality over generation capability. These component-level findings complement existing end-to-end systems (RDguru, DeepRare) by providing design guidance for next-generation rare disease diagnostic tools.
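The split between retrieval failures and generation failures can be illustrated with a minimal sketch. It assumes one common definition, which this summary does not spell out: a wrong prediction counts as a retrieval failure when the true diagnosis never appears among the retrieved candidates, and as a generation failure when it was retrieved but not ranked top-1. The `CaseResult` type, the function names, and the toy cases are all hypothetical and do not reproduce the study's data.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CaseResult:
    """One evaluated vignette (hypothetical structure, for illustration only)."""
    true_diagnosis: str
    retrieved: List[str]  # disease names surfaced by the retrieval stage
    predicted: str        # top-1 diagnosis returned by the LLM


def classify(case: CaseResult) -> str:
    """Label a case as correct, retrieval_failure, or generation_failure."""
    if case.predicted == case.true_diagnosis:
        return "correct"
    # Assumed definition: the answer was never retrieved, so the LLM could not use it.
    if case.true_diagnosis not in case.retrieved:
        return "retrieval_failure"
    # The answer was available in context but the LLM chose something else.
    return "generation_failure"


def decompose(cases: List[CaseResult]) -> Tuple[Counter, float]:
    """Return per-label counts and the share of failures attributable to retrieval."""
    counts = Counter(classify(c) for c in cases)
    failures = counts["retrieval_failure"] + counts["generation_failure"]
    share = counts["retrieval_failure"] / failures if failures else 0.0
    return counts, share


# Toy cases with placeholder disease names (not taken from the study).
toy_cases = [
    CaseResult("disease_A", ["disease_A", "disease_B"], "disease_A"),
    CaseResult("disease_C", ["disease_B", "disease_D"], "disease_B"),
    CaseResult("disease_E", ["disease_F"], "disease_F"),
    CaseResult("disease_G", ["disease_G", "disease_H"], "disease_H"),
]
toy_counts, toy_share = decompose(toy_cases)
print(dict(toy_counts), round(toy_share, 2))
```

Under this definition, a high retrieval-failure share points at the retrieval stage rather than the LLM, which is the reading the Implications paragraph draws from the reported 85% figure.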
Contents: Main manuscript (PDF and DOCX), supplementary information (PDF with additional tables, per-case results, and prompt templates), and high-resolution main figures (5 PNG).

Changes from v1.0:
- Title reframed around ablation/error decomposition (from generic "RAG improves...")
- Abstract restructured to lead with ablation findings as primary contribution
- 5 critical citations added: RDguru, Reese et al., Chimirri et al., Zhong et al., DeepRare (Nature 2026)
- Related Work section expanded with competitive landscape positioning
- Comparison table (Table 1) added positioning this study vs prior systems
- Discussion rewritten to acknowledge confirmatory nature of RAG-vs-LLM result
- Reference numbering corrected to sequential first-appearance order (33 refs, up from 28)
- Absolute case counts added to ablation results
- Confidence calibration figure reconciled (93% on 70-case set; 90% on 40-case subset)
- Affiliation corrected to Independent Researcher, Finley, NSW, Australia
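The reported percentages can be sanity-checked against whole-case counts with simple arithmetic. The sketch below assumes the two accuracy figures are both measured on the 70 ultra-rare cases and that the calibration figures use the 70-case and 40-case denominators stated above. The implied counts are inferred from the percentages, not quoted from the manuscript, and the helper names are hypothetical.

```python
def implied_count(percent: float, n_cases: int) -> int:
    """Nearest whole number of cases corresponding to a reported percentage."""
    return round(percent / 100 * n_cases)


def rounds_back(count: int, n_cases: int, percent: float, digits: int) -> bool:
    """Check that count / n_cases reproduces the reported percentage after rounding."""
    return round(100 * count / n_cases, digits) == percent


# Top-1 accuracy, assumed to be on the 70 ultra-rare cases.
rag_correct = implied_count(54.3, 70)
llm_correct = implied_count(38.6, 70)
gain_pp = round(54.3 - 38.6, 1)  # difference in percentage points

# High-confidence labels, on the two denominators named in the changelog.
high_conf_70 = implied_count(93, 70)
high_conf_40 = implied_count(90, 40)

print(rag_correct, llm_correct, gain_pp, high_conf_70, high_conf_40)
print(rounds_back(rag_correct, 70, 54.3, 1), rounds_back(llm_correct, 70, 38.6, 1))
```

If a percentage does not round back to a whole count on its stated denominator, that usually signals a different evaluation subset, which is the kind of discrepancy the reconciled calibration figure addresses.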
Hayden Farquhar
Hayden Farquhar studied this question.
www.synapsesocial.com/papers/69df2c2fe4eeef8a2a6b145e — DOI: https://doi.org/10.5281/zenodo.19548279