What question did this study set out to answer?

The aim is to improve the accuracy and relevance of medical question answering using a hybrid retrieval-augmented generation framework.

February 2, 2026 | Open Access

Enhancing Medical Question Answering with LLMs via a Hybrid Retrieval-Augmented Generation Framework

Key Points

Examined a modular retrieval-augmented generation (RAG) framework with a hybrid retrieval strategy.
Combined sparse retrieval (BM25) with dense retrieval (MedCPT).
Evaluated on three benchmark healthcare datasets: PubMedQA, MedMCQA, and MedQA-US.
Assessed retrieval with context precision, context recall, and F1-score, and generation with BERTScore and RAG Assessment Score.
The hybrid retriever achieved 92.14% recall, 74.36% precision, and an F1-score of 82.30%.
GPT-4o with hybrid retrieval reached 89.4% faithfulness, 82.7% answer relevancy, and a BERTScore F1 (F1BERT) of 88.0% on PubMedQA.
Hybrid retrieval within a modular architecture substantially improved retrieval effectiveness and response quality.
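
The reported retrieval figures can be cross-checked against each other. The sketch below assumes the F1-score is the standard harmonic mean of precision and recall; the page does not state the formula, and the function name `f1_score` is illustrative only.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported for the hybrid retriever, in percent.
reported_precision = 74.36
reported_recall = 92.14

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.2f}%")
```

Under that assumption, the result agrees with the reported F1-score of 82.30% to two decimal places.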

Abstract

Given the knowledge-intensive and rapidly expanding nature of the medical field, accurately synthesizing and interpreting findings remains a major challenge for clinicians and medical students. Although Large Language Models (LLMs) have advanced automated summarization and response generation, their deployment is limited by hallucinations, outdated knowledge, and insufficient domain adaptation. Retrieval-Augmented Generation (RAG) addresses these issues by grounding LLMs in external knowledge bases. However, as the document corpus scales, maintaining RAG accuracy becomes increasingly difficult, making retrievers critical for contextual relevance. In this paper, we examined the efficiency of a modular RAG framework with a hybrid retrieval strategy that combines sparse retrieval (BM25) and dense retrieval (MedCPT) to extract the most relevant documents from the corpus, thereby providing contextual grounding for the LLM to improve medical responses. Evaluation was conducted on three benchmark healthcare datasets (PubMedQA, MedMCQA, and MedQA-US) using two LLMs, GPT-4o and BioGPT. Performance was assessed using retrieval metrics (context precision, context recall, F1-score) and generation metrics (BERTScore, RAG Assessment Score). The hybrid retriever achieved 92.14% recall, 74.36% precision, and an F1-score of 82.30%. GPT-4o with hybrid retrieval reached 89.4% faithfulness, 82.7% answer relevancy, and a BERTScore F1 (F1BERT) of 88.0% on PubMedQA. Results demonstrated that hybrid retrieval within a modular architecture substantially improves retrieval effectiveness and response quality. The proposed work offers a scalable, generalizable solution for high-stakes healthcare applications, supporting flexible retriever integration and robust evaluation to advance transparent QA systems.
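
The abstract does not say how the sparse and dense results are merged. As an illustration only, the sketch below uses reciprocal rank fusion (RRF), one common way to combine two ranked lists; it is not confirmed as the authors' method. The document IDs, the constant `k=60`, and the function name `reciprocal_rank_fusion` are all hypothetical, and the two input rankings stand in for the output of a BM25 retriever and a MedCPT retriever.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=3):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document receives 1 / (k + rank) from every list it appears in;
    documents ranked highly by both retrievers accumulate the largest score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by ID for determinism.
    fused = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return fused[:top_n]


# Hypothetical rankings standing in for sparse (BM25) and dense (MedCPT) output.
sparse_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_c", "doc_a", "doc_d"]

top_docs = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(top_docs)
```

In this toy case, the documents returned by both retrievers (`doc_a` and `doc_c`) rank ahead of those returned by only one, which is the behavior a hybrid strategy aims for. The fused list would then be passed to the LLM as grounding context.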

Authors

Bushra Aljohani

Tawfeeq Alsanoosy
