What type of study is this?

This is a quantitative study.

October 9, 2025 · Open Access

Automating Evaluation of LLM-generated Responses to Patient Questions about Rare Diseases

Key points

LLM-based evaluation showed stronger alignment with physician scores (up to ρ=0.758) than traditional NLP metrics.
Reference-guided evaluation with GPT-4 achieved the highest correlation with physician-assigned scores (ρ=0.758).
Traditional NLP metrics showed weak to moderate correlations (ρ=0.240-0.403) with physician scores, indicating their limitations in assessing response accuracy.
Reference-guided scoring methods demonstrate a scalable approach for evaluating LLM-generated responses about rare diseases.

Abstract

Objectives: Patients with rare diseases often struggle to find accurate medical information, and large language model (LLM)-based chatbots may help meet this need. However, evaluating LLM-generated free-text answers typically requires physician review, which is time-consuming and difficult to scale. This study compared traditional natural language processing (NLP) metrics with emerging LLM-based evaluation approaches for assessing answer quality in the context of Complex Lymphatic Anomalies (CLAs).

Materials and Methods: We compiled 25 common patient questions about CLAs and generated 175 responses to these questions from seven LLMs. Three expert physicians scored these responses for accuracy. We compared these physician-assigned scores with automated scores generated by four NLP sentence similarity metrics (BLEU, ROUGE, METEOR, BERTScore) and six LLM evaluators (GPT-4, GPT-4o, Qwen3-32B, DeepSeek-R1-14B, Gemma3-27B, LLaMA3.3-70B). We examined LLM-based scoring both with and without reference answers (reference-guided vs. reference-free). We calculated Spearman, Phi, and Kendall's Tau correlation coefficients to assess alignment between automated and physician-assigned scores.

Results: LLM-based evaluation demonstrated stronger alignment with physician-assigned scores than NLP metrics. The reference-guided GPT-4 evaluator achieved the highest correlation with physician-assigned scores (ρ=0.758), followed by GPT-4o (ρ=0.727). NLP metrics showed weak to moderate correlations with physician-assigned scores (ρ=0.240-0.403). Reference-guided scoring outperformed reference-free methods.

Discussion: Reference-guided LLM-based evaluation methods approximate expert physicians' judgment better than traditional NLP metrics, offering an effective, scalable approach for assessing LLM-generated responses to patient questions about rare diseases.

Conclusion: LLM-based evaluation, particularly reference-guided scoring with GPT models, can support the scalable development and evaluation of LLM-based rare disease-specific chatbot systems.
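The alignment analysis described in the methods can be illustrated with a small, dependency-free sketch of Spearman's ρ and Kendall's τ-b, both computed with tie handling because ordinal accuracy scores repeat often. The abstract does not say which software the authors used, so the function names, the choice of τ-b, and the toy scores below are assumptions for illustration only, not data or code from the study. The Phi coefficient is left out because the abstract does not state how scores were dichotomized.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b, which corrects the denominator for tied pairs."""
    concordant = discordant = ties_x = ties_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied on both variables: excluded from both factors
        elif dx == 0:
            ties_x += 1  # tied only on x
        elif dy == 0:
            ties_y += 1  # tied only on y
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / sqrt((untied + ties_x) * (untied + ties_y))


# Hypothetical scores for five responses (NOT values from the study).
physician_scores = [1, 2, 3, 4, 5]
evaluator_scores = [2, 1, 4, 3, 5]

print(f"Spearman rho:  {spearman_rho(physician_scores, evaluator_scores):.3f}")   # 0.800
print(f"Kendall tau-b: {kendall_tau_b(physician_scores, evaluator_scores):.3f}")  # 0.600
```

In practice, `scipy.stats.spearmanr` and `scipy.stats.kendalltau` provide the same statistics along with p-values; the hand-written versions here only make the rank and pair-counting steps explicit.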

Authors

Min Zhao

Inez Y. Oh

Aditi Gupta
