Objectives Patients with rare diseases often struggle to find accurate medical information, and large language model (LLM)-based chatbots may help meet this need. However, evaluating LLM-generated free-text answers typically requires physician review, which is time-consuming and difficult to scale. This study compared traditional natural language processing (NLP) metrics to emerging LLM-based evaluation approaches for assessing answer quality in the context of Complex Lymphatic Anomalies (CLAs).
Materials and Methods We compiled 25 common patients' questions about CLAs and generated 175 responses to these questions from seven LLMs. Three expert physicians scored these responses for accuracy. We compared these physician-assigned scores with automated scores, generated by four NLP sentence similarity metrics (BLEU, ROUGE, METEOR, BERTScore) and six LLM evaluators (GPT-4, GPT-4o, Qwen3-32B, DeepSeek-R1-14B, Gemma3-27B, LLaMA3.3-70B). We examined both LLM-based scoring with and without reference answers (reference-guided vs. reference-free). We calculated Spearman, Phi, and Kendall's Tau correlation coefficients to assess alignment between automated and physician-assigned scores.
Results LLM-based evaluation demonstrated stronger alignment with physician-assigned scores than NLP metrics. The reference-guided GPT-4 evaluator achieved the highest correlation with physician-assigned scores (ρ=0.758), followed by GPT-4o (ρ=0.727). NLP metrics showed weak to moderate correlations with physician-assigned scores (ρ=0.240-0.403). Reference-guided scoring outperformed reference-free methods.
Discussion Reference-guided LLM-based evaluation methods approximate expert physicians' judgment better than traditional NLP metrics, offering an effective, scalable approach for assessing LLM-generated responses to patient questions about rare disease.
Conclusion LLM-based evaluation, particularly reference-guided scoring with GPT models, can support the scalable development and evaluation of LLM-based rare disease-specific chatbot systems.
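As a rough illustration of the alignment analysis described in the methods, the sketch below computes Spearman's ρ and Kendall's tau-b between two lists of scores using only the Python standard library. It is not the authors' code: the function names and the example scores are hypothetical, and the study's actual scoring scale, tie handling, and software are not stated in the abstract.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b, which corrects for ties in either variable."""
    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: counted nowhere
            if dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    denom = sqrt((concordant + discordant + ties_x_only)
                 * (concordant + discordant + ties_y_only))
    return (concordant - discordant) / denom


# Hypothetical scores for six responses (illustrative only, not study data).
physician_scores = [5, 4, 4, 2, 3, 1]
evaluator_scores = [5, 5, 3, 2, 3, 1]

print("Spearman rho:", round(spearman_rho(physician_scores, evaluator_scores), 3))
print("Kendall tau-b:", round(kendall_tau_b(physician_scores, evaluator_scores), 3))
```

Both statistics are rank-based, so they measure whether an automated evaluator orders responses the way physicians do rather than whether the raw scores match. With more than a handful of responses, a library routine such as those in SciPy would normally replace the quadratic pair loop.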
Min Zhao
Inez Y. Oh
Aditi Gupta
Zhao et al. studied this question.
www.synapsesocial.com/papers/68e70da790569dd607ee59fc — DOI: https://doi.org/10.1101/2025.10.06.25337181