What question did this study set out to answer?

The research aims to evaluate the incidence of hallucinations in biomedical Large Language Models using cardiac-related texts.

March 12, 2026 · Open Access

Hallucination Detection in Biomedical LLMs

Key Points

The study systematically evaluates open-source biomedical LLMs on PubMed texts, with a focus on cardiology.
The models assessed include multiple variants of BioMistral and MEDITRON, judged on the factual consistency of their generated summaries.
Three automatic evaluation metrics are used: HHEM, AlignScore, and MiniCheck.
Hallucination rates vary significantly across models.
Using multiple detectors together proved effective for triangulating factual accuracy.
The findings underline the need for rigorous hallucination mitigation strategies when deploying biomedical LLMs.

Abstract

Large Language Models (LLMs) have shown remarkable capabilities in natural language generation tasks, including summarization. However, their tendency to produce hallucinations (plausible-sounding but factually incorrect or unsupported content) poses serious risks, especially in high-stakes fields like biomedicine. This thesis presents a systematic evaluation of hallucination in domain-specific LLMs using biomedical texts sourced from PubMed, with a focus on the cardiology domain. A diverse set of open-source biomedical models, including multiple variants of BioMistral and MEDITRON, is assessed for factual consistency in generated summaries. To evaluate hallucinations, three automatic metrics are employed: Vectara's Hallucination Evaluation Model (HHEM), AlignScore, and MiniCheck. The results demonstrate significant variability in hallucination propensity across models and highlight the effectiveness of using multiple detectors to triangulate factual accuracy. Our analysis offers a deeper understanding of the trade-offs in biomedical LLMs and emphasizes the need for rigorous hallucination mitigation strategies in their deployment. This work contributes a structured benchmarking methodology and empirical insights to support safer and more reliable use of LLMs in biomedical contexts.
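
The idea of triangulating factual accuracy with several detectors can be illustrated with a small sketch. It is not the thesis's implementation: it assumes each detector's output has already been normalized to a score between 0 and 1, where higher means the summary is better supported by its source, and it combines the three scores by majority vote. The function names, the 0.5 threshold, and the example scores are hypothetical placeholders, not values or results from the study.

```python
from statistics import mean

# Hypothetical cut-off: a score below it counts as one "hallucination" vote.
THRESHOLD = 0.5


def triangulate(scores, threshold=THRESHOLD):
    """Combine per-detector consistency scores for a single summary.

    `scores` maps a detector name to a score in [0, 1]
    (assumed convention: 1 = fully supported by the source).
    """
    votes = {name: score < threshold for name, score in scores.items()}
    flagged = sum(votes.values())
    return {
        "mean_score": mean(scores.values()),
        "votes": votes,
        # Flag the summary only when a strict majority of detectors agree.
        "hallucinated": flagged * 2 > len(scores),
    }


def hallucination_rate(per_summary_scores, threshold=THRESHOLD):
    """Fraction of summaries flagged as hallucinated by majority vote."""
    if not per_summary_scores:
        return 0.0
    flagged = sum(
        triangulate(scores, threshold)["hallucinated"]
        for scores in per_summary_scores
    )
    return flagged / len(per_summary_scores)


# Placeholder scores for illustration only, not results from the study.
example = [
    {"HHEM": 0.91, "AlignScore": 0.84, "MiniCheck": 0.77},
    {"HHEM": 0.32, "AlignScore": 0.45, "MiniCheck": 0.61},
]
print(hallucination_rate(example))
```

Majority voting is only one possible aggregation rule. Averaging the scores or requiring unanimous agreement would trade sensitivity against false alarms differently, and the abstract does not state which rule, if any, the thesis applies.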

Authors

Χρήστος Γ. Βράνης
