Large Language Models (LLMs) have shown remarkable capabilities in natural language generation tasks, including summarization. However, their tendency to hallucinate, that is, to produce plausible-sounding but factually incorrect or unsupported content, poses serious risks, especially in high-stakes fields like biomedicine. This thesis presents a systematic evaluation of hallucination in domain-specific LLMs using biomedical texts sourced from PubMed, with a focus on the cardiology domain. A diverse set of open-source biomedical models, including multiple variants of BioMistral and MEDITRON, is assessed for the factual consistency of the summaries they generate. Three automatic metrics are used to detect hallucinations: Vectara's Hallucination Evaluation Model (HHEM), AlignScore, and MiniCheck. The results demonstrate significant variability in hallucination propensity across models and highlight the effectiveness of using multiple detectors to triangulate factual accuracy. The analysis offers a deeper understanding of the trade-offs in biomedical LLMs and emphasizes the need for rigorous hallucination mitigation strategies in their deployment. This work contributes a structured benchmarking methodology and empirical insights to support safer and more reliable use of LLMs in biomedical contexts.
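As an illustration of what triangulating across several detectors could look like, the sketch below combines per-summary scores from the three metrics by majority vote. It is a minimal sketch, not the thesis's procedure: the score values are placeholders, the 0.5 threshold and the `triangulate` helper are hypothetical, and it assumes each detector reports a consistency score in [0, 1] where higher means better supported by the source.

```python
from statistics import mean

# Placeholder consistency scores in [0, 1]; higher = better supported.
# These numbers are illustrative only and are NOT results from the thesis.
scores = {
    "summary_a": {"HHEM": 0.91, "AlignScore": 0.84, "MiniCheck": 0.88},
    "summary_b": {"HHEM": 0.35, "AlignScore": 0.62, "MiniCheck": 0.41},
}

THRESHOLD = 0.5  # assumed cut-off, not taken from the thesis


def triangulate(detector_scores, threshold=THRESHOLD):
    """Combine detector scores by majority vote (hypothetical helper)."""
    # Count how many detectors score the summary below the threshold.
    flags = sum(score < threshold for score in detector_scores.values())
    return {
        "mean_score": mean(detector_scores.values()),
        "flags": flags,
        # Flag the summary when more than half of the detectors agree.
        "hallucinated": flags * 2 > len(detector_scores),
    }


report = {name: triangulate(s) for name, s in scores.items()}
for name, result in report.items():
    print(name, result)
```

Majority voting is only one possible aggregation; averaging the scores or requiring unanimous agreement would trade sensitivity against false alarms differently.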
Χρήστος Γ. Βράνης
www.synapsesocial.com/papers/69b25aea96eeacc4fcec927b — DOI: https://doi.org/10.26262/heal.auth.ir.370325