Do different large language models vary in the quality, accuracy, and safety of treatment plans generated for stage two hypertension?
12 Large Language Models (ChatGPT-4o, Claude, ClinicalKey AI, Copilot, DeepSeek-V3, Dyna AI, Google Gemini, Grok, Meta AI, OpenEvidence, Perplexity, and Pi)
Prompting to generate a treatment plan for stage two hypertension
Comparison among the 12 different Large Language Models
Composite score based on adherence to clinical guidelines, detail/clarity, and reliability/safety (scored by 5 blinded reviewers)
While LLMs generally provide detailed and guideline-adherent management plans for stage two hypertension, there is significant variability in quality, highlighting the need for regulation regarding reliability and safety.
Background: The use of large language models (LLMs) in clinical practice, in medical education, and by patients is increasing rapidly. It is essential to ensure that the information provided by these artificial intelligence (AI) chatbots is accurate and safe. Our goal was to analyze and compare hypertension treatment plans generated by popular LLMs and to identify their strengths and limitations.
Methods: ChatGPT-4o, Claude, ClinicalKey AI, Copilot (Microsoft), DeepSeek-V3, Dyna AI, Google Gemini, Grok, Meta AI, OpenEvidence, Perplexity, and Pi were prompted to generate a treatment plan for stage two hypertension. Five blinded reviewers scored each response in three domains: adherence to clinical guidelines, detail/clarity, and reliability/safety (sources provided/emphasis on seeing a healthcare professional). Mean scores for each domain were calculated and summed to give a composite score. The responses were also analyzed qualitatively by three reviewers.
Results: Perplexity received the highest composite score (8.2 out of 9), followed by OpenEvidence (7.7 out of 9). Dyna AI had the lowest composite score (3.7 out of 9), followed by Pi (4.8 out of 9), ClinicalKey AI (4.9 out of 9), and Meta AI (4.9 out of 9). Perplexity (3 out of 3), Grok (2.8 out of 3), and OpenEvidence (2.7 out of 3) had the highest scores for detail/clarity, while Dyna AI had the lowest scores for both detail/clarity (1 out of 3) and reliability/safety (1 out of 3). ChatGPT-4o had the highest score for adherence to guidelines (2.7 out of 3), while Pi had the lowest (1.5 out of 3). Analysis of variance (ANOVA) showed statistically significant differences among the LLMs in every domain and in the composite scores (p=0.00125 for adherence to guidelines, p<0.00001 for detail/clarity, p=0.00003 for reliability/safety, p<0.00001 for composite scores). Qualitatively, the LLMs tended to adhere to guidelines and provide sufficiently detailed management plans, but often did not provide sources and/or advise users to see a healthcare professional.
Conclusions: The LLMs nearly always provided detailed management plans for stage two hypertension that adhered to clinical guidelines. However, quality varied significantly across chatbots. Notably, medicine-specific LLMs were not necessarily superior to LLMs used by the general public. AI chatbots may require greater regulation to ensure that they advise users to see a healthcare professional and provide reliable sources.
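The scoring arithmetic described in the Methods (per-domain means summed into a composite, then a one-way ANOVA across LLMs) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' analysis code: the model names, the reviewer scores, the 1 to 3 scale, and the helper names `domain_means`, `composite_score`, and `one_way_anova_f` are all hypothetical, and the abstract does not say exactly which observations were entered into the ANOVA.

```python
from statistics import mean

# The three scoring domains named in the abstract (identifier spellings are ours).
DOMAINS = ("adherence", "detail_clarity", "reliability_safety")

# HYPOTHETICAL data, not the study's: five reviewer scores per domain,
# on an assumed 1-3 scale, for two made-up models.
scores = {
    "model_a": {
        "adherence": [3, 3, 2, 3, 3],
        "detail_clarity": [3, 3, 3, 3, 3],
        "reliability_safety": [3, 2, 3, 3, 2],
    },
    "model_b": {
        "adherence": [2, 2, 1, 2, 2],
        "detail_clarity": [1, 1, 1, 1, 1],
        "reliability_safety": [1, 1, 1, 1, 1],
    },
}


def domain_means(model_scores):
    """Mean reviewer score in each domain for one model."""
    return {d: mean(model_scores[d]) for d in DOMAINS}


def composite_score(model_scores):
    """Sum of the three domain means (maximum 9 if each domain tops out at 3)."""
    return sum(domain_means(model_scores).values())


def one_way_anova_f(groups):
    """F statistic of a one-way ANOVA; `groups` is a list of lists of scores.

    Assumes at least two groups and some within-group variation
    (otherwise the within-group mean square is zero).
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within


if __name__ == "__main__":
    for name, model_scores in scores.items():
        print(name, round(composite_score(model_scores), 2))
    # Compare the hypothetical models on one domain (one group per model).
    adherence_groups = [s["adherence"] for s in scores.values()]
    print("F (adherence):", round(one_way_anova_f(adherence_groups), 2))
```

The sketch returns only the F statistic. A p-value additionally requires the F distribution; if SciPy is available, `scipy.stats.f_oneway` returns both the statistic and the p-value for the same kind of grouped input.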
Tai Metzger
Kody Park
Minh Stephenson Vu
Circulation
Oakland University
Metzger et al. studied this question.
www.synapsesocial.com/papers/69fadad703f892aec9b1e7bc — DOI: https://doi.org/10.1161/cir.153.suppl_1.we457