What question did this study set out to answer?

This research aims to evaluate and compare the effectiveness of treatment plans for hypertension generated by several large language models.

May 6, 2026

Abstract WE457: Comparative Analysis of Hypertension Management Plans Generated by Large Language Models

Key Points

This research aims to evaluate and compare the effectiveness of treatment plans for hypertension generated by several large language models.
Twelve large language models were prompted to create treatment plans for stage two hypertension.

Structured PICO

Do different large language models vary in the quality, accuracy, and safety of treatment plans generated for stage two hypertension?

Population

12 Large Language Models (ChatGPT-4o, Claude, ClinicalKey AI, Copilot, DeepSeek-V3, Dyna AI, Google Gemini, Grok, Meta AI, OpenEvidence, Perplexity, and Pi)

Intervention

Prompting to generate a treatment plan for stage two hypertension

Comparator

Comparison among the 12 different Large Language Models

Outcome

Composite score based on adherence to clinical guidelines, detail/clarity, and reliability/safety (scored by 5 blinded reviewers)
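The composite outcome can be illustrated with a minimal sketch, assuming each reviewer gives one score per domain, the domain score is the mean across reviewers, and the composite is the sum of the three domain means (maximum 9). The function name, the domain keys, and the ratings below are hypothetical and are not data from the study.

```python
from statistics import mean

# Hypothetical keys for the three scoring domains named in the outcome.
DOMAINS = ("guideline_adherence", "detail_clarity", "reliability_safety")

def composite_score(ratings):
    """Average reviewer scores within each domain, then sum the domain means."""
    domain_means = {d: mean(ratings[d]) for d in DOMAINS}
    return domain_means, sum(domain_means.values())

# Made-up ratings from five reviewers for one model (illustration only).
example_ratings = {
    "guideline_adherence": [2, 2, 3, 2, 2],
    "detail_clarity": [3, 2, 2, 3, 2],
    "reliability_safety": [1, 2, 1, 2, 1],
}

domain_means, total = composite_score(example_ratings)
print(domain_means, round(total, 1))
```

Summing unweighted domain means treats the three domains as equally important, which matches the "out of 9" composite reported in the abstract.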

While LLMs generally provide detailed and guideline-adherent management plans for stage two hypertension, there is significant variability in quality, highlighting the need for regulation regarding reliability and safety.

Abstract

Background: The use of large language models (LLMs) in clinical practice, medical education, and by patients is increasing rapidly. It is essential to ensure that information provided by these artificial intelligence (AI) chatbots is accurate and safe. Our goal was to analyze and compare hypertension treatment plans generated by popular LLMs and identify their strengths and limitations.

Methods: ChatGPT-4o, Claude, ClinicalKey AI, Copilot (Microsoft), DeepSeek-V3, Dyna AI, Google Gemini, Grok, Meta AI, OpenEvidence, Perplexity, and Pi were prompted to generate a treatment plan for stage two hypertension. Five blinded reviewers scored each response in three domains: adherence to clinical guidelines, detail/clarity, and reliability/safety (sources provided/emphasis on seeing a healthcare professional). Mean scores for each domain were calculated and summed for a composite score. The responses were also analyzed qualitatively by three reviewers.

Results: Perplexity received the highest composite score (8.2 out of 9), followed by OpenEvidence (7.7 out of 9). Dyna AI had the lowest overall score (3.7 out of 9), followed by Pi (4.8 out of 9), ClinicalKey AI (4.9 out of 9), and Meta AI (4.9 out of 9). Perplexity (3 out of 3), Grok (2.8 out of 3), and OpenEvidence (2.7 out of 3) had the highest scores for detailed and clear responses, while Dyna AI had the lowest for both detail/clarity (1 out of 3) and reliability/safety (1 out of 3). ChatGPT-4o had the highest score for adherence to guidelines (2.7 out of 3), while Pi had the lowest (1.5 out of 3). Analysis of variance (ANOVA) showed statistically significant differences across every subscore domain and the composite scores (p=0.00125 for adherence to guidelines, p<0.00001 for detail/clarity, p=0.00003 for sources/seeing a professional, p<0.00001 for composite scores). Qualitatively, the LLMs tended to adhere to guidelines and provide sufficiently detailed management plans, but often did not provide sources and/or advise users to see a healthcare professional.

Conclusions: The LLMs nearly always provided detailed management plans for stage two hypertension that adhered to clinical guidelines. However, quality varied significantly across chatbots. Notably, medicine-specific LLMs were not necessarily superior to LLMs used by the general public. AI chatbots may require greater regulation to ensure that they advise users to see a healthcare professional and provide reliable sources.
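The ANOVA comparison reported in the Results can be sketched as a one-way F test across models. This is an illustration under assumptions: the abstract does not state the ANOVA design, so the sketch treats each reviewer's score as an independent observation within a model's group. The function name and the scores are hypothetical and are not the study's data.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of score lists."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    k, n = len(groups), len(all_scores)
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Made-up scores: three models, five reviewers each (illustration only).
scores_by_model = [
    [3, 3, 2, 3, 3],
    [2, 2, 1, 2, 2],
    [1, 1, 2, 1, 1],
]

f_stat, df_between, df_within = one_way_anova_f(scores_by_model)
print(f"F({df_between}, {df_within}) = {f_stat:.2f}")
```

The p-value is the upper tail of the F distribution with these degrees of freedom. In practice a statistics library would typically be used for this step, for example `scipy.stats.f_oneway`, which returns both the F statistic and the p-value.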

Authors

Tai Metzger

Kody Park

Minh Stephenson Vu

Journals

Circulation

Institutions

Oakland University
