Introduction
Visceral aneurysms pose diagnostic and therapeutic challenges in vascular surgery. Large language models (LLMs) may assist in clinical decision-making, but their application requires rigorous validation. Traditional validation methods are labor-intensive and difficult to scale.

Objective
We examined the capability of an LLM in managing visceral aneurysms and explored an automated framework for validating AI-generated clinical responses.

Methods
Using Python with the Pandas library and the OpenAI API, we probed the Society for Vascular Surgery (SVS) clinical practice guidelines on visceral aneurysm management. ChatGPT-4o-mini was instructed to review guideline recommendations, generate clinical scenarios, propose management strategies, and evaluate its own responses using a four-tier rubric (1 = completely correct; 2 = partially correct; 3 = partially incorrect; 4 = no correct information). Human evaluators independently assessed the same responses, graded each question as good, fair, or poor, and noted whether it was leading.

Results
Eighty visceral aneurysm scenarios were generated and evaluated. ChatGPT-4o-mini self-assessed 89% of responses as correct (scores 1-2), compared to 67% by human evaluators (chi-square, P < 0.0001), with the greatest discrepancy in the partially correct category. Most AI-generated questions were of good quality (56%), though 44% were considered leading.

Conclusion
An automated validation framework for AI-generated clinical responses is feasible. However, the 67% correctness rate and systematic AI self-overestimation indicate that current LLMs remain unsuitable for independent clinical use, reinforcing the need for expert oversight. The integration of Python-driven automation, structured AI inference, and expert review holds promise for increasing the efficiency of evaluating LLMs at scale across clinical domains.
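The comparison reported in the Results (share of responses scored 1-2 by the LLM versus by human evaluators, tested with chi-square) can be illustrated with a minimal Python sketch. The abstract does not state how the contingency table was laid out, so this example assumes a 2x2 table (rater by correct/incorrect) and a Pearson chi-square without continuity correction. The function names `correct_rate` and `chi_square_2x2` and the score lists are hypothetical and are not the authors' code or data.

```python
import math

# Rubric from the abstract: 1 = completely correct, 2 = partially correct,
# 3 = partially incorrect, 4 = no correct information.
# Scores 1-2 are counted as "correct", as in the reported percentages.
CORRECT_SCORES = {1, 2}


def correct_rate(scores):
    """Fraction of responses scored 1 or 2 on the four-tier rubric."""
    return sum(s in CORRECT_SCORES for s in scores) / len(scores)


def chi_square_2x2(scores_a, scores_b):
    """Pearson chi-square (no continuity correction) on a 2x2 table of
    rater (a, b) by outcome (correct, incorrect).

    Returns (statistic, p_value). Assumes every expected cell count is
    non-zero, i.e. both outcomes occur at least once across the raters.
    """
    table = []
    for scores in (scores_a, scores_b):
        n_correct = sum(s in CORRECT_SCORES for s in scores)
        table.append([n_correct, len(scores) - n_correct])

    total = sum(sum(row) for row in table)
    col_totals = [table[0][j] + table[1][j] for j in range(2)]

    stat = 0.0
    for row in table:
        row_total = sum(row)
        for j, observed in enumerate(row):
            expected = row_total * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected

    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical scores for illustration only; NOT the study's data.
llm_scores = [1, 1, 2, 1, 2, 1, 1, 3, 1, 2]
human_scores = [1, 3, 2, 4, 3, 1, 2, 3, 1, 3]

if __name__ == "__main__":
    stat, p = chi_square_2x2(llm_scores, human_scores)
    print(f"LLM self-assessed correct:  {correct_rate(llm_scores):.0%}")
    print(f"Human-assessed correct:     {correct_rate(human_scores):.0%}")
    print(f"chi-square = {stat:.3f}, p = {p:.4f}")
```

Collapsing the rubric to correct/incorrect mirrors the headline percentages but hides where raters disagree. A test on the full four-category distribution would be needed to reproduce the observation that the largest discrepancy lies in the partially correct category.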
Tamir Bresler
Sari Lada
Tyler Wilson
The American Surgeon
Los Robles Hospital & Medical Center
Bresler et al. studied this question.
www.synapsesocial.com/papers/69e1cf985cdc762e9d8588aa — DOI: https://doi.org/10.1177/00031348261443350