What question did this study set out to answer?

This study aims to evaluate the effectiveness of a large language model in managing visceral aneurysms and to explore an automated validation framework for AI responses.

April 17, 2026

AI Assessment and Management of Visceral Aneurysms Using ChatGPT-4o-mini: A Pilot Study Examining the Feasibility of Automating the AI Validation Process

Key Points

This study aims to evaluate the effectiveness of a large language model in managing visceral aneurysms and to explore an automated validation framework for AI responses.
Used Python with the Pandas library and the OpenAI API.
Reviewed the Society for Vascular Surgery guidelines on visceral aneurysm management.
Generated clinical scenarios and management strategies using ChatGPT-4o-mini.
Employed a four-tier rubric for assessing AI responses, with human evaluators providing independent assessments.
ChatGPT-4o-mini self-assessed 89% of responses as correct (scores 1-2), versus 67% by human evaluators.
The greatest discrepancy was in the partially correct response category.
56% of AI-generated questions were deemed good quality, while 44% were considered leading.
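
The rubric comparison in the points above can be illustrated with a minimal Python sketch. It assumes each response carries one AI self-score and one human score on the four-tier scale, and that scores 1 and 2 are grouped as "correct", as the abstract describes. The function names and the score lists are hypothetical toy values for illustration, not the study's code or data.

```python
from collections import Counter

# Four-tier rubric as described in the abstract.
RUBRIC = {
    1: "completely correct",
    2: "partially correct",
    3: "partially incorrect",
    4: "no correct information",
}


def tier_counts(scores):
    """Count how many responses fall into each rubric tier (1-4)."""
    counts = Counter(scores)
    return {tier: counts.get(tier, 0) for tier in RUBRIC}


def correct_rate(scores):
    """Share of responses scored 1 or 2 (the 'correct' grouping)."""
    if not scores:
        raise ValueError("scores must not be empty")
    return sum(1 for s in scores if s in (1, 2)) / len(scores)


# Illustrative toy scores only, NOT data from the study.
ai_scores = [1, 1, 2, 1, 2, 1, 3, 1]
human_scores = [1, 2, 3, 1, 3, 2, 4, 1]

print("AI tiers:   ", tier_counts(ai_scores))
print("Human tiers:", tier_counts(human_scores))
print("AI correct rate:   ", correct_rate(ai_scores))
print("Human correct rate:", correct_rate(human_scores))
```

Keeping the per-tier counts alongside the collapsed rate makes it possible to see where two raters diverge, which a single correctness percentage would hide.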

Abstract

Introduction: Visceral aneurysms pose diagnostic and therapeutic challenges in vascular surgery. Large language models (LLMs) may assist in clinical decision-making, but their application requires rigorous validation. Traditional validation methods are labor-intensive and difficult to scale.

Objective: We examined the capability of an LLM in managing visceral aneurysms and explored an automated framework for validating AI-generated clinical responses.

Methods: Using Python with the Pandas library and OpenAI API, we probed the Society for Vascular Surgery (SVS) clinical practice guidelines on visceral aneurysm management. ChatGPT-4o-mini was instructed to review guideline recommendations, generate clinical scenarios, propose management strategies, and evaluate its own responses using a four-tier rubric (1 = completely correct; 2 = partially correct; 3 = partially incorrect; 4 = no correct information). Human evaluators independently assessed the same responses and graded questions as good, fair, or poor and whether they were leading.

Results: Eighty visceral aneurysm scenarios were generated and evaluated. ChatGPT-4o-mini self-assessed 89% of responses as correct (scores 1-2), compared to 67% by human evaluators (chi-square, P < 0.0001), with the greatest discrepancy in the partially correct category. Most AI-generated questions were of good quality (56%), though 44% were considered leading questions.

Conclusion: An automated validation framework for AI-generated clinical responses is feasible. However, the 67% correctness rate and systematic AI self-overestimation indicate that current LLMs remain unsuitable for independent clinical use, reinforcing the need for expert oversight. The integration of Python-driven automation, structured AI inference, and expert review holds promise for increasing the efficiency of evaluating LLMs at scale across clinical domains.
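
The abstract reports a chi-square comparison between AI and human ratings but does not give the underlying counts or say how the table was constructed. As a rough sketch only, the following Python computes the Pearson chi-square statistic and degrees of freedom for a rater-by-tier count table. The function name `chi_square_statistic` and the table values are hypothetical and are not taken from the study.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    if total == 0 or 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs at least one observation")

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rater and rubric tier.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Illustrative toy counts only, NOT data from the study.
# Rows: AI self-assessment, human assessment. Columns: rubric tiers 1 to 4.
toy_table = [
    [5, 2, 1, 0],
    [3, 2, 2, 1],
]

stat, dof = chi_square_statistic(toy_table)
print(f"chi-square = {stat:.3f}, degrees of freedom = {dof}")
```

In practice a p-value would usually come from a statistics library, for example `scipy.stats.chi2_contingency`, rather than a hand-rolled routine; the version above only shows what the statistic measures.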

Authors

Tamir Bresler

Sari Lada

Tyler Wilson

Journals

The American Surgeon

Institutions

Los Robles Hospital & Medical Center
