Large language models (LLMs) have shown promising performance for American Society of Anesthesiologists Physical Status (ASA-PS) classification, but prior work suggests reduced agreement in high-risk patients. We evaluated LLM reliability for ASA-PS classification in cardiovascular surgery. Thirty-two anonymized cases were rated by two residents, two board-certified cardiovascular anesthesiologists, and four LLM modes (ChatGPT: GPT-5.2 Instant and GPT-5.2 Thinking; Gemini: Gemini 3 Fast and Gemini 3 High Thinking); all LLM assessments were zero-shot. Overall agreement across evaluators was moderate (intraclass correlation coefficient [ICC] 0.49–0.52); agreement between each LLM and the specialists was good (ICC 0.61–0.65). Exact-match agreement with a five-specialist consensus was 42.2% for residents versus 59.4–75.0% for LLMs; LLM classifications outside the range of ratings assigned by individual specialists were rare (0–3.1%). In cardiovascular surgery, contemporary LLMs showed good concordance with cardiovascular anesthesiologists and exceeded resident agreement with expert consensus, supporting prospective multicenter validation of LLMs as adjuncts for ASA-PS assessment and training.
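The two descriptive metrics quoted above (exact match to consensus, and ratings falling outside the range given by individual specialists) can be illustrated with a minimal Python sketch. The function names and the four-case toy data are hypothetical and are not taken from the study; the abstract does not describe how the authors computed these figures, so this is one plausible reading only.

```python
from typing import Sequence


def exact_match_rate(ratings: Sequence[int], consensus: Sequence[int]) -> float:
    """Fraction of cases where a rater's ASA-PS class equals the consensus class."""
    if not ratings or len(ratings) != len(consensus):
        raise ValueError("ratings and consensus must be non-empty and equal in length")
    matches = sum(r == c for r, c in zip(ratings, consensus))
    return matches / len(ratings)


def out_of_range_rate(
    ratings: Sequence[int], specialist_ratings: Sequence[Sequence[int]]
) -> float:
    """Fraction of cases where a rating lies outside [min, max] of the specialists' ratings."""
    if not ratings or len(ratings) != len(specialist_ratings):
        raise ValueError("ratings and specialist_ratings must be non-empty and equal in length")
    outside = sum(
        not (min(spec) <= r <= max(spec))
        for r, spec in zip(ratings, specialist_ratings)
    )
    return outside / len(ratings)


# Hypothetical toy data: four cases, ASA-PS classes as integers.
consensus = [3, 4, 3, 4]                        # assumed consensus class per case
specialists = [[3, 3], [3, 4], [3, 4], [4, 4]]  # assumed individual specialist ratings per case
llm = [3, 4, 4, 5]                              # assumed ratings from one LLM mode

match_rate = exact_match_rate(llm, consensus)
outside_rate = out_of_range_rate(llm, specialists)
print(f"exact match: {match_rate:.1%}, outside specialist range: {outside_rate:.1%}")
```

With 32 cases, a single out-of-range rating would give 1/32, or about 3.1%, which matches the upper bound reported above; whether the authors used exactly this denominator is an assumption.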
Keisuke Iwabu
Takashi Juri
Shogo Tsujikawa
JA Clinical Reports
Osaka City University
Osaka City University Hospital
Iwabu et al. studied this question.
www.synapsesocial.com/papers/69e1cf985cdc762e9d85886c — DOI: https://doi.org/10.1186/s40981-026-00858-4