April 17, 2026 · Open Access

Evaluation of the reliability of large language models for ASA-PS classification in cardiovascular surgery: a pilot study

Key Points

This study aims to assess the reliability of large language models for ASA-PS classification in cardiovascular surgery.
Thirty-two anonymized cases were rated by two residents and two board-certified cardiovascular anesthesiologists.
Four LLM modes were evaluated, two from ChatGPT and two from Gemini.
All LLM assessments were zero-shot.
Agreement was quantified with intraclass correlation coefficients (ICC).
Overall agreement among evaluators was moderate (ICC 0.49–0.52).
Agreement between each LLM and the specialists was good (ICC 0.61–0.65).
Exact-match rates against the specialist consensus were 42.2% for residents and 59.4–75.0% for LLMs.
Classifications outside expert ranges were rare (0–3.1%).

Abstract

Large language models (LLMs) have shown promising performance for ASA Physical Status (ASA-PS) classification, but prior work suggests reduced agreement in high-risk patients. We evaluated LLM reliability for ASA-PS classification in cardiovascular surgery. Thirty-two anonymized cases were rated by two residents, two board-certified cardiovascular anesthesiologists, and four LLM modes (ChatGPT: GPT-5.2 Instant and GPT-5.2 Thinking; Gemini: Gemini 3 Fast and Gemini 3 High Thinking); all LLM assessments were zero-shot. Overall agreement across evaluators was moderate (intraclass correlation coefficient [ICC] 0.49–0.52); agreement between each LLM and specialists was good (ICC 0.61–0.65). Exact-match to a five-specialist consensus was 42.2% for residents versus 59.4–75.0% for LLMs; classifications outside the range of ratings assigned by individual specialists were rare (0–3.1%). In cardiovascular surgery, contemporary LLMs showed good concordance with cardiovascular anesthesiologists and exceeded resident agreement with expert consensus, supporting prospective multicenter validation as adjuncts for ASA-PS assessment and training.
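To make the two rate metrics in the abstract concrete, the following is a minimal Python sketch of how an exact-match rate and an out-of-range rate could be computed. It is not the authors' analysis code. The function names, the integer encoding of ASA-PS classes, and all ratings in the example are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def exact_match_rate(ratings: Sequence[int], consensus: Sequence[int]) -> float:
    """Fraction of cases where a rater's ASA-PS class equals the consensus class."""
    if len(ratings) != len(consensus) or len(ratings) == 0:
        raise ValueError("ratings and consensus must be non-empty and equal in length")
    matches = sum(r == c for r, c in zip(ratings, consensus))
    return matches / len(ratings)


def out_of_range_rate(
    ratings: Sequence[int], specialist_ratings: Sequence[Sequence[int]]
) -> float:
    """Fraction of cases where a rating lies outside the min-max span
    of the classes assigned by the individual specialists for that case."""
    if len(ratings) != len(specialist_ratings) or len(ratings) == 0:
        raise ValueError("ratings and specialist_ratings must be non-empty and equal in length")
    outside = sum(
        not (min(spec) <= r <= max(spec))
        for r, spec in zip(ratings, specialist_ratings)
    )
    return outside / len(ratings)


# Hypothetical data for four cases (NOT from the study).
# ASA-PS classes are encoded as integers, e.g. 3 for class III.
consensus = [3, 4, 3, 4]                         # assumed consensus class per case
specialists = [[3, 3], [3, 4], [3, 4], [4, 4]]   # assumed individual specialist ratings per case
llm = [3, 4, 4, 2]                               # assumed ratings from one LLM mode

print(exact_match_rate(llm, consensus))      # 0.5
print(out_of_range_rate(llm, specialists))   # 0.25
```

In this toy example the third rating misses the consensus but still falls within the span of the specialists' ratings, so it lowers the exact-match rate without counting as out of range. That is the distinction the abstract draws between the two figures.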

Authors

Keisuke Iwabu

Takashi Juri

Shogo Tsujikawa

Journals

JA Clinical Reports

Institutions

Osaka City University

Osaka City University Hospital
