March 3, 2026

Can AI generate safe anaesthesia plans? A comparative evaluation of three large language models on 100 synthetic cases

Key Points

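Large language models (LLMs) can generate preoperative anaesthetic plans for routine cases, but their reliability remains insufficient for high-risk or complex patients.
Three LLMs (ChatGPT, Mistral, and Dougall GPT) were compared on 100 synthetic cases for guideline adherence and clinical safety.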

Structured PICO

Do large language models generate safe and guideline-concordant preoperative anaesthetic plans compared to expert anaesthesiologists in synthetic clinical cases?

Population

100 synthetic clinical cases spanning various surgical specialties

Intervention

Preoperative anaesthetic assessment and plan generation using three Large Language Models (ChatGPT, Mistral, and Dougall GPT)

Comparator

Expert anaesthesiologist reference plans

Outcome

Guideline adherence and clinical safety, measured using a 0-4 expert-derived ordinal scale per domain

LLMs demonstrate potential for routine preoperative anaesthetic planning but are currently insufficiently reliable for high-risk or complex patients.

Abstract

BACKGROUND Preoperative anaesthetic consultation is essential for perioperative care, involving risk assessment, treatment optimisation, and planning of anaesthetic strategies according to established guidelines. Large language models (LLMs) could offer decision support in this setting, but their autonomous capability to generate comprehensive, guideline-based anaesthetic plans remains unassessed in France and uncertain worldwide.

METHODS In this simulation study, 100 synthetic clinical cases spanning various surgical specialties were evaluated. Three AI models (ChatGPT, Mistral, and a domain-specific LLM, Dougall GPT) were prompted to generate preoperative anaesthetic assessments and plans based on French guidelines. Outputs were compared with expert anaesthesiologist reference plans using structured expert scoring across multiple domains, including guideline adherence and clinical safety.

RESULTS Among 1,200 evaluated data fields across 100 cases, ChatGPT showed the highest overall guideline conformity, measured using a 0-4 expert-derived ordinal scale per domain (4 = guideline-concordant). ChatGPT provided the most complete outputs (98% of requested items) and achieved the highest median agreement scores in seven of the 12 anaesthesia domains. Dougall GPT performed moderately, whereas Mistral LeChat showed lower conformity and the highest proportion of unsafe or potentially unsafe outputs (scores ≤2).

CONCLUSIONS Current LLMs demonstrate encouraging potential to support preoperative anaesthetic planning for routine cases. However, their reliability remains insufficient for high-risk or complex patients without further fine-tuning and safety controls. These findings underscore both the potential and the current limitations of AI in perioperative decision support.
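The scoring arithmetic described in the Results (a 0-4 ordinal score per domain, median agreement per domain, and scores ≤2 counted as unsafe or potentially unsafe) can be illustrated with a minimal Python sketch. This is not the authors' analysis code: the function name, the domain names, and the toy scores below are hypothetical and chosen only for illustration. The abstract does not list the 12 domains or report per-case scores.

```python
from statistics import median

# From the abstract: scores <= 2 are treated as unsafe or potentially unsafe,
# and 4 means guideline-concordant.
UNSAFE_THRESHOLD = 2
MAX_SCORE = 4


def summarise_scores(scores_by_domain):
    """Return (median score per domain, fraction of fields scored <= 2).

    scores_by_domain maps a domain name to a list of 0-4 expert scores,
    one per case. Both the mapping layout and the domain names are
    assumptions made for this sketch.
    """
    medians = {}
    total_fields = 0
    unsafe_fields = 0
    for domain, scores in scores_by_domain.items():
        for score in scores:
            if not 0 <= score <= MAX_SCORE:
                raise ValueError(f"score {score} is outside the 0-4 scale")
        medians[domain] = median(scores)
        total_fields += len(scores)
        unsafe_fields += sum(1 for s in scores if s <= UNSAFE_THRESHOLD)
    unsafe_fraction = unsafe_fields / total_fields if total_fields else 0.0
    return medians, unsafe_fraction


# Toy input: two hypothetical domains, four hypothetical cases each.
# These numbers are invented for illustration and are NOT study data.
toy_scores = {
    "airway": [4, 4, 3, 2],
    "fasting": [4, 1, 3, 4],
}
medians, unsafe_fraction = summarise_scores(toy_scores)
print(medians)
print(unsafe_fraction)
```

With the study's design, the same function would receive 12 domains with 100 scores each, which gives the 1,200 evaluated data fields mentioned in the Results.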

Cite This Study

Jarrassier et al. studied this question.

synapsesocial.com/papers/69a76772badf0bb9e87e0f77 https://doi.org/10.1016/j.accpm.2026.101769
