Background. Large language models (LLMs) are increasingly proposed as clinical decision support tools in intensive care. However, most existing evaluations focus on static medical knowledge recall and do not account for model behavior under dynamic clinical dialogue, social pressure, or ethical conflict. The safety of LLM-based systems under these conditions remains insufficiently characterized.

Methods. We conducted a large-scale comparative benchmark of 12 language models across more than 58,000 automated consultation logs generated within a three-role multi-agent architecture (Clinician, Expert, Judge). Clinical inference was performed locally on consumer-grade hardware. Evaluation was conducted by two independent LLM-based judges: a local GPT-OSS model (35,149 evaluations) and the Gemini API (23,300 evaluations). Testing covered 3 ethical profiles, 4 case types, and 9 clinical domains. The final GOLDEN dataset (n = 3,680) was assembled using four quality criteria, with priority given to the stricter judge.

Results. Model accuracy ranged from 11.1% to 77.0%. Specialized medical models systematically underperformed general-purpose models (11.1–13.2% vs. 70–77%). We identified an ethics-memory dissociation phenomenon: memory failure rates increased from 1.8% in routine scenarios to 28.2% under ethical pressure (15.7×). Sycophancy under targeted pressure reached 40.5% vs. 8.1% in standard scenarios. The most restrictive ethical profile (strict_v1) demonstrated the lowest accuracy (45.9%), a paranoia-overfitting phenomenon quantified by the proposed GQI metric.

Conclusions. Clinical LLM safety is determined not by medical specialization or restrictiveness of ethical constraints, but by architectural resilience to social manipulation and memory integrity under pressure. The proposed case type taxonomy and GQI metric may serve as a methodological foundation for standardized clinical AI validation.
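The relative effects quoted in the Results can be rechecked from the reported percentages alone. The sketch below is only an arithmetic illustration, not the authors' analysis code. The helper name `fold_increase` and the variable names are hypothetical, and the inputs are the figures stated in the abstract.

```python
def fold_increase(baseline_pct: float, stressed_pct: float) -> float:
    """Return how many times larger the stressed rate is than the baseline rate."""
    if baseline_pct <= 0:
        raise ValueError("baseline rate must be positive")
    return stressed_pct / baseline_pct


# Percentages as reported in the abstract.
memory_fold = fold_increase(1.8, 28.2)      # memory failures: routine vs. ethical pressure
sycophancy_fold = fold_increase(8.1, 40.5)  # sycophancy: standard vs. targeted pressure

# Evaluation counts reported for the two judges (local GPT-OSS and Gemini API).
judge_total = 35_149 + 23_300

print(f"memory failure fold increase: {memory_fold:.1f}x")
print(f"sycophancy fold increase: {sycophancy_fold:.1f}x")
print(f"total judge evaluations: {judge_total}")
```

Under these inputs the memory-failure ratio matches the reported 15.7×, the sycophancy rate rises roughly fivefold, and the two judges' evaluation counts sum to a figure consistent with "more than 58,000" logs.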
Taras Shlyakhta (Mon,) studied this question.
www.synapsesocial.com/papers/69f1a051edf4b46824807032 — DOI: https://doi.org/10.5281/zenodo.19828812
Taras Shlyakhta
Linde (United States)
Galena Biopharma (United States)