Medical large language models (LLMs) that achieve high benchmark accuracy exhibit unexplained variability in clinical tasks, producing errors that clinicians cannot safeguard against. We evaluated the stability of clinical reasoning in GPT-5, MedGemma-27B-Text-IT, and OpenBioLLM-Llama3-70B using 355 systematic perturbations of physician-validated oncology cases, and we trained sparse autoencoders on 1 billion tokens from 50,000 MIMIC-IV clinical notes to decompose the models' internal representations. We find that models exhibit dramatic reasoning instability: staging accuracy shifts by over 50% based solely on prompt format, and models generate definitive staging in scenarios with clinically insufficient information. Sparse autoencoder analysis revealed hierarchical encoding in MedGemma, where high-magnitude features encode lexical identity and low-magnitude features encode contextual meaning. In contrast, OpenBioLLM distributes information uniformly. We demonstrate that these internal encoding structures differentially affect retrieval interventions, suggesting that an intervention effective for one architecture may harm another. Because benchmark equivalence does not imply functional equivalence, we recommend that healthcare institutions implement architecture-specific safety validation. These findings have implications for AI safety beyond healthcare.
Mirage Modi
Jordan E. Krull
Donte Johnson
The Ohio State University
The Ohio State University Comprehensive Cancer Center – Arthur G. James Cancer Hospital and Richard J. Solove Research Institute
Modi et al. studied this question.
www.synapsesocial.com/papers/69a75b87c6e9836116a22eff — DOI: https://doi.org/10.64898/2026.01.26.26344845