AI agreement with expert consensus in early breast cancer treatment varied widely, with ChatGPT highest at 60% and Claude lowest at 26%, showing poor alignment overall.
Do AI-generated treatment recommendations agree with expert panel consensus in early breast cancer management?
80 distinct clinical scenarios from the final voting of the 2025 St. Gallen International Breast Cancer Conference regarding early-stage breast cancer management (excluding genetic risk and DCIS).
Four large language models (Claude Sonnet 4, Google Gemini 2.5 Flash, ChatGPT-4o, and DeepSeek-V3) prompted to answer the clinical scenarios.
Expert panel consensus from the 2025 St. Gallen International Breast Cancer Conference.
Agreement rate between AI and expert consensus, defined as identical answers for standard questions.
There is poor alignment between AI and expert medical consensus in early breast cancer treatment decisions, highlighting current limitations in AI's ability to integrate complex clinical reasoning.
Abstract
Introduction: Early breast cancer treatment requires complex individual risk assessment integrating tumor biology, staging, and patient factors. This complexity creates heterogeneous treatment approaches globally. Expert panels provide consensus guidance, but artificial intelligence may offer advantages including real-time access to all published data and freedom from individual, institutional, cultural, and emotional biases. This study represents the first comprehensive comparison between AI-generated breast cancer treatment recommendations and expert panel consensus across early-stage breast cancer management.
Methods: We analyzed 80 distinct clinical scenarios from the final voting of the 2025 St. Gallen International Breast Cancer Conference. We included breast/axillary surgery, radiation therapy, systemic treatment, elderly care, and recurrence management. We excluded scenarios focusing on genetic risk (e.g., BRCA testing) and DCIS. Clinical scenarios were presented to four large language models (LLMs): Claude Sonnet 4, Google Gemini 2.5 Flash, ChatGPT-4o, and DeepSeek-V3. Interactions took place on July 8 and 9, 2025. We designed three sequential prompts to (1) answer all clinical scenarios by selecting the single best option, (2) compare AI responses with expert panel percentages and identify disagreements, and (3) analyze disagreements with evidence-based rationales for both AI and expert positions. The primary outcome was the agreement rate between AI and expert consensus, defined as identical answers for standard questions.
Results: Overall agreement rates varied substantially: ChatGPT 60.0% (48/80), Gemini 57.5% (46/80), DeepSeek 48.8% (39/80), Claude 26.3% (21/80). Agreement differed markedly by clinical category, ranging from 70.8% (endocrine therapy, n=6) to 25.0% (radiation therapy, n=5). Further, LLMs showed 56.8% agreement in axillary surgery (n=11), 58.3% in sentinel lymph node omission (n=3), and 35.9% for genomic risk scores (n=16).
In a second step, we unblinded each LLM to the expert opinion and to the answers of the other LLMs and prompted it to review its discordant answers. The LLMs revised their initial answers and accepted the panelists' recommendation in 50% (ChatGPT, 16/32) to 100% (Claude, 59/59) of discordant questions. Still, 19% (DeepSeek, 8/41) to 50% (ChatGPT, 16/32) of discordant answers remained unchanged. Further in-depth analysis of the evidence-based reasoning of AI will be presented at the meeting.
Conclusion: There was poor alignment between AI and expert medical consensus in early breast cancer treatment decisions. This reveals current limitations in AI's ability to integrate complex and multifactorial clinical reasoning. At the same time, AI offers an unbiased and comprehensive data-driven perspective, effectively serving as a critical mirror for expert panels.
Citation Format: A. Oseledchyk, W. Weber, B. Kasenda. The St. Gallen AI Consensus - Should AI have a vote? [abstract]. In: Proceedings of the San Antonio Breast Cancer Symposium 2025; 2025 Dec 9-12; San Antonio, TX. Philadelphia (PA): AACR; Clin Cancer Res 2026;32(4 Suppl):Abstract nr PD11-09.
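The percentages reported in the abstract can be recomputed from the stated counts as a sanity check on the arithmetic. The Python sketch below is not part of the study. The helper name `pct` is hypothetical, and half-up rounding is an assumption chosen because it reproduces the one-decimal figures as printed (21/80 = 26.25% is reported as 26.3%).

```python
from decimal import Decimal, ROUND_HALF_UP

TOTAL_SCENARIOS = 80  # clinical scenarios analyzed, per the abstract

# Concordant answers per model, as reported in the Results.
concordant = {"ChatGPT": 48, "Gemini": 46, "DeepSeek": 39, "Claude": 21}


def pct(numerator, denominator):
    """Percentage to one decimal place, rounding halves up (assumed convention)."""
    value = Decimal(numerator) * 100 / Decimal(denominator)
    return value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Agreement rate: identical answers divided by all scenarios.
agreement = {model: pct(k, TOTAL_SCENARIOS) for model, k in concordant.items()}

# Discordant answers: the remainder, used as denominators in the review step.
discordant = {model: TOTAL_SCENARIOS - k for model, k in concordant.items()}

for model in concordant:
    print(f"{model}: {agreement[model]}% agreement, {discordant[model]} discordant")
```

The derived discordant counts for ChatGPT (32), Claude (59), and DeepSeek (41) match the denominators quoted for the review step (16/32, 59/59, 8/41).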
Oseledchyk et al. reported that AI agreement with expert consensus in early breast cancer treatment varied widely, with ChatGPT highest at 60% and Claude lowest at 26%, showing poor alignment overall.
www.synapsesocial.com/papers/6996a83eecb39a600b3eec79 — DOI: https://doi.org/10.1158/1557-3265.sabcs25-pd11-09
Anton Oseledchyk
Norbert Wey
Benjamin Kasenda
Clinical Cancer Research
University Hospital of Basel