What question did this study set out to answer?

Assess the agreement between AI-generated treatment recommendations and expert consensus in early breast cancer management.

February 19, 2026

Abstract PD11-09: The St. Gallen AI Consensus - Should AI have a vote?

Key Result

AI agreement with expert consensus in early breast cancer treatment varied widely, with ChatGPT highest at 60% and Claude lowest at 26%, showing poor alignment overall.

Key Points

Assessed the agreement between AI-generated treatment recommendations and expert consensus in early breast cancer management.
Analyzed 80 clinical scenarios from the St. Gallen International Breast Cancer Conference voting.
Engaged four Large Language Models (LLMs) to generate treatment options for comparison.
Identified disagreements between AI responses and expert panel recommendations.
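ChatGPT had the highest agreement rate with experts at 60.0% (48/80); Claude had the lowest at 26.3% (21/80).
Agreement rates varied widely by treatment category, from 70.8% for endocrine therapy to 25.0% for radiation therapy.
After being shown the expert recommendations, the LLMs revised between 50% (ChatGPT) and 100% (Claude) of their discordant answers.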
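
The revision figures can be checked against the agreement counts with a short Python sketch. It uses only the counts reported in the abstract; the dictionary and function names are illustrative assumptions, and models without a reported count are left out rather than guessed.

```python
N_SCENARIOS = 80

# Scenarios where each model matched the expert panel (reported in the abstract).
agreed = {"ChatGPT": 48, "DeepSeek": 39, "Claude": 21}

# Discordant answers are the remaining scenarios.
discordant = {model: N_SCENARIOS - k for model, k in agreed.items()}

# Discordant answers revised or left unchanged after unblinding (reported counts only).
revised = {"ChatGPT": 16, "Claude": 59}
unchanged = {"ChatGPT": 16, "DeepSeek": 8}

def share(count: int, total: int) -> float:
    """Fraction of a model's discordant answers, as a value between 0 and 1."""
    return count / total

revised_share = {m: share(n, discordant[m]) for m, n in revised.items()}
unchanged_share = {m: share(n, discordant[m]) for m, n in unchanged.items()}

for model, value in revised_share.items():
    print(f"{model}: revised {value:.0%} of {discordant[model]} discordant answers")
```

The derived denominators (32, 41, and 59) match the ones quoted in the abstract, which suggests the revision percentages are taken over each model's own discordant answers rather than over all 80 scenarios.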

Structured PICO

Do AI-generated treatment recommendations agree with expert panel consensus in early breast cancer management?

Population

80 distinct clinical scenarios from the final voting of the 2025 St. Gallen International Breast Cancer Conference regarding early-stage breast cancer management (excluding genetic risk and DCIS).

Intervention

Four Large Language Models (Claude Sonnet 4, Google Gemini 2.5 flash, ChatGPT-4o, and DeepSeek-V3) prompted to answer clinical scenarios.

Comparator

Expert panel consensus from the 2025 St. Gallen International Breast Cancer Conference.

Outcome

Agreement rate between AI and expert consensus, defined as identical answers for standard questions.
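
As a minimal sketch of this outcome measure, the Python snippet below recomputes the overall agreement rates from the match counts reported in the abstract. The half-up rounding is an assumption chosen because it reproduces the published percentages; the abstract does not state the rounding rule, and the names used here are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# Number of scenarios (out of 80) where each model gave the same answer
# as the expert panel, as reported in the abstract.
AGREEMENTS = {"ChatGPT": 48, "Gemini": 46, "DeepSeek": 39, "Claude": 21}
N_SCENARIOS = 80

def agreement_rate(matches: int, total: int) -> Decimal:
    """Percentage of identical answers, rounded half up to one decimal place.

    Assumption: half-up rounding; the source does not specify the rule.
    """
    pct = Decimal(matches) * 100 / Decimal(total)
    return pct.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

rates = {name: agreement_rate(k, N_SCENARIOS) for name, k in AGREEMENTS.items()}

for name, rate in rates.items():
    print(f"{name}: {rate}% ({AGREEMENTS[name]}/{N_SCENARIOS})")
```

Decimal arithmetic is used because Python's built-in `round` rounds exact ties to the nearest even digit, which would turn 26.25 into 26.2 instead of the reported 26.3.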

There is poor alignment between AI and expert medical consensus in early breast cancer treatment decisions, highlighting current limitations in AI's ability to integrate complex clinical reasoning.

Abstract

Introduction: Early breast cancer treatment requires complex individual risk assessment integrating tumor biology, staging, and patient factors. This complexity creates heterogeneous treatment approaches globally. Expert panels provide consensus guidance, but artificial intelligence may offer advantages including real-time access to all published data and freedom from individual, institutional, cultural, and emotional biases. This study represents the first comprehensive comparison between AI-generated breast cancer treatment recommendations and expert panel consensus across early-stage breast cancer management.

Methods: We analyzed 80 distinct clinical scenarios from the final voting of the 2025 St. Gallen International Breast Cancer Conference. We included breast/axillary surgery, radiation therapy, systemic treatment, elderly care, and recurrence management. We excluded scenarios focusing on genetic risk (e.g. BRCA testing) and DCIS. Clinical scenarios were presented to four Large Language Models (LLMs): Claude Sonnet 4, Google Gemini 2.5 flash, ChatGPT-4o and DeepSeek-V3. Dates of interaction were July 8th and 9th, 2025. We designed three sequential prompts to (1) answer all clinical scenarios selecting single best options, (2) compare AI responses with expert panel percentages and identify disagreements, (3) analyze disagreements with evidence-based rationales for both AI and expert positions. The primary outcome was the agreement rate between AI and expert consensus, defined as identical answers for standard questions.

Results: Overall agreement rates varied substantially: ChatGPT 60.0% (48/80), Gemini 57.5% (46/80), DeepSeek 48.8% (39/80), Claude 26.3% (21/80). Agreement differed dramatically by clinical category, ranging from 70.8% (endocrine therapy, n=6) to 25.0% (radiation therapy, n=5). Further, LLMs showed 56.8% agreement in axillary surgery (n=11), 58.3% in sentinel lymph node omission (n=3), and 35.9% for genomic risk scores (n=16). In a second step we unblinded each LLM to the expert opinion and the answers of the other LLMs and prompted each LLM to review its discordant answers: in 50-100% of questions (ChatGPT 50% (16/32) and Claude 100% (59/59)) the LLMs revised their initial answers and accepted the panelists’ recommendation. Still, 19%-50% of discordant answers (DeepSeek 19% (8/41) and ChatGPT 50% (16/32)) remained unchanged. Further in-depth analysis regarding the evidence-based reasoning of AI will be presented at the meeting.

Conclusion: There was poor alignment between AI and expert medical consensus in early breast cancer treatment decisions. This reveals current limitations in AI's ability to integrate complex and multifactorial clinical reasoning. At the same time, AI offers an unbiased and comprehensive data-driven perspective, effectively serving as a critical mirror for expert panels.

Citation Format: A. Oseledchyk, W. Weber, B. Kasenda. The St. Gallen AI Consensus - Should AI have a vote? [abstract]. In: Proceedings of the San Antonio Breast Cancer Symposium 2025; 2025 Dec 9-12; San Antonio, TX. Philadelphia (PA): AACR; Clin Cancer Res 2026;32(4 Suppl):Abstract nr PD11-09.

Cite this study

Oseledchyk et al. reported that AI agreement with expert consensus in early breast cancer treatment varied widely, with ChatGPT highest at 60% and Claude lowest at 26%, showing poor alignment overall.

www.synapsesocial.com/papers/6996a83eecb39a600b3eec79 — DOI: https://doi.org/10.1158/1557-3265.sabcs25-pd11-09

Authors

Anton Oseledchyk

W. Weber

Benjamin Kasenda

Journals

Clinical Cancer Research

Institutions

University Hospital of Basel
