What question did this study set out to answer?

The study assessed how closely AI-generated adjuvant treatment recommendations agree with those of a medical oncologist for HR+/HER2- early-stage breast cancer when genomic assay results are unavailable.

May 6, 2026
Open Access

Large Language Models as Decision-support Tools for Adjuvant Therapy Planning in Early-stage Hormone Receptor-positive Breast Cancer.

Key Points

Aimed to assess agreement between AI-generated treatment recommendations and those of a medical oncologist for HR+/HER2- early-stage breast cancer when genomic assay results were unavailable.
Analyzed clinical and pathological data from 411 patients with HR+/HER2- breast cancer.
Used large language models (ChatGPT-4o and ChatGPT-o3) to generate treatment recommendations based on established guidelines.
Evaluated agreement using Fleiss's and Cohen's kappa statistics, along with Cochran's Q test.
Overall agreement among the clinician and the two models was substantial (κ=0.67).
Agreement was moderate between the clinician and ChatGPT-4o (κ=0.60) and between the clinician and ChatGPT-o3 (κ=0.55).
Agreement between the two models was almost perfect (κ=0.88).
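
To make the pairwise figures above easier to interpret, the following is a minimal sketch of Cohen's kappa for two raters. It is not the authors' analysis code: the function names are hypothetical, the ten ratings are invented toy data rather than study data, and the verbal labels assume the widely used Landis and Koch bands, which are consistent with the wording in this summary but which the source does not name.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given identical labels.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_chance == 1.0:
        raise ValueError("kappa is undefined when both raters use one label")
    return (p_observed - p_chance) / (1 - p_chance)


def landis_koch_label(kappa):
    """Verbal label under the Landis and Koch bands (an assumed scale)."""
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Invented toy ratings for ten hypothetical patients (NOT study data).
clinician = ["CT+ET", "CT+ET", "ET", "ET", "ET", "CT+ET", "ET", "ET", "CT+ET", "ET"]
model = ["CT+ET", "ET", "ET", "ET", "ET", "CT+ET", "ET", "CT+ET", "CT+ET", "ET"]

kappa = cohen_kappa(clinician, model)
print(f"kappa = {kappa:.2f} ({landis_koch_label(kappa)})")
```

Kappa corrects raw percentage agreement for the agreement expected by chance, which is why two raters who agree on most patients can still receive only a moderate score when one recommendation dominates.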

Abstract

Background/Aim: Adjuvant treatment decisions in hormone receptor-positive (HR+), HER2-negative early-stage breast cancer are frequently guided by multigene assays; however, limited access to genomic testing remains a significant challenge, particularly in resource-limited settings. This study aimed to evaluate the concordance between adjuvant treatment recommendations generated by large language models (ChatGPT-4o and ChatGPT-o3) and those of an experienced medical oncologist for patients with HR+/HER2- early-stage breast cancer when genomic assay results were unavailable.

Patients and Methods: Clinical and pathological data from 411 patients with HR+/HER2- early-stage breast cancer were provided to ChatGPT-4o and ChatGPT-o3. Both models generated an adjuvant treatment recommendation, either chemotherapy plus endocrine therapy (CT+ET) or endocrine therapy alone (ET), based on ESMO and NCCN guidelines. These recommendations were compared with those of a medical oncologist. Agreement was assessed using Fleiss's and Cohen's kappa statistics, and differences among evaluators were analyzed using Cochran's Q test.

Results: Overall agreement among the clinician and the two models was substantial (κ=0.67). Moderate agreement was observed between the clinician and ChatGPT-4o (κ=0.60) and between the clinician and ChatGPT-o3 (κ=0.55). Agreement between the two language models was almost perfect (κ=0.88). ChatGPT-4o demonstrated closer alignment with clinical judgment than ChatGPT-o3.

Conclusion: Large language models showed substantial concordance with clinician decision-making in adjuvant therapy planning for HR+/HER2- early-stage breast cancer in the absence of genomic testing. These findings suggest that such models may serve as supportive decision-making tools rather than independent decision-makers, particularly in settings with limited access to multigene assays.
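
The three-rater statistics named in the methods can be illustrated with a short sketch of Fleiss's kappa and Cochran's Q. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the ratings are invented toy data rather than study data, and recommendations are assumed to be coded 1 for CT+ET and 0 for ET.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss's kappa; each row holds every rater's label for one item."""
    n_items, n_raters = len(ratings), len(ratings[0])
    if any(len(row) != n_raters for row in ratings):
        raise ValueError("every item needs the same number of raters")
    category_totals = Counter()
    per_item_agreement = 0.0
    for row in ratings:
        counts = Counter(row)
        category_totals.update(counts)
        # Share of rater pairs that agree on this item.
        per_item_agreement += (
            sum(c * c for c in counts.values()) - n_raters
        ) / (n_raters * (n_raters - 1))
    p_bar = per_item_agreement / n_items
    total = n_items * n_raters
    # Chance agreement from the pooled category proportions.
    p_expected = sum((c / total) ** 2 for c in category_totals.values())
    return (p_bar - p_expected) / (1 - p_expected)


def cochran_q(binary_ratings):
    """Cochran's Q and its degrees of freedom for 0/1 ratings."""
    k = len(binary_ratings[0])
    col_totals = [sum(row[j] for row in binary_ratings) for j in range(k)]
    row_totals = [sum(row) for row in binary_ratings]
    grand_total = sum(row_totals)
    denominator = k * grand_total - sum(r * r for r in row_totals)
    if denominator == 0:
        raise ValueError("Q is undefined when every item is rated unanimously")
    q = (k - 1) * (k * sum(c * c for c in col_totals) - grand_total ** 2) / denominator
    return q, k - 1


# Invented toy data (NOT study data): one row per hypothetical patient,
# columns are (clinician, ChatGPT-4o, ChatGPT-o3); 1 = CT+ET, 0 = ET.
ratings = [
    (1, 1, 1), (1, 0, 0), (0, 0, 0), (0, 0, 0), (0, 0, 0),
    (1, 1, 1), (0, 0, 0), (0, 1, 1), (1, 1, 0), (0, 0, 0),
]

overall_kappa = fleiss_kappa(ratings)
q_stat, dof = cochran_q(ratings)
print(f"Fleiss kappa = {overall_kappa:.3f}, Cochran Q = {q_stat:.3f} (df = {dof})")
```

Fleiss's kappa summarizes agreement across all three evaluators at once, while Cochran's Q asks a different question: whether the evaluators recommend CT+ET at different rates. Under the null hypothesis Q is compared against a chi-square distribution with one fewer degrees of freedom than the number of raters.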
