8535 Background: Publicly available large language models (LLMs) such as ChatGPT may support clinical decision-making in precision oncology by rapidly synthesizing complex information. We previously evaluated ChatGPT’s ability to generate NCCN-concordant first-line (1L) treatment recommendations for metastatic non-small cell lung cancer (mNSCLC) using a novel Generative AI Performance Score (G-PS). Here, we evaluated whether NCCN guideline-based training improves LLM treatment recommendations across multiple lines of therapy and targetable genotypes in mNSCLC. Methods: NCCN Guidelines (v3.2025) were reviewed, and eight driver alterations with FDA-approved therapies were selected: EGFR Ex19del, BRAF V600E, ALK fusion, KRAS G12C, NTRK1/2/3 fusion, ROS1 fusion, RET fusion, and MET exon 14 skipping. Standardized prompts requesting 1L, second-line (2L), and third-line (3L) recommendations were generated and run through ChatGPT-5.2. Prompts included patient demographics, disease stage, and, where appropriate, prior therapy. Each scenario was repeated five times per line of treatment (N = 15 per mutation). In trained sessions, the LLM was explicitly instructed to defer to an uploaded PDF copy of the NCCN guidelines before generating recommendations. Responses were scored using the G-PS, which quantifies guideline concordance on a continuous scale from -1 (all hallucinations) to 1 (all correct answers) based on alignment with NCCN-recommended therapies. Additionally, we calculated the ratio of the mean trained G-PS to the mean untrained G-PS across groups (the “Training Ratio”) to estimate the relative fold change in ChatGPT performance attributable to training. Results: A total of 240 prompts were analyzed (120 untrained, 120 trained).
NCCN-guided training significantly improved overall LLM guideline concordance (mean G-PS 0.462 vs 0.313 for trained vs untrained sessions, p = 0.049) and reduced irrelevant recommendations (mean irrelevant rate 23.7% vs 39.8%, p […] 4) and the poorest in NTRK fusions (0.53). Conclusions: While guideline-based training improves ChatGPT’s overall performance and reduces irrelevant outputs, recommendation quality declines substantially beyond the first line of therapy and varies markedly by mutation. These findings highlight important limitations of LLMs in complex oncology decision-making and reinforce that clinicians must independently verify recommendations against established guidelines.
Gould et al. (Thu,) studied this question.
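The prompt counts and the Training Ratio described in the Methods can be checked with a short sketch. This is an illustration rather than the authors' code: the names `build_design` and `training_ratio` are hypothetical, and the overall ratio computed here is derived from the two reported mean G-PS values, not a figure reported in the abstract.

```python
from itertools import product

# Design parameters taken from the Methods section.
ALTERATIONS = [
    "EGFR Ex19del", "BRAF V600E", "ALK fusion", "KRAS G12C",
    "NTRK1/2/3 fusion", "ROS1 fusion", "RET fusion", "MET exon 14 skipping",
]
LINES = ["1L", "2L", "3L"]
REPEATS = 5
ARMS = ["untrained", "trained"]


def build_design():
    """Enumerate every prompt run: arm x alteration x line x repeat (hypothetical helper)."""
    return [
        {"arm": arm, "alteration": alt, "line": line, "repeat": rep}
        for arm, alt, line, rep in product(
            ARMS, ALTERATIONS, LINES, range(1, REPEATS + 1)
        )
    ]


def training_ratio(mean_trained, mean_untrained):
    """Fold change of mean trained G-PS over mean untrained G-PS (hypothetical helper)."""
    if mean_untrained == 0:
        raise ValueError("Training Ratio is undefined when the untrained mean is 0")
    return mean_trained / mean_untrained


design = build_design()
per_arm = sum(1 for run in design if run["arm"] == "trained")
per_alteration = sum(
    1 for run in design
    if run["arm"] == "trained" and run["alteration"] == "KRAS G12C"
)

# Overall ratio derived from the reported means (0.462 trained, 0.313 untrained).
overall_ratio = training_ratio(0.462, 0.313)

print(len(design), per_arm, per_alteration)
print(round(overall_ratio, 2))
```

Because the G-PS ranges from -1 to 1, a ratio of means is only straightforward to read as a fold change when both means are positive. A ratio below 1, such as the 0.53 reported for NTRK fusions, indicates that the trained mean was lower than the untrained mean for that group.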