What question did this study set out to answer?

To assess the performance of three large language models in automating TNM staging from PET-CT reports in multiple cancer types.

March 7, 2026 · Open Access

Evaluating large language models for automated TNM staging from PET-CT reports: a multi-cancer comparative study

Key Points

To assess the performance of three large language models in automating TNM staging from PET-CT reports in multiple cancer types.
Analyzed PET-CT reports from 552 treatment-naive cancer patients.
Evaluated three ChatGPT LLMs and five junior radiologists for TNM staging.
Compared accuracy against reference standards set by senior radiologists according to the AJCC staging system.
ChatGPT 5 achieved the highest accuracy at 82.1%.
ChatGPT 5 processed cases significantly faster than junior radiologists (8.3 s vs. 92.5 s per case).
T staging showed the best performance across all models, especially in lung cancer.

Abstract

Purpose To evaluate three large language models (LLMs), ChatGPT 5, ChatGPT 4o, and ChatGPT 3.5, in automating TNM staging from PET-CT reports across six cancer types, and to assess their clinical utility compared with junior radiologists.
Materials and methods PET-CT reports from 552 treatment-naive patients at two institutions with confirmed primary malignancies (lung, breast, liver, pancreatic, renal, and prostate cancer) were analyzed. Three ChatGPT-series LLMs and five junior radiologists independently performed TNM staging. Reference standards were established by two senior radiologists according to the 8th edition of the American Joint Committee on Cancer (AJCC) staging system. Performance was evaluated using accuracy rates. Intra-model agreement was assessed by running each model three times per report with identical prompts, and inter-model agreement was evaluated using Cohen's κ coefficients.
Results ChatGPT 5 achieved the highest overall accuracy (82.1%, 453/552), followed by ChatGPT 4o (74.3%, 410/552), both significantly outperforming ChatGPT 3.5 (59.6%, 329/552). Junior radiologists reached 77.0% (425/552), which ChatGPT 5 significantly exceeded (p = 0.041). Accuracy varied by cancer type, with the highest performance in lung cancer staging (88.5%) and the lowest in pancreatic cancer (69.2%). Across TNM categories, all models performed best in T staging, followed by N staging, with M staging remaining the most challenging. ChatGPT 5 showed near-perfect intra-model agreement (κ = 0.96), while inter-model agreement ranged from moderate between ChatGPT 3.5 and 4o (κ = 0.58) to substantial between ChatGPT 5 and 4o (κ = 0.78). ChatGPT 5 processed cases markedly faster than junior radiologists (8.3 ± 3.2 vs. 92.5 ± 21.7 s per case; p < 0.001).
Conclusion Among the three LLMs, ChatGPT 5 demonstrated the highest accuracy, stability, and efficiency in automated TNM staging from PET-CT reports, achieving performance comparable to or slightly exceeding that of junior radiologists. Its advantages in T staging and lung cancer evaluation highlight its clinical utility as a potential decision-support tool.
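
The two metrics named in the abstract, accuracy rate and Cohen's κ, can be illustrated with a minimal Python sketch. This is not the authors' analysis code. The correct-case counts are taken from the abstract, while the function names (`accuracy`, `cohen_kappa`) and the two short label sequences are hypothetical and serve only to show how the calculation works.

```python
from collections import Counter

# Correct-case counts as reported in the abstract (out of 552 reports).
TOTAL_REPORTS = 552
CORRECT_CASES = {
    "ChatGPT 5": 453,
    "ChatGPT 4o": 410,
    "ChatGPT 3.5": 329,
    "Junior radiologists": 425,
}


def accuracy(n_correct, n_total):
    """Fraction of cases staged in agreement with the reference standard."""
    return n_correct / n_total


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: share of cases where both raters give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


for rater, n_correct in CORRECT_CASES.items():
    print(f"{rater}: {accuracy(n_correct, TOTAL_REPORTS):.1%}")

# Hypothetical T-stage outputs from two models on six reports (illustration only).
model_x = ["T1", "T2", "T2", "T3", "T1", "T4"]
model_y = ["T1", "T2", "T3", "T3", "T1", "T2"]
print(f"kappa = {cohen_kappa(model_x, model_y):.3f}")
```

Cohen's κ is defined for pairs of raters, so a three-run intra-model comparison would need pairwise κ values or a multi-rater statistic; the abstract does not say which approach was used.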

Authors

Wen Xu

Lixiu Cao

Qijun Shen

Journals

Frontiers in Digital Health

SHILAP Revista de lepidopterología

Institutions

Hangzhou First People's Hospital

Tangshan People's Hospital
