What question did this study set out to answer?

This study aims to evaluate how well three large language models can reproduce the management decisions of a thyroid cancer multidisciplinary tumor board.

May 20, 2026

Assessing the performance of three large language models in thyroid cancer tumour board decision-making

Key Points

This study aims to evaluate how well three large language models can reproduce the management decisions of a thyroid cancer multidisciplinary tumor board.
A regional multidisciplinary team (MDT) reviewed 100 thyroid cancer cases, and its decisions served as the reference standard.
Each case was submitted identically to three large language models using a standardised prompt referencing ATA, BTA and UICC guidelines.
A 4-point concordance scale (0 to 3) was applied to assess agreement with MDT decisions.
ChatGPT showed the highest concordance, with 94% of its responses scoring 2 or 3 on the concordance scale.
DeepSeek and MetaAI also performed strongly, with 92% and 89% of responses scoring 2 or 3, respectively.
The models' outputs differed from one another despite identical prompts and guideline references.

Abstract

Background

Large language models (LLMs) offer exciting potential to augment clinical decision-making, but their role in supporting thyroid cancer multidisciplinary tumour board meetings (MDTs) remains uncertain. This study evaluated the performance of three LLMs (ChatGPT, MetaAI, and DeepSeek) in reproducing the management decisions of a regional thyroid cancer MDT.

Methods

One hundred thyroid cancer cases were reviewed by a regional MDT comprising consultant endocrine surgeons, a consultant radiologist, and a consultant histopathologist. MDT outcomes served as the reference standard. Each case was then submitted identically to the three LLMs using a standardised prompt that referenced up-to-date ATA, BTA and UICC guidelines. A 4-point concordance scale was applied: 3 = full agreement; 2 = acceptable alternative; 1 = third-line approach; 0 = discordant/overtreatment.

Results

ChatGPT achieved the highest number of fully concordant outputs: 75 cases scored 3, 19 scored 2, 2 scored 1, and 4 scored 0, so 94% of its responses scored 2 or 3. DeepSeek produced 64 cases scoring 3, 28 scoring 2, 3 scoring 1, and 5 scoring 0 (92% scoring 2 or 3). MetaAI produced 61 cases scoring 3, 28 scoring 2, 4 scoring 1, and 7 scoring 0 (89% scoring 2 or 3). Despite identical prompts and guideline references, the models varied in their outcomes. ChatGPT demonstrated the strongest overall concordance; DeepSeek and MetaAI showed similarly strong performance with slightly higher discordance rates.

Conclusion

All three LLMs demonstrated high concordance with consultant-led MDT decisions in thyroid cancer management. While none can yet be considered reliable for autonomous clinical use, these findings highlight the promising future role of LLMs in MDT decision support and workflow streamlining once governance, validation, and regulatory frameworks mature.
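The headline percentages follow directly from the reported score counts. The short Python sketch below recomputes them as a check; the score counts are taken from the abstract, while the dictionary layout and the helper name `acceptable_rate` are illustrative assumptions, not part of the study's methods.

```python
# Score counts per model as reported in the abstract.
# Keys are concordance scores: 3 = full agreement, 2 = acceptable alternative,
# 1 = third-line approach, 0 = discordant/overtreatment.
counts = {
    "ChatGPT": {3: 75, 2: 19, 1: 2, 0: 4},
    "DeepSeek": {3: 64, 2: 28, 1: 3, 0: 5},
    "MetaAI": {3: 61, 2: 28, 1: 4, 0: 7},
}


def acceptable_rate(score_counts):
    """Return the percentage of cases scoring 2 or 3 (hypothetical helper)."""
    total = sum(score_counts.values())
    return 100 * (score_counts[2] + score_counts[3]) / total


for model, score_counts in counts.items():
    total = sum(score_counts.values())
    rate = acceptable_rate(score_counts)
    print(f"{model}: {total} cases, {rate:.0f}% scoring 2 or 3")
```

Each model's counts sum to the 100 reviewed cases, so the percentage scoring 2 or 3 is simply the sum of the two highest categories.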
