Abstract

Background
Large language models (LLMs) offer potential to augment clinical decision-making, but their role in supporting thyroid cancer multidisciplinary tumour board meetings (MDTs) remains uncertain. This study evaluated the performance of three LLMs (ChatGPT, MetaAI, and DeepSeek) in reproducing the management decisions of a regional thyroid cancer MDT.

Methods
One hundred thyroid cancer cases were reviewed by a regional MDT comprising consultant endocrine surgeons, a consultant radiologist, and a consultant histopathologist. MDT outcomes served as the reference standard. Each case was then submitted identically to the three LLMs using a standardised prompt that referenced up-to-date ATA, BTA and UICC guidelines. A 4-point concordance scale was applied: 3 = full agreement; 2 = acceptable alternative; 1 = third-line approach; 0 = discordant/overtreatment.

Results
ChatGPT achieved the highest number of fully concordant outputs (score 3: 75; score 2: 19; score 1: 2; score 0: 4), with 94% of its responses scoring 2 or 3. DeepSeek produced 64, 28, 3 and 5 responses at scores 3, 2, 1 and 0 respectively (92% scoring 2 or 3). MetaAI produced 61, 28, 4 and 7 (89% scoring 2 or 3). Despite identical prompts and guideline references, the models' outputs differed. ChatGPT demonstrated the strongest overall concordance; DeepSeek and MetaAI performed similarly well, with slightly higher discordance rates.

Conclusion
All three LLMs demonstrated high concordance with consultant-led MDT decisions in thyroid cancer management. While none can yet be considered reliable for autonomous clinical use, these findings highlight a promising future role for LLMs in MDT decision support and workflow streamlining once governance, validation, and regulatory frameworks mature.
White et al. (Fri,) studied this question.
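The concordance percentages reported in the Results can be checked directly from the per-score counts. The following is a minimal sketch rather than the authors' analysis code: the dictionary layout and the function name `acceptable_rate` are illustrative assumptions, and only the counts themselves come from the abstract.

```python
# Per-model tallies on the 4-point concordance scale, as reported in the abstract.
# Keys are scores (3 = full agreement ... 0 = discordant/overtreatment).
# The data structure and names here are hypothetical, chosen for illustration.
counts = {
    "ChatGPT": {3: 75, 2: 19, 1: 2, 0: 4},
    "DeepSeek": {3: 64, 2: 28, 1: 3, 0: 5},
    "MetaAI": {3: 61, 2: 28, 1: 4, 0: 7},
}


def acceptable_rate(tally):
    """Percentage of cases scored 2 or 3 (full agreement or acceptable alternative)."""
    total = sum(tally.values())
    return 100 * (tally[3] + tally[2]) / total


for model, tally in counts.items():
    n_cases = sum(tally.values())
    print(f"{model}: {n_cases} cases, {acceptable_rate(tally):.0f}% scoring 2 or 3")
```

Each tally sums to the 100 cases reviewed, and the computed rates match the 94%, 92% and 89% figures stated above.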