What question did this study set out to answer?

Evaluate the performance of large language models on the Taiwan Neurology Board Examination.

March 7, 2026 | Open Access

Benchmarking Large Language Models on the Taiwan Neurology Board Examinations (2018–2024): A Comparative Evaluation of GPT-4o, GPT-o1, DeepSeek-V3, and DeepSeek-R1

Key Points

Analyzed 1715 questions from the Taiwan Neurology Board Examination (2018–2024)
Compared four LLMs: GPT-4o, GPT-o1, DeepSeek-V3, and DeepSeek-R1
Assessed performance across multiple formats: single-choice, multiple-choice, true-false, and image-based items
Conducted statistical analyses to evaluate inter-model differences (significant at p < 0.01)
GPT-o1 achieved the highest overall accuracy at 83.86%
DeepSeek-V3 had the lowest accuracy at 65.62% and the greatest variability
Accuracy declined for all models in 2024, coinciding with shifts in question design
DeepSeek-R1 lost up to 3.81% of its score to alignment-based refusals

Abstract

Background and Purpose: Neurology requires integration of clinical reasoning, imaging interpretation, and current knowledge, making it an ideal field for evaluating large language models (LLMs). Methods: Using 1715 questions from the Taiwan Neurology Board Examination (2018–2024), we assessed four LLMs—GPT-4o, GPT-o1, DeepSeek-V3, and DeepSeek-R1—across four formats: single-choice, multiple-choice, true–false, and image-based items. Results: GPT-o1 achieved the highest overall accuracy (83.86%) and demonstrated strong performance on cognitively demanding tasks (82.50% on true–false; 77.26% on image-based). DeepSeek-V3 scored lowest (65.62%) and showed the greatest variability. Statistical analyses confirmed significant inter-model differences (p < 0.01). Accuracy declined across all models in 2024, coinciding with shifts in question design. DeepSeek-R1 was further penalized by alignment-based refusals, resulting in up to 3.81% score loss. Conclusions: These results position the Taiwan Neurology Board Exam as a rigorous benchmark for LLM evaluation and underscore GPT-o1’s potential utility in neurology education and decision support.
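
The abstract reports accuracy per model and per question format, plus a separate score loss from refusals. It does not describe the scoring pipeline, so the following is only a minimal sketch of how such figures could be tallied. The record layout, the function names, and the choice to score a refusal (represented here as `None`) as an incorrect answer are all assumptions made for illustration, and the sample data is invented.

```python
from collections import defaultdict

# Hypothetical record layout (not from the paper):
#   (model, question_format, model_answer, answer_key)
# A model_answer of None stands for a refusal; this sketch scores it as incorrect.


def accuracy_by_format(records):
    """Return {(model, question_format): accuracy in percent}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for model, fmt, answer, key in records:
        total[(model, fmt)] += 1
        if answer is not None and answer == key:
            correct[(model, fmt)] += 1
    return {k: 100.0 * correct[k] / total[k] for k in total}


def refusal_loss(records, model):
    """Return the percentage of a model's items that were refused."""
    items = [r for r in records if r[0] == model]
    if not items:
        return 0.0
    refused = sum(1 for r in items if r[2] is None)
    return 100.0 * refused / len(items)


# Invented toy data, for illustration only.
sample = [
    ("model_a", "single-choice", "B", "B"),
    ("model_a", "single-choice", "C", "A"),
    ("model_a", "true-false", None, "T"),  # refusal
    ("model_a", "true-false", "F", "F"),
]

scores = accuracy_by_format(sample)
loss = refusal_loss(sample, "model_a")
```

Keeping refusals as a distinct value rather than folding them into wrong answers lets the same records yield both the accuracy and the share of score lost to refusals.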


Lin et al. studied this question.

synapsesocial.com/papers/69abc1d75af8044f7a4eacd4
https://doi.org/10.3390/bioengineering13030302
