What question did this study set out to answer?

The aim is to compare the performance of two language models, ChatGPT-5 and DeepSeek, on a certification exam in ultrasound medicine.

March 12, 2026 · Open Access

Comparative performance of ChatGPT-5 and DeepSeek on the Chinese ultrasound medicine senior professional title examination

Key Points

Randomly selected 100 multiple-choice questions from the official exam bank.
Evaluated both models with identical prompts through their public web interfaces.
Analyzed overall accuracy and accuracy by item type and subspecialty.
Conducted statistical analysis using two-proportion z-tests in Python.
ChatGPT-5 achieved higher overall accuracy than DeepSeek (74.0% vs. 60.0%).
ChatGPT-5 outperformed DeepSeek on image-based items (61.7% vs. 40.0%).
Both models showed similar performance on text-based items (92.5% vs. 90.0%).
Subspecialty patterns varied across domains, but no between-model difference reached statistical significance.

Abstract

Background: Large language models (LLMs) have shown growing potential for medical education and assessment, but evidence on their performance in specialty certification exams in China, particularly in ultrasound medicine, remains limited.

Objective: To compare the performance of ChatGPT-5 and DeepSeek on the Chinese Ultrasound Medicine Senior Professional Title Examination, overall and by item type.

Methods: Between August and September 2025, we randomly selected 100 multiple-choice questions from the official Chinese Ultrasound Medicine Senior Professional Title Examination bank (60 image-based interpretation items and 40 text-based items). We evaluated ChatGPT-5 and DeepSeek using identical prompts through their public web interfaces. The primary outcome was overall accuracy; secondary outcomes were accuracy by item type and subspecialty. Between-model differences were assessed using two-proportion z-tests (α = 0.05) in Python 3.12.

Results: Overall accuracy was higher for ChatGPT-5 than for DeepSeek: 74.0% (74/100) vs. 60.0% (60/100); p = 0.035. Accuracy on image-based items was also higher for ChatGPT-5 (61.7% vs. 40.0%; p = 0.018). Performance on text-based items was similar for both models (92.5% vs. 90.0%). Subspecialty patterns varied across domains; however, no between-model differences reached statistical significance.

Conclusions: ChatGPT-5 outperformed DeepSeek on image-based items (61.7% vs. 40.0%), while both models performed similarly on text-based knowledge items (92.5% vs. 90.0%). Overall, both LLMs showed strong performance on Chinese ultrasound senior-title examination questions, with complementary strengths across content areas. They may be useful as supplementary educational tools, but further advances in multimodal reasoning are needed to support more reliable image interpretation.
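The two-proportion z-test named in the Methods can be sketched as follows. This is a minimal illustration using only the Python standard library, not the authors' code: the abstract does not say which package was used, and the function name `two_proportion_z_test` is hypothetical. It assumes a pooled standard error and a two-sided alternative. The overall counts (74/100 and 60/100) are from the abstract; the image-based counts (37/60 and 24/60) are inferred here from the reported 61.7% and 40.0% of 60 items.

```python
import math


def two_proportion_z_test(correct_a, n_a, correct_b, n_b):
    """Two-sided, pooled two-proportion z-test. Returns (z, p_value)."""
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    # Pooled proportion under the null hypothesis of equal accuracy.
    pooled = (correct_a + correct_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; the test is undefined")
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Overall accuracy: counts stated in the abstract.
z_overall, p_overall = two_proportion_z_test(74, 100, 60, 100)
# Image-based items: counts inferred from the reported percentages (assumption).
z_image, p_image = two_proportion_z_test(37, 60, 24, 60)

print(f"overall: z = {z_overall:.2f}, p = {p_overall:.3f}")
print(f"image:   z = {z_image:.2f}, p = {p_image:.3f}")
```

Under these assumptions the sketch gives p ≈ 0.035 for overall accuracy and p ≈ 0.018 for image-based items, which agrees with the values reported in the Results.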

Cite This Study

Hong et al. (2026) studied this question.

synapsesocial.com/papers/69b2581996eeacc4fcec7697 https://doi.org/10.3389/fdgth.2026.1783347
