Background: Large language models (LLMs) have shown growing potential for medical education and assessment, but evidence on their performance in specialty certification exams in China, particularly in ultrasound medicine, remains limited.
Objective: To compare the performance of ChatGPT-5 and DeepSeek on the Chinese Ultrasound Medicine Senior Professional Title Examination, overall and by item type.
Methods: Between August and September 2025, we randomly selected 100 multiple-choice questions from the official Chinese Ultrasound Medicine Senior Professional Title Examination bank (60 image-based interpretation items and 40 text-based items). We evaluated ChatGPT-5 and DeepSeek using identical prompts through their public web interfaces. The primary outcome was overall accuracy; secondary outcomes were accuracy by item type and by subspecialty. Between-model differences were assessed using two-proportion z-tests (α = 0.05) in Python 3.12.
Results: Overall accuracy was higher for ChatGPT-5 than for DeepSeek (74.0% [74/100] vs. 60.0% [60/100]; p = 0.035). Accuracy on image-based items was also higher for ChatGPT-5 (61.7% vs. 40.0%; p = 0.018). Performance on text-based items was similar for both models (92.5% vs. 90.0%). Subspecialty patterns varied across domains; however, no between-model differences at the subspecialty level reached statistical significance.
Conclusions: ChatGPT-5 outperformed DeepSeek on image-based items (61.7% vs. 40.0%), while both models performed similarly on text-based knowledge items (92.5% vs. 90.0%). Overall, both LLMs showed strong performance on Chinese ultrasound senior-title examination questions, with complementary strengths across content areas. They may be useful as supplementary educational tools, but further advances in multimodal reasoning are needed to support more reliable image interpretation.
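The two-proportion z-test named in the Methods can be illustrated with a minimal sketch. The abstract does not say which library or variance estimate the authors used, so the pooled-variance, two-sided form below is an assumption, and the function name `two_proportion_z_test` is hypothetical. The overall counts (74/100 and 60/100) come from the Results; the image-based counts (37/60 and 24/60) are back-calculated from the reported percentages and are likewise an assumption.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided, pooled two-proportion z-test. Returns (z, p_value).

    Assumption: pooled standard error under H0 (equal proportions);
    the abstract does not state which variant the authors applied.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value for a standard normal: P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Overall accuracy: counts reported in the Results.
z_overall, p_overall = two_proportion_z_test(74, 100, 60, 100)

# Image-based items: counts back-calculated from 61.7% and 40.0% of 60 items
# (an assumption, not figures stated in the abstract).
z_image, p_image = two_proportion_z_test(37, 60, 24, 60)

print(f"overall: z = {z_overall:.2f}, p = {p_overall:.3f}")
print(f"image-based: z = {z_image:.2f}, p = {p_image:.3f}")
```

Under these assumptions the sketch gives p-values close to the reported 0.035 and 0.018, which suggests, but does not confirm, that a pooled two-sided test was used.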
Hong et al. studied this question.