Background: The cross-lingual and question-type variations affecting the accuracy of large language models (LLMs) on the Chinese national medical licensing examination remain insufficiently explored.

Methods: In this cross-sectional study (May 13–20, 2025), 396 questions (198 English–Chinese pairs) were extracted from the Chinese national medical licensing examination. ChatGPT-4o, ChatGPT-o3, Gemini-2.5-pro, Deepseek-V3, Deepseek-R1, and Doubao-1.5-pro were prompted to provide answers. Responses were compared against reference answers, and accuracy was computed for three question types: basic knowledge (Type A), case analysis (Type B), and integrative judgment (Type C).

Results: Across all question types and languages, Doubao-1.5-pro achieved the highest accuracy at 92.0% ± 1.3%, whereas ChatGPT-4o had the lowest accuracy at 82.8% ± 3.7%. There was a significant main effect of question type (P = 0.0038) but no main effect of language (P = 0.56). Post hoc tests showed that accuracy on Type A questions exceeded that on Types B and C (P < 0.01), whereas Types B and C did not differ significantly. Among the models, Doubao-1.5-pro, Deepseek-R1, and Deepseek-V3 demonstrated notable cross-lingual stability, with accuracy differences between the Chinese and English versions remaining below 5%.

Conclusion: Question type was a key factor affecting LLM performance on Chinese medical licensing examination questions, whereas language had no significant impact. Doubao-1.5-pro, Deepseek-R1, and Deepseek-V3 demonstrated particularly strong cross-lingual consistency. These findings point to the potential value of specialized LLMs for enhancing medical education in China.
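The scoring step described in the Methods (comparing each response with its reference answer, then computing accuracy per question type and per language) could be sketched as follows. This is a minimal illustration, not the authors' code: the record fields, the `accuracy_by_group` helper, the model name `model_x`, and the four toy records are all hypothetical and carry no data from the study.

```python
from collections import defaultdict


def accuracy_by_group(records, keys):
    """Return {group: fraction of correct answers}, grouping records by `keys`.

    Hypothetical helper: a response counts as correct when its `answer`
    field equals the `reference` field exactly.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for rec in records:
        group = tuple(rec[k] for k in keys)
        totals[group] += 1
        if rec["answer"] == rec["reference"]:
            correct[group] += 1
    return {group: correct[group] / totals[group] for group in totals}


# Made-up toy records for illustration only (not results from the paper).
records = [
    {"model": "model_x", "language": "zh", "qtype": "A", "answer": "B", "reference": "B"},
    {"model": "model_x", "language": "en", "qtype": "A", "answer": "B", "reference": "B"},
    {"model": "model_x", "language": "zh", "qtype": "B", "answer": "C", "reference": "D"},
    {"model": "model_x", "language": "en", "qtype": "B", "answer": "D", "reference": "D"},
]

# Accuracy per (model, question type) and per (model, language).
by_type = accuracy_by_group(records, ["model", "qtype"])
by_language = accuracy_by_group(records, ["model", "language"])

# Cross-lingual gap for one model: absolute difference in accuracy
# between the Chinese and English versions of the questions.
gap = abs(by_language[("model_x", "zh")] - by_language[("model_x", "en")])
```

Grouping by a list of keys keeps one function usable for both the question-type and the language breakdown; a significance test for the main effects would be a separate step that this sketch does not attempt.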
Tang et al. studied this question.
www.synapsesocial.com/papers/69d895ea6c1944d70ce07153 — DOI: https://doi.org/10.1371/journal.pone.0346518
Yuxia Tang
Jian Chen
Shouju Wang
PLoS ONE