Abstract:
OBJECTIVES: We aimed to evaluate the performance of a large language model (ChatGPT) in answering official sample questions from the Turkish Board of Emergency Medicine (TBEM). Two versions of the model, GPT-4 and GPT-4o, were assessed to compare accuracy and to examine consistency across repeated attempts.
METHODS: A cross-sectional observational study was conducted using 25 standardized multiple-choice questions publicly released by TBEM. Each question was manually entered into GPT-4 and GPT-4o through the OpenAI interface. Both models were prompted to select the single best answer from the provided options, without additional clarification or training context. Model responses were evaluated for accuracy, consistency upon repetition, and domain-specific error types. This study complies with the STROBE statement and the MedinAI reporting guidelines.
RESULTS: GPT-4 correctly answered 20 of 25 questions (80%) on the first attempt; on repetition, its score improved to 84% (21/25). GPT-4o achieved a score of 88% (22/25) on its first attempt and gave identical answers on a second evaluation. Errors occurred in the domains of trauma during pregnancy, pediatric resuscitation, and adult resuscitation protocols. Both models performed strongly in fact-based domains and in questions involving image descriptions.
CONCLUSION: GPT-4 and GPT-4o both scored above the TBEM passing threshold, showing solid accuracy across a range of emergency medicine topics, and both excelled in fact-based and image-related questions. However, they showed limitations in clinical reasoning, particularly in scenarios requiring nuanced judgment. These tools may support examination preparation but should not replace the expertise of trained emergency physicians.
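The percentages reported in the results follow directly from the counts of correct answers out of 25 questions. The Python sketch below reproduces that arithmetic and shows one simple way the consistency between two attempts could be measured. It is an illustration only, not the authors' analysis code: the function names `accuracy` and `agreement` are hypothetical, the count of 21 for the repeated GPT-4 attempt is inferred from the reported 84%, and the answer letters passed to `agreement` are made-up placeholders rather than study data.

```python
from typing import Sequence


def accuracy(correct: int, total: int) -> float:
    """Return the percentage of questions answered correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


def agreement(first: Sequence[str], second: Sequence[str]) -> float:
    """Return the fraction of questions answered identically in two attempts."""
    if len(first) != len(second) or not first:
        raise ValueError("attempts must be non-empty and of equal length")
    same = sum(a == b for a, b in zip(first, second))
    return same / len(first)


TOTAL_QUESTIONS = 25  # number of TBEM sample questions in the abstract

# Correct-answer counts taken from the abstract; 21 is inferred from 84%.
correct_counts = {
    "GPT-4, first attempt": 20,
    "GPT-4, repeated attempt": 21,
    "GPT-4o, first attempt": 22,
}

for label, correct in correct_counts.items():
    print(f"{label}: {accuracy(correct, TOTAL_QUESTIONS):.0f}%")

# Placeholder answer letters (NOT study data) to show how agreement works.
print(agreement(["A", "C", "B", "D"], ["A", "C", "B", "D"]))
```

An agreement of 1.0 corresponds to the behavior reported for GPT-4o, which gave the same answer to every question in both trials.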
Mustafa Can Güzelce
S. Özgür
İlker Şalli
Turkish Journal of Emergency Medicine
İzmir University of Economics
Izmir Tepecik Eğitim ve Araştırma Hastanesi
Güzelce et al. studied this question.
www.synapsesocial.com/papers/69d1fde4a79560c99a0a445a — DOI: https://doi.org/10.4103/tjem.tjem_262_25