What question did this study set out to answer?

This research aims to assess ChatGPT's ability to correctly answer emergency medicine board examination questions.

April 5, 2026 · Open Access

Evaluation of ChatGPT’s performance on emergency medicine board examination questions

Key Points

This research aims to assess ChatGPT's ability to correctly answer emergency medicine board examination questions.
The authors conducted a cross-sectional observational study using 25 standardized questions from the Turkish Board of Emergency Medicine.
Questions were entered manually into two models, GPT-4 and GPT-4o, through the OpenAI interface.
Model responses were evaluated for accuracy and consistency, with attention to domain-specific errors.
GPT-4 answered 80% of questions accurately on the first attempt, improving to 84% on repetition.
GPT-4o achieved an 88% accuracy rate on its first attempt, showing consistency in repeated assessments.
Errors occurred in three domains: trauma during pregnancy, pediatric resuscitation, and adult resuscitation.

Abstract

OBJECTIVES: We aimed to evaluate the performance of a large language model (ChatGPT) in answering official sample questions from the Turkish Board of Emergency Medicine (TBEM). Two versions of the model, GPT-4 and GPT-4o, were assessed to explore consistency and accuracy across iterations.

METHODS: A cross-sectional observational study was conducted using 25 standardized multiple-choice questions publicly released by TBEM. Each question was manually entered into GPT-4 and GPT-4o through the OpenAI interface. Both models were prompted to select the single best answer from the provided options without additional clarification or training context. Model responses were evaluated for accuracy, consistency upon repetition, and domain-specific error types. This study is compliant with the STROBE statement and the MedinAI reporting guidelines.

RESULTS: GPT-4 correctly answered 20 out of 25 questions (80%) on the first attempt. On repetition, its score improved to 84%. GPT-4o achieved a score of 88% (22/25) on its first attempt and gave identical answers in a second evaluation. Errors occurred in the domains of trauma during pregnancy, pediatric resuscitation, and adult resuscitation protocols. Both models demonstrated strong performance in fact-based domains and in questions involving image descriptions.

CONCLUSION: GPT-4 and GPT-4o performed above the TBEM passing threshold, showing solid accuracy across a range of emergency medicine topics. Both excelled in fact-based and image-related questions. However, they showed limitations in clinical reasoning, particularly in scenarios requiring nuanced judgment. These tools may support examination preparation but should not replace the expertise of trained emergency physicians.
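The reported percentages follow directly from the counts over 25 questions. The sketch below is illustrative only and is not the authors' analysis code: the function names `accuracy` and `agreement` are hypothetical, the count of 21 correct answers on GPT-4's repeat run is derived from the stated 84% rather than reported, and the two short answer lists are made-up placeholders, not study data.

```python
def accuracy(correct: int, total: int) -> float:
    """Return the share of correctly answered questions as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


def agreement(first: list[str], second: list[str]) -> float:
    """Return the percentage of questions answered identically in two runs."""
    if len(first) != len(second) or not first:
        raise ValueError("runs must be non-empty and of equal length")
    same = sum(a == b for a, b in zip(first, second))
    return 100.0 * same / len(first)


TOTAL_QUESTIONS = 25  # number of TBEM sample questions in the abstract

gpt4_first = accuracy(20, TOTAL_QUESTIONS)   # 20/25 as reported
gpt4o_first = accuracy(22, TOTAL_QUESTIONS)  # 22/25 as reported

# Derived, not stated in the abstract: 84% of 25 questions is 21 correct.
gpt4_repeat_correct = round(0.84 * TOTAL_QUESTIONS)

# Placeholder answer letters (NOT study data) to show how consistency
# between two runs of the same model could be measured.
run_one = ["A", "C", "B", "D"]
run_two = ["A", "C", "B", "E"]
example_agreement = agreement(run_one, run_two)

print(gpt4_first, gpt4o_first, gpt4_repeat_correct, example_agreement)
```

With only 25 questions, each answer is worth 4 percentage points, so the gap between 80% and 88% corresponds to two questions.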

Authors

Mustafa Can Güzelce

S. Özgür

İlker Şalli

Journals

Turkish Journal of Emergency Medicine

Institutions

İzmir University of Economics

Izmir Tepecik Eğitim ve Araştırma Hastanesi
