What question did this study set out to answer?

The central aim is to compare the performance of five large language models on endodontics questions from a dentistry specialization exam.

March 21, 2026

Comparison and Review of Different Versions of OpenAI Chat GPT, Anthropic Claude and Google Gemini Large Language Models' Performance on Endodontics Questions in the Turkish Dentistry Specialization Exam

Key Points

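Evaluated five LLMs: OpenAI ChatGPT 4o and o1, Anthropic Claude 3.5 Sonnet, Google Gemini Advanced 1.5 Pro, and Google Gemini Advanced 2.0 Experimental Advanced.
Used 93 text-based endodontics questions from the Turkish Dentistry Specialization Entrance Exam (DUS), 2017 to 2024.
Assessed the accuracy of each model's answers to these specialized dental knowledge questions.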
ChatGPT o1 achieved the highest accuracy at 89.2%; Gemini Advanced 1.5 Pro scored lowest at 67.7%.
Performance varied by language; GPT-4o and GPT o1 showed improved accuracy post-2022.
Repeating a query influenced responses: some models self-corrected, while others consistently maintained incorrect answers.

Abstract

Introduction: This study compares the performance of five major large language models (LLMs), namely OpenAI ChatGPT 4o and o1, Anthropic Claude 3.5 Sonnet, Google Gemini Advanced 1.5 Pro, and Google Gemini Advanced 2.0 Experimental Advanced, on endodontics-related questions from the Turkish Dentistry Specialization Entrance Exam (DUS) from 2017 to 2024.

Method: A total of 93 text-based questions were used to evaluate each model's accuracy in answering specialized dental knowledge queries.

Results: The results revealed significant differences among the models, with GPT o1 achieving the highest success rate (89.2%) and Gemini Advanced 1.5 Pro the lowest (67.7%). Performance varied by language, with GPT-4o and GPT o1 showing improved accuracy post-2022. Additionally, query repetition influenced model responses, with some models exhibiting self-correction abilities, while others consistently maintained incorrect answers.

Conclusion: The study highlights the strengths and limitations of current LLMs in domain-specific assessments, emphasizing the role of reasoning-based architectures like GPT o1's Chain of Thought (CoT) methodology. These findings underscore the need for continued advancements in AI-driven education, particularly in dental specialization exams. While LLMs show potential as supplementary tools in dental education, their integration into real-world applications requires further validation to ensure reliability and domain-specific proficiency.
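To make the reported percentages concrete, the short sketch below works backward from a rounded accuracy to the number of correct answers out of 93. The counts it produces are not stated in the source; they are inferred purely from the arithmetic, and the function name `consistent_counts` is a hypothetical helper introduced for illustration.

```python
def consistent_counts(reported_pct: float, total: int, decimals: int = 1) -> list[int]:
    """Return every correct-answer count k (0..total) whose accuracy,
    rounded to `decimals` places, equals the reported percentage."""
    matches = []
    for k in range(total + 1):
        pct = round(100 * k / total, decimals)
        # Compare with a small tolerance to avoid floating-point surprises.
        if abs(pct - reported_pct) < 1e-9:
            matches.append(k)
    return matches


TOTAL_QUESTIONS = 93  # from the abstract

# Reported accuracies from the abstract; the counts are inferred, not quoted.
highest = consistent_counts(89.2, TOTAL_QUESTIONS)  # GPT o1
lowest = consistent_counts(67.7, TOTAL_QUESTIONS)   # Gemini Advanced 1.5 Pro

print("GPT o1 correct answers consistent with 89.2%:", highest)
print("Gemini Advanced 1.5 Pro correct answers consistent with 67.7%:", lowest)
```

Under this reading, each reported percentage matches exactly one count (83 of 93 and 63 of 93, respectively), so the gap between the best and worst model would be 20 questions.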
