ABSTRACT

Introduction
This study compares the performance of five major large language models (LLMs) on endodontics‐related questions from the Turkish Dentistry Specialization Entrance Exam (DUS) from 2017 to 2024: OpenAI ChatGPT 4o and o1, Anthropic Claude 3.5 Sonnet, Google Gemini Advanced 1.5 Pro, and Google Gemini Advanced 2.0 Experimental Advanced.

Method
A total of 93 text‐based questions were used to evaluate each model's accuracy in answering specialized dental knowledge queries.

Results
The results revealed significant differences among the models, with GPT o1 achieving the highest success rate (89.2%) and Gemini Advanced 1.5 Pro the lowest (67.7%). Performance varied by language, with GPT‐4o and GPT o1 showing improved accuracy post‐2022. Additionally, query repetition influenced model responses: some models exhibited self‐correction abilities, while others consistently maintained incorrect answers.

Conclusion
The study highlights the strengths and limitations of current LLMs in domain‐specific assessments, emphasizing the role of reasoning‐based architectures like GPT o1's Chain of Thought (CoT) methodology. These findings underscore the need for continued advancements in AI‐driven education, particularly in dental specialization exams. While LLMs show potential as supplementary tools in dental education, their integration into real‐world applications requires further validation to ensure reliability and domain‐specific proficiency.
Anil Ozgun Karatekin studied this question.