What question did this study set out to answer?

The study aims to evaluate the performance of four AI chatbots in successfully answering MCQs focused on operative dentistry.

April 10, 2026

Open Access

Performance of 4 artificial intelligence chatbots in responding to multiple choice questions in operative dentistry

Key Points

The study aims to evaluate the performance of four AI chatbots in successfully answering MCQs focused on operative dentistry.
Development of 150 multiple-choice questions based on operative dentistry textbooks, screened by a fourth expert to a final set of 110.
Evaluation of responses from GPT-4o, Grok 3, Gemini Advanced, and Claude 3.7 Sonnet across two rounds.
Statistical analysis of intra- and inter-chatbot consistency using the McNemar test and Cohen's Kappa.
Grok 3 and Gemini Advanced achieved 86.4% correct answers in the first round.
GPT-4o and Claude 3.7 Sonnet had 85.5% correct answers initially.
Performance improved in the second round for GPT-4o (87.3%) and Claude 3.7 Sonnet (91.8%).
Intra-chatbot consistency varied from fair to substantial, with GPT-4o showing the highest consistency.

Abstract

The accuracy, consistency, and dependability of artificial intelligence (AI)-based chatbots in dental education are open to question. This study aimed to evaluate the performance of four chatbots in answering multiple-choice questions (MCQs) in operative dentistry. Drawing on operative dentistry textbooks, a three-member expert panel developed 150 MCQs, which a fourth expert screened to yield a final set of 110 MCQs. These questions were input into GPT-4o, Grok 3, Gemini Advanced and Claude 3.7 Sonnet in two rounds one week apart. Performance was measured as the proportion of correct answers. Inter- and intra-chatbot consistency was analysed using the McNemar test and Cohen’s Kappa. In the first round, Grok 3 and Gemini Advanced answered 86.4% of the MCQs correctly, while GPT-4o and Claude 3.7 Sonnet answered 85.5% correctly. In the second round, the performance of GPT-4o and Claude 3.7 Sonnet improved to 87.3% and 91.8% correct, respectively. Intra-chatbot consistency ranged from fair (Kappa = 0.33) for Claude 3.7 Sonnet to substantial for GPT-4o. Inter-chatbot consistency ranged from 0.34 to 0.54 in the first round and from 0.44 to 0.66 in the second round. The assessed chatbots showed promising performance in answering MCQs in operative dentistry and improved over time. They can be used as adjuncts in operative dentistry education, provided their inherent limitations are carefully considered. Determining the accuracy and, consequently, the dependability of the most widely used AI-based chatbots in responding to dental queries is essential for dental students. Dental students must interpret chatbots’ responses with caution and use them as supplementary tools alongside standard resources such as textbooks and guidance from mentors.
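
To make the consistency statistics named in the abstract concrete, the following is a minimal Python sketch, not the authors' analysis code. The function names (`percent_correct`, `cohen_kappa`, `mcnemar_exact`) and the ten-item `round1`/`round2` vectors are hypothetical and chosen for illustration only; the toy data are not taken from the study. The abstract does not say which form of the McNemar test was used, so the exact binomial version is an assumption. The answer counts in the final comment are inferred from the reported percentages and the 110-question total; the abstract itself does not state them.

```python
from math import comb


def percent_correct(correct, total):
    """Share of correct answers as a percentage, rounded to one decimal."""
    return round(100 * correct / total, 1)


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    n = len(a)
    labels = set(a) | set(b)
    # Observed agreement: fraction of items with the same label in both runs.
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the marginal label frequencies.
    p_expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if p_expected == 1:
        return float("nan")  # kappa is undefined when only one label occurs
    return (p_observed - p_expected) / (1 - p_expected)


def mcnemar_exact(a, b):
    """Two-sided exact (binomial) McNemar p-value for paired binary outcomes."""
    n01 = sum(1 for x, y in zip(a, b) if not x and y)  # wrong, then right
    n10 = sum(1 for x, y in zip(a, b) if x and not y)  # right, then wrong
    n = n01 + n10
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a change
    k = min(n01, n10)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical toy data: True means the question was answered correctly.
round1 = [True] * 8 + [False] * 2
round2 = [True] * 7 + [False, True, False]

print("kappa:", cohen_kappa(round1, round2))
print("McNemar p:", mcnemar_exact(round1, round2))

# Inferred, not reported: 95, 94, 96 and 101 correct answers out of 110
# reproduce the percentages 86.4, 85.5, 87.3 and 91.8 given in the abstract.
print([percent_correct(c, 110) for c in (95, 94, 96, 101)])
```

Kappa measures agreement beyond chance, so two runs can have nearly the same accuracy and still agree only weakly on which questions they get right. The McNemar test looks only at the discordant pairs, meaning questions answered correctly in one run but not in the other.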

Authors

Thilla Sekar Vinothkumar

Syed Nahid Basheer

Sabari Murugesan

Journals

BMC Oral Health

Institutions

Primary Health Care

Jazan University
