What question did this study set out to answer?

The study evaluated how accurately, consistently and readably chatbots answer patient questions about broken endodontic instruments.

April 15, 2026 · Open Access

Large language models’ performances regarding common patient questions about broken endodontic instruments: a comparative analysis of ChatGPT-5.2, Gemini 3 and DeepSeek V3.2 in accuracy, consistency and readability

Key Points

Five chatbots were tested: ChatGPT-5.2, ChatGPT-5.2 Plus, Gemini 3.0, Gemini 3.0 Plus, and DeepSeek V3.2.
Twenty-two questions related to broken endodontic instruments were developed by experienced endodontists.
Responses were evaluated for accuracy on a 1–5 scale, with consistency assessed using the coefficient of variation.
Readability was measured using multiple indices, including Flesch-Kincaid and Gunning Fog Score.
Accuracy differed significantly among chatbots (p < 0.05), with ChatGPT-5.2 significantly less accurate than the other models (p < 0.001).
Accuracy was higher on day 2 than on the other days (p < 0.001).
Gemini 3 Plus and DeepSeek V3.2 exhibited higher consistency than ChatGPT-5.2.
ChatGPT-5.2 Plus generated more readable content, whereas Gemini and DeepSeek V3.2 required higher reading grade levels.
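
The consistency measure named in the key points, the coefficient of variation, is the standard deviation of repeated scores divided by their mean. The following is a minimal sketch of that calculation, not the authors' analysis code. It assumes the sample standard deviation (the page does not say whether the sample or population form was used), and the function name and example scores are hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(scores):
    """Return the coefficient of variation of repeated scores, in percent.

    Assumption: uses the sample standard deviation (n - 1 denominator).
    A lower value means the repeated answers were scored more consistently.
    """
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    average = mean(scores)
    if average == 0:
        raise ValueError("mean is zero, CV is undefined")
    return stdev(scores) / average * 100


# Hypothetical 1-5 accuracy scores for one question asked 15 times
# (5 days x 3 times per day); these numbers are illustrative only.
example_scores = [4, 5, 4, 4, 5, 4, 4, 5, 5, 4, 4, 4, 5, 4, 4]
example_cv = coefficient_of_variation(example_scores)
```

Because the coefficient of variation is normalized by the mean, it allows a chatbot with high average scores to be compared with one that has low average scores on the same relative scale.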

Abstract

This study evaluated and compared how five widely used chatbots (ChatGPT-5.2, ChatGPT-5.2 Plus, Gemini 3.0, Gemini 3.0 Plus and DeepSeek V3.2) answer patient questions about broken endodontic instruments in root canals.

Twenty-two questions were formulated by two endodontists, each with eight years of experience in instrument removal procedures, based on their clinical expertise and educational materials from the American Association of Endodontists (AAE). The questions were posed to the chatbots over five days, at three times each day (morning, afternoon and evening). Two blinded evaluators independently scored each response for accuracy on a 1–5 scale, and disagreements were resolved through evidence-based discussion. The coefficient of variation (CV) was calculated to evaluate the consistency of each chatbot's repeated responses. Readability was evaluated using the Flesch-Kincaid Reading Ease Score, Flesch-Kincaid Grade Level, Gunning Fog Score and SMOG Index.

Accuracy differed significantly among the chatbots (p < 0.05), with ChatGPT-5.2 less accurate than the other models (p < 0.001). Accuracy was higher on day 2 than on the other days (p < 0.001). Consistency also differed significantly among models (p < 0.05), with Gemini 3 Plus and DeepSeek V3.2 showing higher consistency than ChatGPT-5.2. ChatGPT-5.2 Plus generated more readable responses, whereas Gemini and DeepSeek V3.2 required higher reading grade levels.

Large language model (LLM)-based chatbots showed model-dependent differences in accuracy, consistency and readability. While Gemini 3, Gemini 3 Plus and DeepSeek V3.2 performed better in accuracy and consistency, ChatGPT-5.2 and ChatGPT-5.2 Plus provided more readable content, highlighting the need for cautious and selective use of these tools in patient education.
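
For readers unfamiliar with the four readability indices named in the abstract, the sketch below shows their standard published formulas as plain functions of text counts. It is an illustration, not the tool the authors used, which the page does not name. The function names are hypothetical, and the counts of words, sentences, syllables and complex (three or more syllable) words are assumed to be supplied by the caller, since syllable counting rules vary between tools.

```python
import math


def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words, sentences, complex_words):
    """Gunning Fog index; complex_words have three or more syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


def smog_index(polysyllables, sentences):
    """SMOG index; polysyllable count is normalized to 30 sentences."""
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291


# Hypothetical counts for a single chatbot response (illustrative only).
counts = {"words": 100, "sentences": 5, "syllables": 150, "complex_words": 10}
ease = flesch_reading_ease(counts["words"], counts["sentences"], counts["syllables"])
grade = flesch_kincaid_grade(counts["words"], counts["sentences"], counts["syllables"])
fog = gunning_fog(counts["words"], counts["sentences"], counts["complex_words"])
```

Note the opposite directions of the scales: a higher Reading Ease score indicates easier text, while higher Grade Level, Gunning Fog and SMOG values indicate that more years of schooling are needed.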

Cite this study

Sümbüllü et al. studied this question.

www.synapsesocial.com/papers/69df2ae6e4eeef8a2a6afe70 — DOI: https://doi.org/10.1186/s12903-026-08327-1

Authors

Meltem Sümbüllü

Elham Othman Adam

Emine Araz Altun

Journals

BMC Oral Health

Institutions

Atatürk University

Istanbul Medeniyet University
