What question did this study set out to answer?

This research aims to compare the accuracy and empathy of AI-based chatbots when responding to temporomandibular dysfunction queries.

February 26, 2026 | Open Access

Accuracy and empathy of AI-based conversational chatbots in response to temporomandibular dysfunction related queries

Key Points

This research aims to compare the accuracy and empathy of AI-based chatbots when responding to temporomandibular dysfunction queries.
A multidisciplinary panel of experts developed temporomandibular dysfunction (TMD)-related questions and categorized them into five clinical domains.
Three AI chatbots (ChatGPT GPT-3.5, Claude 3.5 Sonnet, and DeepSeek R1) generated responses to the 14 questions.
Subject matter experts rated the responses for accuracy using the Accuracy of Information (AOI) index and for empathy on a 3-point scale.
A BERT-based model was trained on the EPITOME dataset for automated empathy detection.
DeepSeek R1 showed the highest response accuracy among the chatbots.
Overall response accuracy was high but varied across clinical domains.
Expert empathy assessments showed moderate reliability (correlation ~0.6).
The BERT empathy model showed strong concordance with expert judgments for high-empathy responses but lower agreement for low-empathy ones.
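
The last point, that model agreement with experts differs by empathy level, can be illustrated with a per-class agreement calculation. This is a minimal sketch only: the source does not state how concordance was computed, and the function name, label names, and example data below are all hypothetical.

```python
from collections import defaultdict


def per_class_agreement(expert_labels, model_labels):
    """Fraction of items in each expert-assigned class that the model labels the same way.

    Assumption: one consensus expert label and one model label per response.
    """
    if len(expert_labels) != len(model_labels):
        raise ValueError("label lists must have the same length")
    total = defaultdict(int)
    matched = defaultdict(int)
    for expert, model in zip(expert_labels, model_labels):
        total[expert] += 1
        if expert == model:
            matched[expert] += 1
    return {label: matched[label] / total[label] for label in total}


# Hypothetical labels on a 3-point scale; not data from the study.
expert = ["high", "high", "high", "low", "low", "medium"]
model = ["high", "high", "high", "medium", "low", "medium"]
agreement = per_class_agreement(expert, model)
```

Reporting agreement separately for each class, rather than as a single overall figure, is what makes a gap between high-empathy and low-empathy responses visible.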

Abstract

This study compared the accuracy and empathy of responses generated by artificial intelligence (AI)-based chatbots to commonly asked temporomandibular dysfunction (TMD)-related questions. It also tested the performance of an automated text-based empathy detection model against the judgments of subject matter experts (SMEs).

TMD-related questions (n = 14) were developed by a multidisciplinary panel of SMEs and categorized into five clinical domains (Diagnosis and testing, Causes and aggravating factors, Symptoms and associated issues, Treatment options, and Management and prognosis). Free-tier implementations of three AI-based chatbots, ChatGPT GPT-3.5 (CG), Claude 3.5 Sonnet (CD), and DeepSeek R1 (DS), were prompted to generate responses to these questions. The SMEs (n = 8) rated the responses for accuracy using the Accuracy of Information (AOI) index and for empathy using a 3-point scale. To complement the expert assessments, a Bidirectional Encoder Representations from Transformers (BERT)-based empathy detection model was trained on the Empathy in Textual Online Medical Exchanges (EPITOME) dataset and validated against the SME ratings.

DS generated responses with the highest word count (573.6 ± 132.7), significantly more than CG (263.4 ± 63.5) and CD (186.6 ± 25.6). DS also had the highest accuracy across all clinical domains. Overall accuracy of the responses generated by the three chatbots was high, although accuracy varied with the clinical domain of the question. Empathy assessments revealed moderate reliability (correlation ~0.6) among SMEs. The BERT model showed strong concordance with SME judgments for high-empathy responses but lower agreement for low-empathy categorizations.

AI chatbots show promise in providing accurate information regarding TMDs, but their ability to convey empathy remains limited.
The observed differences in accuracy and empathy among the three AI chatbots examined are based on a limited dataset and should therefore be interpreted with caution. Current AI chatbots represent an intermediate stage of development, demonstrating adequate technical proficiency while remaining constrained in addressing the humanistic dimensions of patient care. Although empathy detection models may inform future development, significant challenges in empathetic communication persist.

• AI-based conversational chatbots are accurate sources of information for temporomandibular disorder-related queries.
• Variability in accuracy was observed in certain clinical domains despite the high overall chatbot accuracy.
• A novel AI model for automated detection of empathy in text was developed.
• For rating empathy in the chatbot responses, the correlation among subject matter experts (SMEs) was moderate.
• The AI model developed aligned well with SME judgments for high-empathy responses, but less so for low-empathy ones.
• Although AI chatbots provide accurate TMD information, empathetic delivery remains a challenge.
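
The moderate inter-rater correlation reported for the empathy ratings can be illustrated with a mean pairwise correlation across raters. The sketch below assumes Pearson correlation and uses made-up ratings; the source does not specify which reliability statistic was used, so both the choice of statistic and the function names are assumptions for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length rating sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / sqrt(var_x * var_y)


def mean_pairwise_correlation(ratings):
    """Average Pearson correlation over every pair of raters."""
    pairs = list(combinations(ratings, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Hypothetical 3-point empathy ratings (one row per rater, one column per response).
ratings = [
    [1, 2, 3, 2, 1, 3],
    [1, 2, 3, 3, 1, 2],
    [2, 2, 3, 2, 1, 3],
]
reliability = mean_pairwise_correlation(ratings)
```

For ordinal ratings like these, a rank-based or chance-corrected statistic may be more appropriate than Pearson correlation; the sketch shows only the general shape of the calculation.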
