To compare the accuracy and empathy of responses generated by artificial intelligence (AI)-based chatbots to commonly asked temporomandibular dysfunction (TMD)-related questions and, additionally, to test the performance of an automated text-based empathy detection model against subject matter expert (SME) judgments.

TMD-related questions (n = 14) were developed by a multidisciplinary panel of SMEs and categorized into five clinical domains (Diagnosis and testing, Causes and aggravating factors, Symptoms and associated issues, Treatment options, and Management and prognosis). Free-tier implementations of three AI-based chatbots, ChatGPT GPT-3.5 (CG), Claude 3.5 Sonnet (CD), and DeepSeek R1 (DS), were prompted to generate responses to these questions. The SMEs (n = 8) rated the responses for accuracy using the Accuracy of Information (AOI) index and for empathy using a 3-point scale. To complement the expert assessments, a Bidirectional Encoder Representations from Transformers (BERT)-based empathy detection model was trained on the Empathy in Textual Online Medical Exchanges (EPITOME) dataset and validated against the SME ratings.

DS generated responses with the highest word count (573.6 ± 132.7), significantly more than CG (263.4 ± 63.5) and CD (186.6 ± 25.6). DS also had the highest accuracy across all clinical domains. The overall accuracy of the responses generated by the three chatbots was high, although accuracy varied with the clinical domain of the question. Empathy ratings showed moderate reliability among the SMEs (correlation ~0.6). The BERT model showed strong concordance with SME judgments for high-empathy responses but lower agreement for low-empathy categorizations.

AI chatbots show promise in providing accurate information regarding TMDs, but their ability to convey empathy remains limited.
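The comparison between the empathy detection model and the SME ratings can be illustrated with a per-class agreement calculation. The following is a minimal sketch, not the authors' evaluation procedure: the function name `per_class_agreement`, the 0/1/2 encoding of the 3-point scale, and the toy labels are all assumptions made for illustration and are not data from the study.

```python
from collections import defaultdict


def per_class_agreement(sme_labels, model_labels):
    """Return, for each SME-assigned class, the fraction of items
    that the model labeled identically (hypothetical helper)."""
    if len(sme_labels) != len(model_labels):
        raise ValueError("label lists must have the same length")

    totals = defaultdict(int)   # items per SME class
    matches = defaultdict(int)  # items where model agrees with SME

    for sme, model in zip(sme_labels, model_labels):
        totals[sme] += 1
        if sme == model:
            matches[sme] += 1

    return {label: matches[label] / totals[label] for label in totals}


# Toy labels only (assumed encoding: 0 = low, 1 = medium, 2 = high empathy).
sme_ratings = [2, 2, 2, 2, 0, 0, 1, 1]
model_ratings = [2, 2, 2, 2, 1, 0, 1, 0]

agreement = per_class_agreement(sme_ratings, model_ratings)
for label in sorted(agreement):
    print(f"class {label}: agreement = {agreement[label]:.2f}")
```

Reporting agreement separately for each class, rather than as a single overall figure, makes it possible to see a pattern such as strong concordance for high-empathy responses alongside weaker concordance for low-empathy ones.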
The observed differences in accuracy and empathy among the three AI chatbots examined are based on a limited dataset and should therefore be interpreted with caution. Current AI chatbots represent an intermediate stage of development, demonstrating adequate technical proficiency while remaining constrained in addressing the humanistic dimensions of patient care. Although empathy detection models may inform future development, significant challenges in empathetic communication persist.

• AI-based conversational chatbots are accurate sources of information for temporomandibular disorder-related queries.
• Variability in accuracy was observed in certain clinical domains despite the high overall chatbot accuracy.
• A novel AI model for automated detection of empathy in text was developed.
• For rating empathy in the chatbot responses, the correlation among subject matter experts (SMEs) was moderate.
• The AI model developed aligned well with SME judgments for high-empathy responses, but less so for low-empathy ones.
• Although AI chatbots provide accurate TMD information, empathetic delivery remains a challenge.
Shehab et al. (Sun,) studied this question.