What question did this study set out to answer?

The study assesses how consistently AI chatbots select composite shades compared to a dental specialist.

February 28, 2026 · Open Access

Repeatability of Artificial Intelligence Chatbots in Composite Shade Selection: Agreement with a Dental Specialist

Key Points

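Three AI chatbots were evaluated: ChatGPT-4.0, Microsoft Copilot, and Claude 3.5.
Ten acrylic resin maxillary central incisors were photographed with A1, A2, and A3 composite shade tabs under standardized illumination.
Each model repeated its shade selections on five different days using identical images and prompts.
The specialist's visual shade selection was set by consensus between two calibrated evaluators; CIE L*, a*, and b* values were obtained by photometric analysis, and color differences were calculated with the CIEDE2000 formula.
Intra-model repeatability was measured using Fleiss’ kappa; agreement with the specialist was evaluated using Cohen’s kappa.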
ChatGPT-4.0 showed fair repeatability (κ = 0.33).
Claude 3.5 exhibited moderate repeatability (κ = 0.45).
Microsoft Copilot had poor repeatability (κ = −0.12).
ChatGPT-4.0 generally agreed more closely with the specialist than the other models did.
Microsoft Copilot showed consistently low agreement with the specialist.
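
The labels attached to the kappa values above (fair, moderate, poor) are consistent with the widely used Landis and Koch bands, although the summary does not name the scale the authors applied. The sketch below assumes that scale; the function name `landis_koch_label` is hypothetical and used only for illustration.

```python
def landis_koch_label(kappa: float) -> str:
    """Map a kappa value to a Landis and Koch (1977) agreement label.

    Assumption: the study's labels follow this scale; the source does not say so.
    """
    if kappa < 0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


# Kappa values reported in the key points above.
reported = {"ChatGPT-4.0": 0.33, "Claude 3.5": 0.45, "Microsoft Copilot": -0.12}
for model, kappa in reported.items():
    print(f"{model}: kappa = {kappa:+.2f} ({landis_koch_label(kappa)})")
```

Under this assumed scale, the three reported values fall into the same bands the study reports.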

Abstract

This study aimed to evaluate the intra-model repeatability of three artificial intelligence-based chatbots (ChatGPT-4.0, Microsoft Copilot, and Claude 3.5) in composite shade selection and their agreement with a dental specialist. Ten acrylic resin maxillary central incisor teeth representing different VITA Classical shades (n = 10) were photographed together with A1, A2, and A3 composite shade tabs under standardized illumination. Shade selections were performed by each artificial intelligence model based on the photographs and repeated on five different days using identical images and prompts. Visual shade selection by the dental specialist was determined by consensus between two calibrated evaluators. CIE L*, a*, and b* values of the acrylic teeth and composite shade tabs were obtained by photometric analysis, and color differences were calculated using the CIEDE2000 formula. Intra-model repeatability was assessed using Fleiss’ kappa coefficient, and agreement with the dental specialist was evaluated using Cohen’s kappa statistic. Intra-model repeatability differed among the models, with ChatGPT-4.0 demonstrating fair repeatability (κ = 0.33), Claude 3.5 showing moderate repeatability (κ = 0.45), and Microsoft Copilot exhibiting poor repeatability (κ = −0.12). Trial-level agreement with the dental specialist varied across repeated assessments, with ChatGPT-4.0 generally demonstrating higher agreement than the other models, whereas Microsoft Copilot showed consistently low agreement. Artificial intelligence chatbots showed variable repeatability and limited agreement with expert evaluation in composite shade selection under standardized conditions.
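
The two agreement statistics named in the abstract can be sketched in a few lines of Python. This is a minimal illustration of the standard Fleiss' kappa and Cohen's kappa formulas, not the authors' analysis code: the function names, the `SHADES` tuple, and all ratings shown are hypothetical, and the five repeated trials per tooth are assumed to play the role of raters in the Fleiss calculation.

```python
from collections import Counter

SHADES = ("A1", "A2", "A3")  # shade tabs named in the abstract


def fleiss_kappa(ratings, categories=SHADES):
    """Fleiss' kappa for a list of subjects, each with the same number of ratings."""
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if any(len(row) != n_raters for row in ratings):
        raise ValueError("every subject needs the same number of ratings")
    # counts[i][j]: how many ratings put subject i into category j
    counts = [[Counter(row)[c] for c in categories] for row in ratings]
    # Observed agreement per subject, then averaged over subjects
    p_i = [(sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_bar = sum(p_i) / n_subjects
    # Chance agreement from the overall category proportions
    total = n_subjects * n_raters
    p_j = [sum(row[j] for row in counts) / total for j in range(len(categories))]
    p_e = sum(p * p for p in p_j)
    if p_e == 1:
        return float("nan")  # all ratings in one category: kappa is undefined
    return (p_bar - p_e) / (1 - p_e)


def cohen_kappa(rater_a, rater_b, categories=SHADES):
    """Cohen's kappa for two raters scoring the same subjects."""
    if len(rater_a) != len(rater_b):
        raise ValueError("both raters must score the same subjects")
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if p_e == 1:
        return float("nan")  # both raters used a single category throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical data: 4 teeth, 5 repeated trials each (NOT the study's results).
trials = [
    ["A1", "A1", "A1", "A2", "A1"],
    ["A2", "A2", "A3", "A2", "A2"],
    ["A3", "A3", "A3", "A3", "A2"],
    ["A1", "A2", "A2", "A2", "A3"],
]
specialist = ["A1", "A2", "A3", "A2"]
first_trial = [row[0] for row in trials]

print("Fleiss' kappa across trials:", round(fleiss_kappa(trials), 3))
print("Cohen's kappa, trial 1 vs specialist:", round(cohen_kappa(first_trial, specialist), 3))
```

Both statistics subtract the agreement expected by chance, which is why a model that disagrees with itself more often than chance would predict can score below zero, as reported for Microsoft Copilot.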

Özdemir et al. studied this question.

synapsesocial.com/papers/69a288170a974eb0d3c04097 https://doi.org/10.3390/app16052306
