Objective: Assessing orthodontic treatment need often involves subjective judgment, particularly with esthetic indices such as the Aesthetic Component (AC) of the Index of Orthodontic Treatment Need (IOTN). This cross-sectional diagnostic agreement study evaluated whether large language models (LLMs) could provide consistent and reliable IOTN-AC scores comparable to those assigned by expert orthodontists.

Methods: Three experienced orthodontists (8–15 years of clinical experience) independently scored 147 standardized frontal intraoral photographs using the IOTN-AC at two time points (Time 1 and Time 2). Five LLMs (GPT-4.0, GPT-o3, Claude, Manus, and Grok) were used to evaluate the dataset. Agreement and reliability were assessed using intraclass correlation coefficients (ICCs), Pearson and Spearman correlation coefficients, mean absolute error (MAE), and match analyses (exact, near, and group matches).

Results: < 0.001) and the lowest MAE (1.09). In the match analyses, GPT-4.0 achieved the highest exact match (28.6%), near-match (47.6%), and group match (66.0%) rates. The remaining models showed lower performance across all metrics.

Conclusions: Multimodal LLMs, particularly GPT-4.0, demonstrated substantial agreement with expert orthodontists in IOTN-AC scoring. These findings suggest that LLMs may serve as adjunct tools in assessing orthodontic treatment need. However, clinical decision-making should continue to rely on expert judgment.
Arısan et al. (Tue,) studied this question.
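To make the agreement metrics named in the Methods concrete, the following is a minimal Python sketch of how MAE and the three match rates might be computed for one model against a reference score. It is not the authors' analysis code. The abstract does not define "near" or "group" matches, so the sketch assumes a near match means a difference of at most one grade and a group match means both scores fall in the same treatment-need band, using the commonly cited IOTN-AC banding (grades 1-4, 5-7, 8-10). The function names `need_group` and `agreement_summary` and the sample scores are hypothetical.

```python
from typing import Dict, Sequence


def need_group(grade: int) -> str:
    """Map an IOTN-AC grade (1-10) to a treatment-need band.

    Assumed banding, not stated in the abstract:
    1-4 = no or slight need, 5-7 = borderline, 8-10 = definite need.
    """
    if not 1 <= grade <= 10:
        raise ValueError(f"IOTN-AC grade must be between 1 and 10, got {grade}")
    if grade <= 4:
        return "no_or_slight"
    if grade <= 7:
        return "borderline"
    return "definite"


def agreement_summary(
    reference: Sequence[int], predicted: Sequence[int]
) -> Dict[str, float]:
    """Return MAE and exact/near/group match rates as fractions of all cases."""
    if not reference or len(reference) != len(predicted):
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(reference)
    diffs = [abs(r - p) for r, p in zip(reference, predicted)]
    return {
        "mae": sum(diffs) / n,
        "exact": sum(d == 0 for d in diffs) / n,
        # Assumption: "near" counts any score within one grade of the reference.
        "near": sum(d <= 1 for d in diffs) / n,
        # Assumption: "group" counts scores in the same treatment-need band.
        "group": sum(
            need_group(r) == need_group(p) for r, p in zip(reference, predicted)
        ) / n,
    }


# Made-up illustrative scores; these are not data from the study.
expert = [1, 4, 5, 8, 10, 3]
model = [1, 5, 7, 8, 9, 6]
summary = agreement_summary(expert, model)
print(summary)
```

Under these assumed definitions a near match always includes exact matches, so the near rate can never fall below the exact rate. The study's own definitions may differ, and the reported percentages should be read against the full paper.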