What question did this study set out to answer?

This study evaluates whether large language models can reliably score the aesthetic component of orthodontic treatment needs, matching expert orthodontists' assessments.

May 29, 2026 | Open Access

Aesthetic Component of the Index of Orthodontic Treatment Need (IOTN): Can large language models match the performance of orthodontists?

What does this research mean for the field?

Multimodal large language models, particularly GPT-4.0, demonstrate substantial agreement with expert orthodontists in scoring the Aesthetic Component of the Index of Orthodontic Treatment Need (IOTN-AC), indicating their potential as adjunct clinical tools. Novelty: incremental. Consensus alignment: neutral.

Key Points

Cross-sectional diagnostic agreement design with 147 standardized photographs scored by three orthodontists.
Assessment at two time points, analyzing agreement and reliability using ICCs, correlation values, and mean absolute error.
Evaluation of five large language models (LLMs) against expert scores for IOTN-AC.
GPT-4.0 showed the highest agreement with orthodontists, achieving an exact match rate of 28.6%.
It also had the lowest mean absolute error of 1.09 compared to expert scores.
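Overall, multimodal LLMs, GPT-4.0 in particular, demonstrated substantial agreement with expert orthodontists, indicating potential as adjunct tools for treatment need assessments; the remaining models performed less well on every metric.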

Abstract

Objective: The assessment of orthodontic treatment needs often involves subjective judgment, particularly when using esthetic indices such as the Aesthetic Component (AC) of the Index of Orthodontic Treatment Need (IOTN). This cross-sectional diagnostic agreement study evaluated whether large language models (LLMs) could provide consistent and reliable IOTN-AC scores comparable to those assigned by expert orthodontists.

Methods: Three experienced orthodontists (with 8-15 years of clinical experience) independently scored 147 standardized frontal intraoral photographs using the IOTN-AC at two time points (Time 1 and Time 2). Five LLMs (GPT-4.0, GPT-o3, Claude, Manus, and Grok) were used to evaluate the dataset. Agreement and reliability were assessed using intraclass correlation coefficients (ICCs), Pearson and Spearman correlation values, mean absolute error (MAE), and match analyses (exact, near, and group matches).

Results: [...] < 0.001) and the lowest MAE (1.09). In the match analyses, GPT-4.0 achieved the highest exact match (28.6%), near-match (47.6%), and group match (66.0%) rates. The remaining models showed lower performance across all metrics.

Conclusions: Multimodal LLMs, particularly GPT-4.0, demonstrated substantial agreement with expert orthodontists in IOTN-AC scoring. These findings suggest that LLMs may serve as adjunct tools in assessments for orthodontic treatment needs. However, clinical decision-making should continue to rely on expert judgment.
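To make the agreement metrics named in the abstract concrete, the sketch below computes MAE and the three match rates for a pair of score lists. It is an illustration under stated assumptions, not the authors' analysis code. The abstract does not define the match categories, so the sketch assumes that a "near match" means a difference of at most one grade (exact matches included) and that a "group match" means both scores fall in the same treatment-need band, using the commonly cited IOTN-AC grouping of grades 1-4, 5-7, and 8-10. The function names and the sample scores are hypothetical.

```python
def need_group(score: int) -> str:
    """Map an IOTN-AC grade (1-10) to a treatment-need band.

    Assumption: the commonly cited 1-4 / 5-7 / 8-10 grouping; the source
    does not state which grouping the study used.
    """
    if not 1 <= score <= 10:
        raise ValueError(f"IOTN-AC grade must be between 1 and 10, got {score}")
    if score <= 4:
        return "no or slight need"
    if score <= 7:
        return "borderline need"
    return "definite need"


def agreement_metrics(expert: list[int], model: list[int]) -> dict[str, float]:
    """Compare model scores with expert scores, photograph by photograph."""
    if not expert or len(expert) != len(model):
        raise ValueError("score lists must be non-empty and of equal length")
    n = len(expert)
    diffs = [abs(e - m) for e, m in zip(expert, model)]
    return {
        # Mean absolute error: average distance in grades.
        "mae": sum(diffs) / n,
        # Exact match: identical grade.
        "exact": sum(d == 0 for d in diffs) / n,
        # Near match (assumed definition): within one grade, exact included.
        "near": sum(d <= 1 for d in diffs) / n,
        # Group match (assumed definition): same treatment-need band.
        "group": sum(
            need_group(e) == need_group(m) for e, m in zip(expert, model)
        ) / n,
    }


# Hypothetical scores for five photographs; illustrative only, not study data.
expert_scores = [2, 5, 8, 4, 10]
model_scores = [2, 6, 6, 5, 9]
metrics = agreement_metrics(expert_scores, model_scores)
print(metrics)
```

If the study counted near matches separately from exact matches, the `near` line would use `d == 1` instead; the reported rates alone do not settle which definition applies.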

Cite This Study

Arısan et al. studied this question.

synapsesocial.com/papers/6a192cb4fab5b468c44157fc https://doi.org/10.4041/kjod25.320
