Dear Editor,

We read with great interest the recently published systematic review and meta-analysis evaluating the diagnostic performance of artificial intelligence (AI) models for strabismus screening [1]. The authors should be commended for providing a comprehensive synthesis of the current evidence, including detailed subgroup analyses according to algorithm architecture, validation strategy, data augmentation, sample size, and data modality. This work highlights the potential of AI as a screening tool for strabismus while also identifying key methodological factors that influence diagnostic performance. Nevertheless, several methodological and interpretative issues warrant careful consideration before these findings can be generalized to broader clinical practice.

First, the included studies were geographically imbalanced, with two-thirds conducted in China, which may limit the generalizability of the results. The prevalence and presentation of strabismus can vary across populations due to factors such as craniofacial anatomy, eye pigmentation, and environmental influences [2]. Studies from other regions were underrepresented, with only four countries represented in total. This raises the possibility of selection bias, as models trained predominantly on East Asian cohorts may perform suboptimally in more diverse populations (e.g., individuals with darker skin tones or different palpebral fissure sizes). Although the authors acknowledged this issue, no sensitivity analysis was performed to quantify its impact.

Second, strabismus diagnosis is not merely a matter of static image recognition; it is inherently dynamic, functional, and context-dependent. Clinical decision-making often relies on subtle temporal cues such as recovery movements during alternate cover testing, gaze stability, or misalignment induced by fatigue [3]. Such features are difficult to capture in still images or brief video clips. Current AI approaches, with their emphasis on image-based classification, risk reducing strabismus to a static morphological problem while overlooking the binocular visual functions that underpin diagnosis and treatment planning.

Third, heterogeneity in study design complicates interpretation. Reclassifying studies that originally focused on specific strabismus subtypes into a simple dichotomy of strabismus versus non-strabismus may obscure important differences in diagnostic complexity between subtypes such as esotropia, exotropia, or vertical deviations [4]. In real-world clinical practice, accurate subtype identification is crucial for determining appropriate management, and a binary framework may fail to capture these nuances.

Finally, while the review acknowledged the “black-box” nature of AI models, it did not systematically evaluate interpretability tools. Clinical adoption of AI depends not only on accuracy but also on transparency and usability. Future studies should therefore place greater emphasis on explainability and clinician-centered evaluation strategies.

In summary, this meta-analysis provides valuable insights into the diagnostic potential of AI for strabismus screening. However, concerns regarding study bias, subtype reclassification, limited sample diversity, and model interpretability present challenges for rapid clinical translation. We suggest that future research prioritize the use of multicenter and multimodal datasets, establish transparent thresholding strategies, and rigorously evaluate performance across different patient populations and strabismus subtypes. Such efforts are essential for moving AI from a promising screening adjunct to a reliable tool for integrated clinical diagnosis.

We declare that this letter complies with the TITAN guidelines [5].
Yinwen Shi (Wed,) studied this question.