Estimation of item difficulty is essential in language test development, but recent attention has shifted toward the need to explain and predict it as well. This shift has practical implications for item development, adaptive testing, and construct validation. Measurement specialists have traditionally explored the factors contributing to item difficulty through explanatory item response theory (EIRT). In language assessment, explaining difficulty remains challenging because of the complex, context-sensitive nature of linguistic constructs. Advances in artificial intelligence (AI), most notably in machine learning (ML) and natural language processing (NLP), have expanded the possibilities by offering scalable and flexible solutions, but these solutions may compromise interpretability, i.e., the capacity to link results to the underlying construct of ability. In sensitive areas, such as immigration and citizenship, generating validation evidence is critical, which creates a pressing need to understand the implications of using ML models in these contexts. This conceptual paper explores the meeting ground between measurement and machine learning, examining how these traditions converge and diverge in modelling item difficulty. Trade-offs between model interpretability and scalable application are highlighted, and implications are discussed in light of the increasingly interdisciplinary nature of this field, including the possibilities offered by hybrid IRT-ML solutions.
Dunn et al. studied this question.