April 15, 2026 · Open Access

Differential item functioning analysis in large scale assessments: a case study for DIF in SABER 11

Key Points

The study focuses on identifying differential item functioning (DIF) between two test forms of the Mathematics test of Colombia's SABER 11 assessment.
The DIF identification process was tailored to large scale assessments following Sireci and Rios' guiding questions.
The study examined the performance of the non-compensatory DIF (NCDIF) index and the Mantel–Haenszel (MH) DIF procedure under large sample sizes and sample size ratios of up to 1:25.
Simulation studies were conducted to evaluate the behavior of both DIF indices, and of effect size guidelines, under these conditions.
DIF analyses of the SABER 11 Mathematics test forms provided robust evidence that inferences derived from score comparisons are fair.
Type I error rates for both procedures were affected by sample size, sample size ratio, and the magnitude of impact between the groups.
Joint use of the effect size guidelines helped mitigate inflated Type I error without much loss of power, given the large sample sizes involved.

Abstract

The absence of differential item functioning (DIF) is an important piece of evidence supporting inferences based on group comparisons of test results. We illustrate how to tailor the DIF identification process to the specific characteristics of large scale assessments (LSA), following the guiding questions proposed by Sireci and Rios, by examining DIF between two test forms of the Mathematics test of a Colombian LSA (SABER 11). We investigate the performance of the non-compensatory DIF (NCDIF) index and the Mantel–Haenszel (MH) DIF procedure under large sample sizes and sample size ratios (up to 1:25), and the performance of effect size guidelines under these conditions. These simulations were needed to adequately address the guiding questions for DIF analyses of SABER 11. DIF analyses of SABER 11 test forms were conducted in light of the results of these simulations. For both procedures, Type I error is affected by the sample size and the sample size ratio, as well as by the magnitude of impact between the groups. The joint use of the effect size guidelines helps mitigate this issue without much loss of power given the large sample sizes involved. The DIF analyses of the Mathematics test forms of SABER 11 provide robust evidence that the inferences derived from score comparisons are fair. Beyond the immediate implications for the use of SABER 11 tests, the presented case study may help guide practitioners in the assessment of DIF by illustrating how to perform several of the steps involved. Moreover, the simulation studies offer new insights into the frequentist behavior of the two DIF indices under conditions that had not been previously explored but which are applicable to many LSA. Additionally, the results indicate that simulation studies examining the performance of NCDIF, MH, and possibly any DIF statistic, should use realistic item parameter pools and not only sanitized, well-distributed sets of item parameters.
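As a rough illustration of the MH procedure named in the abstract, the sketch below computes the Mantel–Haenszel common odds ratio, its delta-scale transformation, and the continuity-corrected chi-square from 2x2 tables matched on total score. It is a minimal sketch and not the authors' implementation: the function names, the single-stratum toy counts, and the |delta| >= 1 magnitude threshold are assumptions made for illustration, and the study's exact effect size guidelines are not reproduced here.

```python
import math

CHI2_CRIT_05 = 3.841  # 0.05 critical value of a chi-square with 1 df


def mantel_haenszel_dif(strata):
    """Return (alpha_MH, delta_MH, chi2_MH) for one item.

    strata: iterable of (a, b, c, d) counts, one tuple per matched score level:
      a = reference correct, b = reference incorrect,
      c = focal correct,     d = focal incorrect.
    """
    num = den = 0.0
    sum_a = sum_expected = sum_var = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n < 2:
            continue  # stratum too small to contribute
        num += a * d / n
        den += b * c / n
        n_ref, n_foc = a + b, c + d
        m1, m0 = a + c, b + d  # total correct / incorrect in the stratum
        sum_a += a
        sum_expected += n_ref * m1 / n
        sum_var += n_ref * n_foc * m1 * m0 / (n * n * (n - 1))
    alpha = num / den                 # common odds ratio
    delta = -2.35 * math.log(alpha)   # delta-scale effect size
    chi2 = max(0.0, abs(sum_a - sum_expected) - 0.5) ** 2 / sum_var
    return alpha, delta, chi2


def flag_item(delta, chi2, min_abs_delta=1.0):
    """Hypothetical joint rule: significant AND non-negligible in magnitude."""
    return chi2 > CHI2_CRIT_05 and abs(delta) >= min_abs_delta


# Toy data (invented): one score level, group sizes in a 25:1 ratio,
# with only a small difference in odds between the groups.
alpha, delta, chi2 = mantel_haenszel_dif([(150000, 100000, 5800, 4200)])
print(f"alpha={alpha:.3f} delta={delta:.3f} chi2={chi2:.2f}")
print("significant:", chi2 > CHI2_CRIT_05, "| flagged jointly:", flag_item(delta, chi2))
```

In this toy case the chi-square exceeds the critical value while the delta-scale effect stays well below the threshold, so the joint rule does not flag the item. This is only meant to show how a magnitude criterion can temper a significance test when samples are very large. It does not reproduce the simulation results reported in the study.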

Authors

John Alexander Calderón

Nelson Andrés Rodríguez

Víctor H. Cervantes

Journals

Large-scale Assessments in Education

Institutions

University of Illinois Urbana-Champaign

Fundación para la Educación y el Desarrollo Social
