Abstract Concordance metrics quantify agreement between measures and are used to assess the interrater reliability of human ratings of constructed responses (CR) or the accuracy of scores assigned by automated scoring engines. This article examines the quadratic weighted kappa (QWK) in these contexts and highlights its practical limitations compared to other metrics. Results from both empirical and simulation studies reveal how factors such as the shape of the marginal distributions and the length of the score scale can affect QWK estimates, and how these properties of the contingency table can be adjusted for. The results highlight the QWK's sensitivities and suggest that additional caution is warranted before deciding whether to keep a CR item on a test form. Such decisions may be misinformed if QWK is used without the proper interpretive supports. Consequently, we make suggestions for best practices to promote responsible evaluation of agreement in the context of CR scoring in educational testing.
Lewis et al. studied this question.
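To make the sensitivity to marginal distributions concrete, the following minimal sketch computes QWK from two vectors of integer scores using the standard definition, one minus the ratio of weighted observed disagreement to weighted chance-expected disagreement, with weights (i - j)^2 / (k - 1)^2. The function name `quadratic_weighted_kappa` and both data sets are hypothetical illustrations and are not taken from the article's studies. In both data sets the raters agree exactly on half of the responses and every disagreement is one score point, yet the QWK values differ sharply because the second set uses only two of the four score categories.

```python
def quadratic_weighted_kappa(rater_a, rater_b, num_categories):
    """QWK for two equal-length lists of integer scores in 0..num_categories-1."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("score lists must be non-empty and of equal length")
    if num_categories < 2:
        raise ValueError("need at least two score categories")
    k = num_categories
    n = len(rater_a)

    # Contingency table: rows index rater A's score, columns rater B's score.
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1

    # Marginal totals for each rater.
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            # Count expected under independence of the two sets of scores.
            expected = row_totals[i] * col_totals[j] / n
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected

    if weighted_expected == 0:
        raise ValueError("QWK is undefined when all scores fall in one category")
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical data on a 4-point scale (0-3): 50% exact agreement in both
# cases, and every disagreement is a single score point.
spread_a = [0, 0, 1, 1, 2, 2, 3, 3]      # scores spread over the full scale
spread_b = [0, 1, 1, 0, 2, 3, 3, 2]
restricted_a = [0, 0, 0, 0, 1, 1, 1, 1]  # scores confined to categories 0 and 1
restricted_b = [0, 0, 1, 1, 1, 1, 0, 0]

qwk_spread = quadratic_weighted_kappa(spread_a, spread_b, 4)
qwk_restricted = quadratic_weighted_kappa(restricted_a, restricted_b, 4)
print(f"spread marginals:     {qwk_spread:.3f}")      # 0.800
print(f"restricted marginals: {qwk_restricted:.3f}")  # 0.000
```

Under these assumed data, identical exact-agreement rates correspond to a QWK of 0.8 in one case and 0 in the other, which is the kind of dependence on the contingency table's marginals that calls for interpretive supports before a QWK threshold is applied.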