What question did this study set out to answer?

This research aims to evaluate the effectiveness of quadratic weighted kappa in assessing interrater reliability for both human and automated scoring systems.

March 10, 2026

Limitations of QWK in Evaluating Automated and Human Scoring Systems

Key Points

Analyzed concordance metrics to evaluate agreement between human ratings and automated scores.
Conducted empirical and simulation studies to assess the QWK's limitations.
Investigated the effects of marginal distributions and score scale length on QWK estimates.
Identified practical limitations of QWK in accurately reflecting interrater agreement.
Revealed that factors such as the shape of the marginal distributions and the length of the score scale can significantly impact QWK estimates.
Suggested best practices to improve evaluation methods for constructed-response scoring.

Abstract

To assess the interrater reliability of human ratings of constructed responses (CR), or the accuracy of scores given by automated scoring engines, concordance metrics quantify agreement between measures. This article examines the quadratic weighted kappa (QWK) in these contexts and highlights its practical limitations compared to other metrics. Both empirical and simulation study results reveal how different factors including the shape of the marginal distributions and score scale length may impact the estimates and how we can adjust for these properties of the contingency table. The results highlight the QWK's sensitivities and suggest that additional caution should be taken before decisions about whether to keep a CR item on a test form are made. If using QWK without the proper interpretive supports, such decisions may be misinformed. Consequently, we make suggestions for best practices to promote responsible evaluation of agreement in the context of CR scoring in educational testing.
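
To make the sensitivity to marginal distributions concrete, the following is a minimal sketch of the standard QWK computation from a contingency table. It is not the study's own code. The function name `quadratic_weighted_kappa` and both 3-point tables are hypothetical and invented purely for illustration. The two tables have identical off-diagonal (disagreement) cells and the same 80% exact agreement, and differ only in how the scores are distributed across the scale.

```python
import numpy as np


def quadratic_weighted_kappa(table):
    """QWK from a k x k contingency table (rows: rater A, columns: rater B).

    Hypothetical helper for illustration; uses the standard definition
    QWK = 1 - sum(w * observed) / sum(w * expected).
    """
    observed = np.asarray(table, dtype=float)
    k = observed.shape[0]
    observed = observed / observed.sum()  # cell proportions

    # Expected proportions if the two raters were independent,
    # built from the row and column marginals.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))

    # Quadratic disagreement weights: 0 on the diagonal, 1 at maximum distance.
    idx = np.arange(k)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (k - 1) ** 2

    return 1.0 - (weights * observed).sum() / (weights * expected).sum()


# Illustrative data only (not from the study): same disagreement cells,
# different marginal distributions.
balanced = [[25, 5, 0],
            [5, 30, 5],
            [0, 5, 25]]
skewed = [[70, 5, 0],
          [5, 8, 5],
          [0, 5, 2]]

qwk_balanced = quadratic_weighted_kappa(balanced)
qwk_skewed = quadratic_weighted_kappa(skewed)
print(round(qwk_balanced, 3), round(qwk_skewed, 3))
```

In this made-up example the balanced table yields a QWK of about 0.83 and the skewed table about 0.72, even though the raters disagree on exactly the same number of responses by exactly the same distances. Only the chance-expected disagreement changes, which is the kind of marginal-distribution effect the abstract describes.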
