Online English as a Second Language (ESL) platforms have expanded rapidly over the past decade, yet the evaluation of learners’ speaking ability on these platforms remains inconsistent, opaque, and under-researched. This study examined the CEFR-Aligned Monitoring System (CAMS), a structured intervention comprising an analytic rubric, rater calibration, a written feedback protocol, and an in-session rubric routine, deployed on three commercial online ESL platforms. Using a mixed-methods quasi-experimental design, 214 adult ESL learners (aged 19–47) in Southeast Asia and the Middle East were tracked over 16 weeks between September 2024 and January 2025. Learners were grouped by their instructor’s condition, forming a CAMS arm (n = 112) and a comparison arm (n = 102) assessed through conventional holistic ratings. Two research questions asked (a) whether CAMS improved inter-rater reliability among online ESL instructors relative to conventional practice, and (b) how learners and instructors experienced the shift from impressionistic to criterion-referenced evaluation. Pre- and post-intervention speaking scores were complemented by 38 learner interviews and four instructor focus groups (n = 12 instructors), analysed as separate datasets before points of convergence and divergence were identified. The study advances a conceptual shift from assessment as measurement to assessment as pedagogical monitoring in online learning environments. Instructors in the CAMS arm showed substantially higher inter-rater reliability than those in the comparison arm (ICC = .87 vs. .61, p < .001), and learners in that arm showed larger gains than their comparison-arm counterparts across fluency, coherence, interaction management, and task fulfilment (composite d = 0.56; fluency and coherence d = 0.74).
Learners valued the transparency of criterion-referenced feedback and several spontaneously adopted the rubric for self-monitoring in ways consistent with established models of self-regulated learning; instructors appreciated the structure but raised concerns about workload, task fit, and the limits of the phonological descriptors. Taken together, and read against the limits of the quasi-experimental design, the findings are best interpreted as system-level rather than rubric-level outcomes, and they extend the discourse on digital speaking assessment from measurement alone toward pedagogical monitoring embedded within a broader instructional ecology.
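For readers less familiar with the two statistics reported above, the following is a minimal illustrative sketch of how an intraclass correlation coefficient and a Cohen's d can be computed. It is not the study's analysis code. The abstract does not state which ICC form was used or how d was derived, so the sketch assumes a two-way random-effects, absolute-agreement, single-rater ICC, often written ICC(2,1), and a pooled-standard-deviation d on two groups of gain scores. The function names and toy data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def icc2_1(ratings):
    """ICC(2,1) for a learners-by-raters table (assumed form, see lead-in).

    ratings: list of rows, one row per learner, each row holding the
    scores that k raters gave to that learner.
    """
    n = len(ratings)       # number of learners (rows)
    k = len(ratings[0])    # number of raters (columns)
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between-learner mean square
    msc = ss_cols / (k - 1)              # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def cohens_d(group_a, group_b):
    """Cohen's d for two independent groups, using the pooled SD."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * variance(group_a)
                  + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical toy data: three learners scored by two raters, and
# gain scores for two small groups.
toy_ratings = [[3, 4], [5, 5], [2, 3]]
toy_icc = icc2_1(toy_ratings)
toy_d = cohens_d([2, 4, 6], [1, 3, 5])
```

Under this form, systematic differences between raters (one instructor scoring consistently higher than another) lower the coefficient, which is why an absolute-agreement ICC is a common choice when calibration is the target of the intervention.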
SULIMAN ABDELATY studied this question.