March 3, 2026 | Open Access

Human raters vs Chat GPT : how do their scorings of EFL learners’ academic writing differ?

Key Points

ChatGPT and human raters showed moderate agreement in scoring grammar, vocabulary, and sentence structure.
A total of 7072 scores were analyzed, including 416 global and 6656 analytic scores across two writing tasks.
Data were analyzed using correlations and paired-samples t-tests.
Findings suggest potential for AI tools in language assessment, but further investigation is necessary.

Abstract

Artificial Intelligence (AI) in language assessment has gained attention because of its potential to ease teachers’ complex task of assessing students’ writing. Although previous research has explored the use of technological tools to assess EFL learners’ writing, further investigation is needed into how AI, particularly ChatGPT, can be used as an assessment tool in high-stakes writing assessment, and whether the scores it provides are similar to those assigned by human raters. In the context of a high-stakes writing test at Universidad del Valle, this quantitative study investigated how the scores assigned by EFL university teachers differed from those given by ChatGPT when assessing EFL university learners’ written productions (a personal opinion essay and a data explanatory essay). Two argumentative writing tasks from two cohorts of ninth-semester EFL learners (N = 208) were used to compare the global and analytic ratings awarded by a pool of 20 human raters with those of ChatGPT using an analytic scoring rubric. The analytic dimensions included content, coherence and cohesion, sentence structure, grammar, and vocabulary. A total of 7072 scores were analyzed, including 416 global (208 human and 208 ChatGPT) and 6656 analytic scores across two writing tasks. Analytic scoring covered seven criteria for Task 1 and nine for Task 2, with human raters and ChatGPT each providing the same number of analytic ratings (3328 each). Data were analyzed in JASP using correlations and paired-samples t-tests. According to the results, ChatGPT showed moderate agreement with human raters in surface-level dimensions such as grammar, vocabulary, and sentence structure.
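The score counts and the two statistics named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' analysis, which was run in JASP. The function names (`score_counts`, `pearson_r`, `paired_t`) and the sample scores are hypothetical and chosen only for illustration. The count breakdown assumes one global score per learner per rating source, which is consistent with the reported 208 + 208 global scores. Pearson's r is also an assumption, since the abstract does not say which correlation coefficient was used.

```python
from math import sqrt
from statistics import mean, stdev


def score_counts(n_learners=208, criteria_task1=7, criteria_task2=9):
    """Reproduce the score totals reported in the abstract.

    Assumes one global score per learner per source (human, ChatGPT)
    and one analytic score per learner per criterion per source.
    """
    n_sources = 2  # human raters and ChatGPT
    global_scores = n_sources * n_learners
    analytic_scores = n_sources * n_learners * (criteria_task1 + criteria_task2)
    return {
        "global": global_scores,
        "analytic": analytic_scores,
        "total": global_scores + analytic_scores,
    }


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def paired_t(x, y):
    """Paired-samples t statistic and degrees of freedom.

    t = mean(d) / (sd(d) / sqrt(n)), where d holds the per-essay
    differences between the two raters' scores. No p-value is computed.
    """
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    t = mean(d) / (stdev(d) / sqrt(n))
    return t, n - 1


if __name__ == "__main__":
    # Hypothetical scores for six essays; NOT data from the study.
    human = [3.5, 4.0, 2.5, 3.0, 4.5, 3.5]
    chatgpt = [3.0, 4.0, 3.0, 3.5, 4.0, 3.0]

    print(score_counts())
    print("r =", round(pearson_r(human, chatgpt), 3))
    t, df = paired_t(human, chatgpt)
    print("t =", round(t, 3), "df =", df)
```

The two statistics answer different questions. The correlation shows whether ChatGPT ranks essays similarly to the human raters, while the paired-samples t statistic shows whether one source scores systematically higher or lower than the other on the same essays.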

Authors

Valentina Zapata Villano
