March 3, 2026

Automatic annotation of learner errors: Testing the reliability of LanguageToolR

Key Points

LanguageToolR error counts correlated moderately with manual annotation for grammatical errors (r = .62) but only weakly for typographical errors (r = .24).
Inter-rater agreement reached an acceptable level (𝜅 = .87) when the second annotator was given the first annotator's errors and corrections, but dropped to 𝜅 = .61–.76 when corrections were not provided.
Precision and recall scores indicate that LanguageToolR is notably effective only for spelling errors (precision = 0.69; recall = 0.68).
Overall, findings suggest LanguageToolR is helpful for obvious spelling errors but fails to reliably capture crucial grammatical and typographical errors.
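
To make the figures in the key points concrete, the following is a minimal sketch of how precision, recall, and a text-level Pearson correlation can be computed. It is not the study's procedure: the annotation format (tuples of text ID, start offset, end offset, and error type), the exact-match criterion, and all data values are assumptions made purely for illustration.

```python
from math import sqrt


def precision_recall(gold, predicted):
    """Return (precision, recall) for two sets of error annotations.

    An annotation counts as a true positive only if it appears in both
    sets, i.e. exact match on every field (an illustrative assumption).
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def pearson_r(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical annotations: (text_id, start, end, error_type).
gold = {
    ("t1", 0, 5, "spelling"),
    ("t1", 10, 14, "spelling"),
    ("t2", 3, 8, "spelling"),
    ("t2", 20, 25, "spelling"),
}
predicted = {
    ("t1", 0, 5, "spelling"),
    ("t2", 3, 8, "spelling"),
    ("t2", 30, 34, "spelling"),  # flagged by the tool only: a false positive
}

precision, recall = precision_recall(gold, predicted)

# Hypothetical per-text error counts (manual vs. automatic).
manual_counts = [2, 4, 6]
tool_counts = [1, 2, 3]
r = pearson_r(manual_counts, tool_counts)

print(f"precision={precision:.2f} recall={recall:.2f} r={r:.2f}")
```

The toy data also shows why the two kinds of measure can diverge: the correlation only compares how many errors each text contains, so a tool can track manual counts closely across texts while still flagging the wrong items within each text.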

Abstract

Accuracy is a crucial construct in the evaluation of L2 performance (Housen & Kuiken, 2009; Thewissen, 2021), but measuring it usually requires annotating errors manually, which limits analyses to rather small text samples. For researchers who wish to capture accuracy across a large number of texts, manual annotation is not always feasible given the time and expense required. One solution to this problem may be to use automatic tools to extract errors from learner texts. In this study, we examine the reliability of one such automatic tool, namely LanguageTool (v5.9; Naber, 2003), as implemented through the R package LanguageToolR (v0.1.4; Schmid, 2023). To assess the reliability of LanguageToolR, we used as the gold standard a subset of texts from the International Corpus of Learner English (ICLE) (Granger et al., 2020). The subset includes 223 learner assignments that have previously been manually annotated for errors (for details see Thewissen, 2013) following the UCLouvain Error Tagging manual (v1.2; Dagneaux et al., 2005). This dataset was submitted to a two-step methodological procedure which we will report on in this presentation: (1) checking the reliability of the ICLE "gold standard" corpus: a subsample of the manually error-tagged corpus was independently analysed for errors by a second annotator, and inter-rater reliability statistics were calculated on that basis; (2) checking LanguageToolR reliability: the errors detected by LanguageTool were then compared against those in the manually coded gold standard, both at the text level, via correlations for the number of errors in each text, and at the level of individual errors, via precision and recall scores for each error type. The analyses revealed several key results. Firstly, the two annotators were able to reach an acceptable level of inter-rater agreement (𝜅 = .87), but only when both the error and the correction initially inserted by annotator 1 were provided to the second annotator. Agreement was somewhat lower (𝜅 = .61–.76) when corrections were not provided.
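
As an illustration of the inter-rater statistic, the sketch below computes Cohen's kappa for two annotators who labelled the same items. The abstract does not specify which kappa variant was used or how annotation units were defined, so the choice of Cohen's kappa, the label names, and the data are all assumptions.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each annotator's
    label frequencies.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical decisions on four candidate items ("error" vs. "none").
annotator_1 = ["error", "error", "none", "none"]
annotator_2 = ["error", "none", "none", "none"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa={kappa:.2f}")
```

In this toy case the annotators agree on three of four items (0.75 observed agreement), but half of that agreement would be expected by chance, which is why kappa is lower than the raw agreement rate.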
Secondly, consistent with a previous study investigating the reliability of LanguageTool on another corpus (Crossley et al., 2019), we found moderate correlations between the manual and LanguageToolR annotations for grammatical errors (r = .62), relatively strong correlations for spelling errors (r = .87), but weaker correlations for typographical errors (capitalisation, missing commas, possessive apostrophes, etc.) (r = .24). At the level of individual errors, however, the accuracy of LanguageToolR was found to be very low for typographical errors (precision = 0.01; recall < 0.01) and grammatical errors (precision = 0.49; recall = 0.05). The identification of spelling errors was more accurate (precision = 0.69; recall = 0.68). Qualitatively, these results mean that, while LanguageToolR may be somewhat useful for identifying obvious spelling errors, it underdetects crucial error types compared to a manual method. Among the elements it does flag, quite a few are in fact overcorrections (false positives), casting doubt on its usability in an L2 proficiency assessment context beyond perhaps the annotation of (some) spelling errors.

References

Crossley, S. A., Bradfield, F., & Bustamante, A. (2019). Using human judgments to examine the validity of automated grammar, syntax, and mechanical errors in writing. Journal of Writing Research, 11(2), 251–270. https://doi.org/10.17239/jowr-2019.11.02.01
Dagneaux, E., Denness, S., Meunier, F., Neff, J., & Thewissen, J. (2005). The Louvain Error Tagging Manual Version 1.2. Centre for English Corpus Linguistics.
Granger, S., Dupont, M., Meunier, F., Naets, H., & Paquot, M. (2020). The International Corpus of Learner English (Version 3). Presses universitaires de Louvain. https://dial.uclouvain.be/pr/boreal/object/boreal:229877
Housen, A., & Kuiken, F. (2009). Complexity, accuracy, and fluency in second language acquisition. Applied Linguistics, 30(4), 461–473. https://doi.org/10.1093/applin/amp048
Naber, D. (2003). A rule-based style and grammar checker [Bachelor thesis]. Universität Bielefeld.
Schmid, C. (2023). LanguageToolR: Provides a wrapper for the LanguageTool CLI tool for spelling, grammar and language checking. https://github.com/nevrome/LanguageToolR
Thewissen, J. (2013). Capturing L2 accuracy developmental patterns: Insights from an error-tagged EFL learner corpus. The Modern Language Journal, 97(S1), 77–101. https://doi.org/10.1111/j.1540-4781.2012.01422.x
Thewissen, J. (2021). Accuracy. In N. Tracy-Ventura & M. Paquot (Eds.), The Routledge Handbook of Second Language Acquisition and Corpora (pp. 305–317). Routledge.

Cite this study

Thewissen et al. (Mon,) studied this question.

www.synapsesocial.com/papers/69a75fb2c6e9836116a2b602

Authors

Jennifer Thewissen

The Journée linguistique du Cercle belge
