March 3, 2026 | Open Access

Evaluating and enhancing the accuracy of automated fluency annotation tools in L2 research

Key Points

Automated measurement of mean pause duration correlated significantly with manual annotation, supporting its use in evaluation.
The acoustic-based tool reached a Pearson correlation of 0.85 on silence-driven metrics.
The evaluation compared two automated tools on articulation rate and pause-related metrics.
Findings suggest that targeted corrections could improve text-sensitive metric extraction for L2 learners.

Abstract

Fluency is a central dimension of L2 oral proficiency, and measuring it is important in many applied contexts, including pedagogy and assessment. Yet measuring fluency through manual annotation is labor-intensive, which limits its broad application and scalability. We evaluated two automated tools, an acoustic-based tool (de Jong et al., 2021) and a machine-learning tool (Matsuura et al., 2025), using data from L1-Chinese learners of English. Accuracy was assessed for three metrics: articulation rate (AR), pause ratio (PR), and mean pause duration (MPD), via Pearson correlations with manual annotation. Using Steiger’s test, we compared the two tools and tested whether targeted manual post-processing (TextGrid checks and transcript adjustments) improved metric extraction. In our sample, the de Jong et al. (2021) tool yielded higher accuracy for the silence-based metrics (PR, MPD). However, the text-dependent metric (AR, which relies on the syllable count after disfluency words are removed) benefited from corrected TextGrids (for the acoustic tool) or corrected transcripts (for the machine-learning tool). These findings suggest a scalable division of labor: use an acoustic-based tool for silence-driven metrics, and apply corrected transcripts with a machine-learning tool when extracting text-sensitive metrics.
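
To make the three metrics and the accuracy measure concrete, the following is a minimal Python sketch rather than the procedure of either tool. It assumes commonly used definitions that the abstract does not spell out: AR as syllables per second of phonation time (total time minus silent pauses), PR as the proportion of total time spent in silent pauses, and MPD as total pause time divided by the number of pauses. The function names `fluency_metrics` and `pearson_r` and the input layout are hypothetical.

```python
from math import sqrt


def fluency_metrics(n_syllables, pauses, total_duration):
    """Compute AR, PR, and MPD for one speech sample.

    n_syllables: syllable count after disfluency words are removed.
    pauses: list of (start, end) silent-pause intervals in seconds.
    total_duration: length of the sample in seconds.
    The metric definitions are assumptions, not taken from the study.
    """
    pause_time = sum(end - start for start, end in pauses)
    phonation_time = total_duration - pause_time
    return {
        # Articulation rate: syllables per second of phonation time.
        "AR": n_syllables / phonation_time,
        # Pause ratio: share of the sample spent in silent pauses (assumed).
        "PR": pause_time / total_duration,
        # Mean pause duration: average length of one silent pause.
        "MPD": pause_time / len(pauses) if pauses else 0.0,
    }


def pearson_r(xs, ys):
    """Pearson correlation between automated and manual metric values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)
```

In use, `fluency_metrics` would be called once per speaker for each annotation source, and `pearson_r` would then compare, for example, the list of automated MPD values against the list of manual MPD values across speakers.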
