What question did this study set out to answer?

This study asks how Whisper automatic speech recognition models handle Korean syllable-final consonant clusters, and how recognition errors vary with dialect, model size, and phonological variation.

April 16, 2026 · Open Access

Whisper automatic speech recognition errors in Korean complex coda recognition: Syllable coda phonotactics and dialectal variation

Key Points

Analyzed 54,536 spontaneous-speech utterances from the National Institute of Korean Language (NIKL) Daily Conversation Corpus 2022.
Compared four Whisper models of different sizes (small, medium, large-v2, large-v3).
Examined speakers of two Korean varieties: Seoul and Gyeongsang (Busan, Daegu, Gyeongnam, Gyeongbuk, and Ulsan combined).
Measured overall coda error rates and consonant cluster error rates.
Overall coda error rates ranged from 2.67% (large-v3) to 4.53% (small).
Consonant cluster error rates ranged from 2.94% (large-v3, Seoul) to 6.93% (small, Gyeongsang); Seoul's lower error rates narrowed with model size and disappeared in large-v3.
Phonologically-motivated errors made up 23%–41% of all errors; C2 simplification was the most common subtype, followed by C1 simplification and aspiration merger.
The Seoul-over-Gyeongsang asymmetry in C1 simplification errors was significant in smaller models and attenuated as model size increased.

Abstract

This study investigates how Whisper automatic speech recognition models handle Korean syllable-final consonant clusters, focusing on the effects of dialectal variation, model size and phonological variation. We analyzed 54,536 spontaneous-speech utterances from the National Institute of Korean Language (NIKL) Daily Conversation Corpus 2022, comparing speakers from Seoul and Gyeongsang (consolidating Busan, Daegu, Gyeongnam, Gyeongbuk, and Ulsan) across four Whisper variants (small, medium, large-v2, large-v3). Overall coda error rates ranged from 2.67% (large-v3) to 4.53% (small), while consonant cluster error rates were substantially higher, ranging from 2.94% (large-v3, Seoul) to 6.93% (small, Gyeongsang). Seoul consistently showed lower error rates than Gyeongsang, but this advantage narrowed with model size and disappeared in large-v3. Phonologically-motivated errors constituted a consistent and substantial minority of all errors (23%–41% across models and varieties), with a statistically robust subtype hierarchy: C2 simplification was dominant, followed by C1 simplification and aspiration merger, while nasal assimilation and resyllabification occurred at near-zero rates. A Seoul-over-Gyeongsang asymmetry in C1 simplification errors, significant in smaller models and attenuating with scale, is attributed to an interaction between Whisper

Authors

Tae-Jin Yoon

Soohyun Kwon

Jeong-Im Han

Journals

Phonetics and Speech Sciences
