This study investigates how Whisper automatic speech recognition models handle Korean syllable-final consonant clusters, focusing on the effects of dialectal variation, model size, and phonological variation. We analyzed 54,536 spontaneous-speech utterances from the National Institute of Korean Language (NIKL) Daily Conversation Corpus 2022, comparing speakers from Seoul and Gyeongsang (consolidating Busan, Daegu, Gyeongnam, Gyeongbuk, and Ulsan) across four Whisper variants (small, medium, large-v2, large-v3). Overall coda error rates ranged from 2.67% (large-v3) to 4.53% (small), while consonant cluster error rates were substantially higher, ranging from 2.94% (large-v3, Seoul) to 6.93% (small, Gyeongsang). Seoul showed lower error rates than Gyeongsang, but this advantage narrowed with model size and disappeared in large-v3. Phonologically motivated errors constituted a consistent and substantial minority of all errors (23%–41% across models and varieties), with a statistically robust subtype hierarchy: C2 simplification was dominant, followed by C1 simplification and aspiration merger, while nasal assimilation and resyllabification occurred at near-zero rates. A Seoul-over-Gyeongsang asymmetry in C1 simplification errors, significant in smaller models and attenuating with scale, is attributed to an interaction between Whisper
Tae-Jin Yoon
Soohyun Kwon
Jeong-Im Han
Phonetics and Speech Sciences
www.synapsesocial.com/papers/69e07de52f7e8953b7cbee2d — DOI: https://doi.org/10.13064/ksss.2026.18.1.055