Automatic speech recognition for low-resource languages often relies on synthetic utterances to augment limited speech data; these utterances are generated by pairing large-language-model transcripts with neural text-to-speech (TTS) audio. However, indiscriminate incorporation of synthetic audio can reduce training efficiency and introduce errors such as unnatural speech patterns. We introduce WAVe, a model that verifies word-to-audio frame correspondence by aligning text representations with audio features. To evaluate WAVe, we generated 22k Portuguese and 35k Dutch synthetic audio samples using GPT-4o-mini and a TTS system. We created four training subsets per language with varying proportions of synthetic data and fine-tuned three Whisper models of different sizes. For Portuguese, our high-quality 29k-sample subset achieved a 7.9% word error rate with Whisper-Large-v3, outperforming a recent competitive 55k-sample model trained under identical conditions. Experiments on Dutch showed similarly consistent improvements. WAVe reduces the number of training steps by 34%, substantially lowering computational cost while improving ASR quality. Cross-domain evaluation on the Multilingual LibriSpeech benchmark demonstrates that WAVe-based filtering reduces WER from 13.54% to 6.89%. These results establish WAVe as an effective quality control mechanism for synthetic data pipelines, enabling the identification and removal of poorly synthesized audio-text pairs prior to ASR fine-tuning.
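The two quantities the abstract relies on, word error rate and score-based filtering of synthetic pairs, can be illustrated with a minimal sketch. It assumes each synthetic pair already carries a scalar quality score. The `SyntheticPair` class, its `score` field, the `filter_pairs` helper, and the 0.5 threshold are hypothetical names and values chosen for illustration. They are not WAVe's actual interface or scoring method, which the abstract does not specify. The WER function is the standard word-level edit distance divided by the reference length.

```python
from dataclasses import dataclass


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: word-level Levenshtein distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


@dataclass
class SyntheticPair:
    """Hypothetical container for one synthetic transcript/audio pair."""
    text: str
    audio_path: str
    score: float  # assumed alignment-quality score in [0, 1]; not WAVe's real output


def filter_pairs(pairs, threshold=0.5):
    """Keep pairs whose assumed quality score meets the (illustrative) threshold."""
    return [p for p in pairs if p.score >= threshold]


def relative_reduction(before: float, after: float) -> float:
    """Relative change between two error rates, as a fraction of the first."""
    return (before - after) / before


if __name__ == "__main__":
    pairs = [
        SyntheticPair("o gato dorme", "a.wav", 0.91),
        SyntheticPair("de kat slaapt", "b.wav", 0.22),
    ]
    kept = filter_pairs(pairs)
    print([p.audio_path for p in kept])
    print(word_error_rate("a b c d", "a x c"))
    print(round(relative_reduction(13.54, 6.89), 3))
```

Read this way, the reported cross-domain change from 13.54% to 6.89% WER is a relative reduction of about 49%. In a real pipeline the threshold would be tuned on held-out data rather than fixed in advance.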
Yuriy Perezhohin
Mauro Castelli
Information Sciences
Universidade Nova de Lisboa
Perezhohin et al. (Fri,) studied this question.
www.synapsesocial.com/papers/69fd7e42bfa21ec5bbf06735 — DOI: https://doi.org/10.1016/j.ins.2026.123591