What question did this study set out to answer?

The aim is to improve automatic speech recognition for low-resource languages by verifying synthetic audio quality.

May 8, 2026 | Open Access

WAVe: Word-aligned verification of synthetic speech for ASR

Key Points

Generated 22k Portuguese and 35k Dutch synthetic audio samples using GPT-4o-mini and a TTS system.
Created four training subsets for each language with different synthetic data proportions.
Fine-tuned three Whisper models of different sizes.
Achieved a 7.9% word error rate with a 29k-sample Portuguese subset using Whisper-Large-v3, outperforming a 55k-sample model.
Obtained similarly consistent improvements for Dutch ASR.
Reduced training steps by 34%, lowering computational cost while improving ASR quality.
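
The filtering idea behind the key points above can be illustrated with a minimal sketch. It is not the authors' implementation: the `score` field, the 0.5 threshold, the file names, and the helper `filter_synthetic` are all hypothetical, since this page does not say how WAVe scores are produced or thresholded. The sketch only shows the general pattern of keeping every real sample and dropping synthetic samples whose verification score is low.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    text: str
    audio_path: str
    synthetic: bool
    # Hypothetical verification score in [0, 1]; real samples default to 1.0.
    score: float = 1.0


def filter_synthetic(samples, threshold=0.5):
    """Keep all real samples; keep synthetic ones only if score >= threshold.

    The threshold value is an assumption chosen for illustration.
    """
    return [s for s in samples if not s.synthetic or s.score >= threshold]


# Toy data with made-up paths and scores.
samples = [
    Sample("bom dia", "real_0.wav", synthetic=False),
    Sample("boa tarde", "syn_0.wav", synthetic=True, score=0.91),
    Sample("boa noite", "syn_1.wav", synthetic=True, score=0.12),
]

kept = filter_synthetic(samples)
print([s.audio_path for s in kept])
```

Varying `threshold` would be one plausible way to build training subsets with different proportions of synthetic data, though the study may construct its four subsets differently.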

Abstract

Automatic speech recognition (ASR) for low-resource languages often relies on synthetic utterances to augment limited speech data; these utterances are generated by pairing large-language-model transcripts with neural text-to-speech (TTS) audio. However, indiscriminate incorporation of synthetic audio can reduce training efficiency and introduce errors such as unnatural speech patterns. We introduce WAVe, a model that verifies word-to-audio frame correspondence by aligning text representations with audio features. To evaluate WAVe, we generated 22k Portuguese and 35k Dutch synthetic audio samples using GPT-4o-mini and a TTS system. We created four training subsets per language with varying proportions of synthetic data and fine-tuned three Whisper models of different sizes. For Portuguese, our high-quality 29k-sample subset achieved a 7.9% word error rate (WER) with Whisper-Large-v3, outperforming a recent competitive 55k-sample model trained under identical conditions. Experiments on Dutch showed similarly consistent improvements. WAVe reduces the number of training steps by 34%, substantially lowering computational cost while improving ASR quality. Cross-domain evaluation on the Multilingual LibriSpeech benchmark demonstrates that WAVe-based filtering reduces WER from 13.54% to 6.89%. These results establish WAVe as an effective quality control mechanism for synthetic data pipelines, enabling the identification and removal of poorly synthesized audio-text pairs prior to ASR fine-tuning.
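
For readers unfamiliar with the metric reported above, word error rate is conventionally the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and the recognizer output, divided by the number of reference words. The sketch below is a plain-Python illustration of that standard definition. The function name `word_error_rate` and the example sentences are assumptions, and the study's exact text normalization and scoring tool are not stated here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (not from the study): one of five words is missing.
wer = word_error_rate("o gato dorme no sofá", "o gato dorme sofá")
print(f"{wer:.1%}")
```

Read this way, the cross-domain drop from 13.54% to 6.89% WER corresponds to a relative reduction of roughly 49%.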

Authors

Yuriy Perezhohin

Mauro Castelli

Journals

Information Sciences

Institutions

Universidade Nova de Lisboa
