What question did this study set out to answer?

The aim is to explore the influence of audio and speech features on speech recognition effectiveness for Vietnamese STT models.

April 15, 2026 | Open Access

Optimizing Vietnamese Speech Recognition Models Through Dataset-Level Audio and Speech Characteristics

Key Points

The aim is to explore the influence of audio and speech features on speech recognition effectiveness for Vietnamese STT models.
Evaluation of audio and speech characteristics like speech rate, naturalness, and signal-to-noise ratio.
Real-world experiments with social robots using tailored datasets based on identified characteristics.
Comparison of model accuracy with untailored and conventional pre-trained datasets.
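Tailoring datasets improved model accuracy by approximately 2.66% (speech rate), 4.72% (naturalness), 8.36% (signal-to-noise ratio), 5.89% (audio coloration), and 5.00% (environmental reverberation) compared with untailored datasets.
Models trained on curated datasets outperformed conventional pre-trained models by up to about 8.7% in accuracy.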
The application is most beneficial for noisy environments typical in social robots and voice assistants.

Abstract

The effectiveness of Speech-to-Text (STT) models depends heavily on dataset-level audio and speech characteristics, yet the quantitative influence of these factors remains insufficiently explored, particularly for low-resource languages such as Vietnamese. This study examines how specific audio and speech characteristics, including Speech Rate, Naturalness, Signal-to-Noise Ratio, Audio Coloration, and Environmental Reverberation, affect STT performance for Vietnamese. Among them, Naturalness is introduced as a new evaluative characteristic with a dedicated metric for dataset selection. Experiments in a real-world setting with a social robot show that tailoring datasets based on these characteristics can improve the accuracy of the trained models by approximately 2.66%, 4.72%, 8.36%, 5.89%, and 5.00%, respectively, compared with training on untailored ones. Additionally, models trained on curated datasets can outperform conventional pre-trained models by up to approximately 8.7% in accuracy, highlighting the effectiveness of our approach. The methodology is most useful in practical deployments (such as social robots, voice assistants, and contact-center systems) where field audio is noisier, reverberant, and produced by diverse, non-uniform speakers; its benefit diminishes once sufficiently large, representative training datasets exist.
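To make the idea of tailoring a dataset by audio characteristics concrete, the following is a minimal sketch rather than the authors' method. It assumes that the signal-to-noise ratio is estimated from separate speech and noise sample segments, that speech rate is measured in syllables per second, and that clips are plain dictionaries. The function names, dictionary keys, and thresholds are hypothetical and chosen only for illustration.

```python
import math


def snr_db(signal, noise):
    """Estimate SNR in dB as 10 * log10(P_signal / P_noise).

    Assumption: `signal` and `noise` are non-empty sequences of samples
    taken from the speech and noise-only parts of a clip.
    """
    p_signal = sum(x * x for x in signal) / len(signal)
    p_noise = sum(x * x for x in noise) / len(noise)
    if p_noise == 0:
        return math.inf   # no measurable noise
    if p_signal == 0:
        return -math.inf  # no measurable speech energy
    return 10.0 * math.log10(p_signal / p_noise)


def speech_rate(num_syllables, duration_s):
    """Speech rate in syllables per second (an assumed unit)."""
    return num_syllables / duration_s


def select_clips(clips, min_snr_db, rate_range):
    """Keep clips whose SNR and speech rate fall within the target ranges.

    Each clip is a dict with hypothetical keys "snr_db" and "rate".
    """
    low, high = rate_range
    return [
        c for c in clips
        if c["snr_db"] >= min_snr_db and low <= c["rate"] <= high
    ]


if __name__ == "__main__":
    clips = [
        {"id": "a", "snr_db": 25.0, "rate": speech_rate(12, 3.0)},
        {"id": "b", "snr_db": 8.0, "rate": speech_rate(12, 3.0)},
        {"id": "c", "snr_db": 30.0, "rate": speech_rate(30, 3.0)},
    ]
    # Illustrative thresholds only; they are not taken from the study.
    kept = select_clips(clips, min_snr_db=15.0, rate_range=(2.0, 6.0))
    print([c["id"] for c in kept])
```

In practice the target ranges would be derived from audio recorded in the deployment environment, so that the selected training clips resemble the field conditions the model will face.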

Authors

Kiet Pham Gia

Tin Huynh

Kiem Hoang

Journals

ACM Transactions on Asian and Low-Resource Language Information Processing

Institutions

University Of Information Technology

Wyższa Szkoła Technologii Informatycznych w Warszawie

Saigon International University
