The effectiveness of Speech-to-Text (STT) models depends heavily on dataset-level audio and speech characteristics, yet the quantitative influence of these factors remains insufficiently explored, particularly for low-resource languages such as Vietnamese. This study examines how specific audio and speech characteristics, namely Speech Rate, Naturalness, Signal-to-Noise Ratio, Audio Coloration, and Environmental Reverberation, affect STT performance for Vietnamese. Among them, Naturalness is introduced as a new evaluative characteristic, with a dedicated metric for dataset selection. Experiments in a real-world setting with a social robot show that tailoring datasets based on these characteristics can improve the accuracy of the trained models by approximately 2.66%, 4.72%, 8.36%, 5.89%, and 5.00%, respectively, compared to training on untailored datasets. Additionally, models trained on curated datasets can outperform conventional pre-trained models by up to approximately 8.7% in accuracy, highlighting the effectiveness of our approach. The methodology is most useful in practical deployments (such as social robots, voice assistants, and contact-center systems) where field audio is noisy, reverberant, and produced by diverse, non-uniform speakers; its benefit diminishes once sufficiently large, representative training datasets exist.
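As a rough illustration of what tailoring a dataset on one of these characteristics might look like, the sketch below filters clips by an estimated Signal-to-Noise Ratio. It is not the authors' procedure: the abstract does not state how SNR is estimated or which threshold is used, so the clip fields (`signal_power`, `noise_power`), the helper names, and the 10 dB cutoff are all hypothetical.

```python
import math


def mean_power(samples):
    """Mean squared amplitude of a sequence of audio samples."""
    return sum(s * s for s in samples) / len(samples)


def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in decibels: 10 * log10(P_signal / P_noise)."""
    return 10.0 * math.log10(signal_power / noise_power)


def select_by_snr(clips, min_snr_db):
    """Keep clips whose estimated SNR is at least min_snr_db.

    Each clip is a dict with hypothetical keys 'signal_power' and
    'noise_power'; how those powers are estimated (for example, from
    speech and non-speech segments) is left open here.
    """
    return [
        clip for clip in clips
        if snr_db(clip["signal_power"], clip["noise_power"]) >= min_snr_db
    ]


# Hypothetical clips; the 10 dB threshold is an assumption, not from the paper.
clips = [
    {"id": "clean", "signal_power": 100.0, "noise_power": 1.0},
    {"id": "noisy", "signal_power": 2.0, "noise_power": 1.0},
]
kept = select_by_snr(clips, min_snr_db=10.0)
```

The same pattern (compute a per-clip score, then keep clips that fall in a target range) would apply to the other characteristics, provided a per-clip estimate is available for each.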
Kiet Pham Gia
Tin Huynh
Kiem Hoang
ACM Transactions on Asian and Low-Resource Language Information Processing
University of Information Technology
Wyższa Szkoła Technologii Informatycznych w Warszawie
Saigon International University
www.synapsesocial.com/papers/69df2c01e4eeef8a2a6b0f07 — DOI: https://doi.org/10.1145/3797912