Virtual assistants, teleconferencing, and assistive technologies all rely on real-time speech enhancement. Conventional deep learning-based speech enhancement methods often deliver poor voice quality in real-world scenarios because they rely on static parameters and cannot adapt to changing acoustic conditions. This paper proposes RL-SEM, which combines reinforcement learning (RL) with adaptive optimization to improve intelligibility and reduce noise. In the proposed framework, an RL agent driven by a deep neural network (DNN) dynamically adjusts the speech enhancement settings using contextual information. The Firefly Algorithm (FA) optimizes the RL hyperparameters, improving learning stability, convergence, policy exploration, and robustness of adaptation. The reward function is built from objective metrics such as signal-to-noise ratio (SNR), perceptual evaluation of speech quality (PESQ), or short-time objective intelligibility (STOI), steering the agent toward natural-sounding speech. Across stationary, non-stationary, and real-time noisy conditions, RL-SEM outperforms DNN-based and spectral subtraction baselines by 22% in PESQ and 16% in STOI, with latency also considered in the comparison. Overall, RL-SEM offers a flexible, adaptive architecture for real-time speech enhancement in next-generation human-computer interaction (HCI) applications.
Ishaq et al. (Thu,) studied this question.
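
The abstract states that the Firefly Algorithm tunes the RL hyperparameters but gives no details of the search. As a rough illustration only, the sketch below implements the textbook form of the Firefly Algorithm (brighter fireflies attract dimmer ones, with attractiveness decaying with distance) on a toy objective. Everything named here is an assumption for illustration: `firefly_minimize`, `toy_objective`, `BOUNDS`, the two hyperparameters (log10 learning rate and discount factor), and all default settings are hypothetical and are not taken from the paper. In a real setup the objective would presumably be the negative of a validation reward such as PESQ or STOI after a short training run.

```python
import math
import random


def firefly_minimize(objective, bounds, n_fireflies=15, n_iters=60,
                     beta0=1.0, gamma=1.0, alpha=0.2, seed=0):
    """Minimize `objective` over the box `bounds` with a basic Firefly Algorithm.

    Lower cost means a "brighter" firefly. All defaults are illustrative.
    """
    rng = random.Random(seed)
    # Random initial positions inside the search box.
    swarm = [[rng.uniform(lo, hi) for lo, hi in bounds]
             for _ in range(n_fireflies)]
    cost = [objective(x) for x in swarm]

    # Track the best position ever seen, since individual fireflies keep moving.
    best_i = min(range(n_fireflies), key=cost.__getitem__)
    best_x, best_cost = list(swarm[best_i]), cost[best_i]

    for _ in range(n_iters):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if cost[j] < cost[i]:  # j is brighter, so i moves toward j
                    r2 = sum((a - b) ** 2 for a, b in zip(swarm[i], swarm[j]))
                    beta = beta0 * math.exp(-gamma * r2)  # attractiveness
                    for d, (lo, hi) in enumerate(bounds):
                        noise = alpha * (rng.random() - 0.5) * (hi - lo)
                        x = swarm[i][d] + beta * (swarm[j][d] - swarm[i][d]) + noise
                        swarm[i][d] = min(max(x, lo), hi)  # clamp to the box
                    cost[i] = objective(swarm[i])
                    if cost[i] < best_cost:
                        best_x, best_cost = list(swarm[i]), cost[i]
        alpha *= 0.97  # gradually shrink the random step
    return best_x, best_cost


def toy_objective(params):
    """Hypothetical stand-in for 'negative validation reward' of an RL run.

    Its minimum (cost 0) is at log10(lr) = -3 and discount = 0.95.
    """
    log_lr, discount = params
    return (log_lr + 3.0) ** 2 + (10.0 * (discount - 0.95)) ** 2


# Hypothetical search ranges: log10 learning rate and discount factor.
BOUNDS = [(-5.0, -1.0), (0.8, 0.999)]

if __name__ == "__main__":
    best, value = firefly_minimize(toy_objective, BOUNDS)
    print("best hyperparameters:", best, "cost:", value)
```

Because each objective evaluation would correspond to training and scoring an RL agent, the swarm size and iteration count directly set the tuning budget; the small defaults above are chosen only to keep the sketch fast.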