What question did this study set out to answer?

The aim is to improve voice quality in real-world applications by developing a reinforcement learning-based speech enhancement model that adapts to varying acoustic conditions.

March 23, 2026 · Open Access

Reinforcement-Learned Speech Enhancement Models for Real-Time Adaptive Human-Computer Interaction

Key Points

Developed a reinforcement learning (RL) agent, driven by a deep neural network, that dynamically adjusts voice enhancement settings.
Used the Firefly Algorithm to optimize RL hyperparameters, improving stability and convergence.
Designed a reward function based on metrics such as signal-to-noise ratio (SNR) and perceptual evaluation of speech quality (PESQ).
Achieved a 22% improvement in PESQ and a 16% improvement in short-time objective intelligibility (STOI) over DNN and spectral subtraction baselines.
Demonstrated better performance across stationary, non-stationary, and real-time noisy conditions.
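
To make the reward idea in the key points concrete, the following is a minimal sketch of how SNR, PESQ, and STOI could be folded into one scalar reward. It is not the authors' implementation: the function names, the weights, the 0 to 30 dB SNR normalization range, and the choice of a weighted sum are all assumptions made for illustration. PESQ and STOI are taken as precomputed inputs, since computing them requires dedicated tooling.

```python
import math


def snr_db(clean, enhanced):
    """SNR in dB, treating (enhanced - clean) as the residual noise."""
    signal_power = sum(c * c for c in clean)
    noise_power = sum((e - c) ** 2 for c, e in zip(clean, enhanced))
    if noise_power == 0.0:
        return float("inf")   # perfect reconstruction
    if signal_power == 0.0:
        return float("-inf")  # silent reference, nothing to preserve
    return 10.0 * math.log10(signal_power / noise_power)


def clamp01(x):
    """Clip a value to the interval [0, 1]."""
    return min(max(x, 0.0), 1.0)


def reward(snr, pesq, stoi, weights=(0.4, 0.3, 0.3), snr_range=(0.0, 30.0)):
    """Hypothetical scalar reward: weighted sum of normalized metrics.

    Assumptions (not from the paper): the weights, the SNR range, and
    the weighted-sum form. PESQ is rescaled from its raw P.862 range of
    -0.5 to 4.5; STOI is clipped to 0..1.
    """
    lo, hi = snr_range
    snr_n = clamp01((snr - lo) / (hi - lo))
    pesq_n = clamp01((pesq + 0.5) / 5.0)
    stoi_n = clamp01(stoi)
    w_snr, w_pesq, w_stoi = weights
    return w_snr * snr_n + w_pesq * pesq_n + w_stoi * stoi_n


if __name__ == "__main__":
    clean = [1.0, -1.0, 1.0, -1.0]
    enhanced = [1.1, -0.9, 1.1, -0.9]
    snr = snr_db(clean, enhanced)
    # The PESQ and STOI values below are made-up inputs for the demo.
    print(snr, reward(snr, pesq=3.2, stoi=0.85))
```

Normalizing each metric to a common 0 to 1 scale before weighting keeps any single metric from dominating the reward purely because of its units.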

Abstract

Virtual assistants, teleconferencing, and assistive technologies rely on real-time speech enhancement. Conventional deep learning-based enhancement methods often yield poor voice quality in real-world scenarios because they rely on static parameters and cannot adapt to dynamic acoustic conditions. This paper proposes RL-SEM, a reinforcement learning-based speech enhancement model that combines reinforcement learning (RL) with adaptive optimization to improve intelligibility and reduce noise. In the proposed framework, an RL agent driven by a deep neural network (DNN) dynamically adjusts voice enhancement settings using contextual information. The Firefly Algorithm (FA) optimizes the RL hyperparameters, improving learning stability, convergence, policy exploration, and robustness of adaptation. The reward function is built from objective metrics such as signal-to-noise ratio (SNR), perceptual evaluation of speech quality (PESQ), and short-time objective intelligibility (STOI), so that the agent is steered toward natural-sounding speech. In stationary, non-stationary, and real-time noisy conditions, RL-SEM outperforms DNN and spectral subtraction baselines by 22% in PESQ and 16% in STOI, as well as in latency. RL-SEM thus offers a flexible, adaptive architecture for real-time speech enhancement in next-generation human-computer interaction (HCI) applications.
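
As an illustration of how a Firefly Algorithm can tune hyperparameters, the sketch below runs a basic textbook FA over a normalized search space. It is a sketch under stated assumptions rather than the method used in the paper: the two hyperparameters (learning rate and discount factor), their ranges, the FA settings, and `toy_objective` are all hypothetical. In practice the objective would be the score of an RL training run, which is replaced here by a cheap stand-in with a known peak.

```python
import math
import random


def firefly_search(objective, dim, n_fireflies=15, n_iters=50,
                   beta0=1.0, gamma=1.0, alpha=0.2, seed=0):
    """Maximize `objective` over the unit cube [0, 1]^dim with a basic FA."""
    rng = random.Random(seed)
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_fireflies)]
    light = [objective(p) for p in pos]  # brightness = objective value
    for _ in range(n_iters):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if light[j] > light[i]:
                    # Attraction decays with squared distance to firefly j.
                    r2 = sum((a - b) ** 2 for a, b in zip(pos[i], pos[j]))
                    beta = beta0 * math.exp(-gamma * r2)
                    for d in range(dim):
                        step = (beta * (pos[j][d] - pos[i][d])
                                + alpha * (rng.random() - 0.5))
                        pos[i][d] = min(max(pos[i][d] + step, 0.0), 1.0)
                    light[i] = objective(pos[i])
        alpha *= 0.97  # shrink the random walk over time
    best = max(range(n_fireflies), key=lambda k: light[k])
    return pos[best], light[best]


def decode(u):
    """Map a unit-cube point to (learning_rate, discount). Ranges are assumed."""
    learning_rate = 10 ** (-5 + 4 * u[0])  # 1e-5 .. 1e-1, log scale
    discount = 0.9 + 0.099 * u[1]          # 0.9 .. 0.999
    return learning_rate, discount


def toy_objective(u):
    """Hypothetical stand-in for an RL training score; peaks near
    learning_rate = 1e-3 and discount = 0.99."""
    learning_rate, discount = decode(u)
    return (-(math.log10(learning_rate) + 3) ** 2
            - (100 * (discount - 0.99)) ** 2)


if __name__ == "__main__":
    best_u, best_score = firefly_search(toy_objective, dim=2)
    print(decode(best_u), best_score)
```

Searching in a normalized cube and decoding afterwards keeps the distance term in the attraction formula comparable across hyperparameters with very different scales. In this variant the brightest firefly never moves, so the best score found cannot get worse between iterations.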
