Speech enhancement through denoising is essential for maintaining signal intelligibility and quality in biometric speaker verification pipelines that operate in acoustically adverse conditions. Despite the proliferation of deep learning (DL) architectures for speech denoising, simultaneously optimizing noise attenuation, perceptual fidelity, and speaker-identity preservation remains an open problem. We address this gap in two ways. First, we benchmark three architecturally distinct DL-based enhancement models (Wave-U-Net, CMGAN, and U-Net) on three independent, domain-diverse corpora (SpEAR, VPQAD, and Clarkson) that the models never encountered during training. Second, we introduce commercial-grade VeriSpeak speaker-verification scores as a biometric evaluation dimension absent from prior comparative studies. Our experiments reveal a clear three-way trade-off: U-Net achieves the highest signal-to-noise ratio (SNR) gains (+61.44% on SpEAR, +67.05% on VPQAD, +235.3% on Clarkson) but sacrifices naturalness; CMGAN yields the best perceptual evaluation of speech quality (PESQ) values (3.33, 1.35, and 2.50, respectively), favoring listening-comfort applications; and Wave-U-Net delivers the strongest biometric fidelity (VeriSpeak improvements of +11.63%, +30.22%, and +29.24%) while offering competitive perceptual quality. These results show that model selection must be driven by the target deployment scenario, and they provide actionable guidance for improving biometric verification robustness under real-world noise.
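The SNR figures above are reported as percentage gains, and the abstract does not state the exact formula used. As a minimal sketch, assuming the conventional definition of SNR in decibels against a clean reference and assuming that a "gain" means relative change with respect to the unprocessed baseline, the computation could look like the following. The function names `snr_db` and `relative_gain_percent` are hypothetical and are not taken from the paper.

```python
import math


def snr_db(clean, estimate):
    """SNR in dB of `estimate` measured against the reference `clean`.

    Assumption: noise is defined as (estimate - clean), sample by sample.
    The paper may use a different SNR estimator.
    """
    if len(clean) != len(estimate):
        raise ValueError("signals must have the same length")
    signal_power = sum(c * c for c in clean)
    noise_power = sum((e - c) ** 2 for c, e in zip(clean, estimate))
    if noise_power == 0:
        return float("inf")  # estimate is identical to the reference
    return 10.0 * math.log10(signal_power / noise_power)


def relative_gain_percent(baseline, enhanced):
    """Relative change of a score, in percent of the baseline magnitude.

    Assumption: this is how a figure such as "+61.44%" would be derived.
    """
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return 100.0 * (enhanced - baseline) / abs(baseline)


if __name__ == "__main__":
    # Toy signals for illustration only; not data from the study.
    clean = [1.0, -1.0, 1.0, -1.0]
    noisy = [1.3, -0.7, 1.3, -0.7]
    denoised = [1.1, -0.9, 1.1, -0.9]
    before = snr_db(clean, noisy)
    after = snr_db(clean, denoised)
    print(f"SNR before: {before:.2f} dB, after: {after:.2f} dB")
    print(f"relative gain: {relative_gain_percent(before, after):+.2f}%")
```

The same relative-change helper would apply to any scalar score, such as a verification match score, under the same assumption. Note that a relative gain is sensitive to the baseline value: a low baseline SNR can produce a very large percentage from a modest absolute improvement.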
Khondkar et al. (Mon,) studied this question.