What question did this study set out to answer?

This research compares various deep learning models for enhancing speech clarity in noisy environments to improve biometric verification accuracy.

March 25, 2026 | Open Access

A Comparative Analysis of Deep-Learning-Based Speech Enhancement Models: Assessing Biometric Speaker Verification in Real-World Noisy Environments

Key Points

Benchmarked three distinct deep learning models (Wave-U-Net, CMGAN, U-Net).
Tested models on three independent, unseen corpora (SpEAR, VPQAD, Clarkson).
Evaluated models using speaker-verification scores as a new comparison metric.
U-Net showed the highest signal-to-noise ratio improvements (+61.44%, +67.05%, +235.3%).
CMGAN achieved the best perceptual quality scores (PESQ values of 3.33, 1.35, 2.50).
Wave-U-Net provided the strongest biometric fidelity improvements (+11.63%, +30.22%, +29.24%).
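
The trade-off in the key points can be written down as a small lookup, which may help when choosing a model for a given deployment goal. This is only an illustrative sketch: the dictionary layout, the metric keys, and the function name `recommend_model` are assumptions made here and do not come from the paper. The numbers are the reported figures, ordered as (SpEAR, VPQAD, Clarkson).

```python
# Reported best-in-category figures, ordered as (SpEAR, VPQAD, Clarkson).
# Keys and structure are hypothetical; only the values come from the summary.
REPORTED_BEST = {
    "snr_gain_pct": ("U-Net", (61.44, 67.05, 235.3)),
    "pesq": ("CMGAN", (3.33, 1.35, 2.50)),
    "verispeak_gain_pct": ("Wave-U-Net", (11.63, 30.22, 29.24)),
}


def recommend_model(priority: str) -> str:
    """Return the model reported as strongest for the given metric key."""
    if priority not in REPORTED_BEST:
        raise ValueError(
            f"unknown priority {priority!r}; choose from {sorted(REPORTED_BEST)}"
        )
    model, _scores = REPORTED_BEST[priority]
    return model


if __name__ == "__main__":
    # Print the reported winner for each priority.
    for key in REPORTED_BEST:
        print(key, "->", recommend_model(key))
```

A biometric pipeline would query `recommend_model("verispeak_gain_pct")`, while a listening-comfort application would query `recommend_model("pesq")`.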

Abstract

Speech enhancement through denoising is essential for maintaining signal intelligibility and quality in biometric speaker verification pipelines that operate in acoustically adverse conditions. Despite the proliferation of deep learning (DL) architectures for speech denoising, simultaneously optimizing noise attenuation, perceptual fidelity, and speaker-identity preservation remains an open problem. We address this gap by benchmarking three architecturally distinct DL-based enhancement models—Wave-U-Net, CMGAN, and U-Net—on three independent, domain-diverse corpora (SpEAR, VPQAD, and Clarkson) that the models never encountered during training and by introducing commercial-grade VeriSpeak speaker-verification scores as a biometric evaluation dimension absent from prior comparative studies. Our experiments reveal a clear three-way trade-off: U-Net achieves the highest signal-to-noise ratio (SNR) gains (+61.44% on SpEAR, +67.05% on VPQAD, +235.3% on Clarkson) but sacrifices naturalness; CMGAN yields the best perceptual evaluation of speech quality (PESQ) values (3.33, 1.35, and 2.50, respectively), favoring listening-comfort applications; and Wave-U-Net delivers the strongest biometric fidelity (VeriSpeak improvements of +11.63%, +30.22%, and +29.24%) while offering competitive perceptual quality. These results highlight that model selection must be driven by the target deployment scenario and provide actionable guidance for improving biometric verification robustness under real-world noise.
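
For readers unfamiliar with how an SNR gain might be expressed as a percentage, the sketch below computes SNR in decibels against a clean reference and then the relative change between the noisy input and an enhanced output. It rests on assumptions: this summary does not state the paper's exact SNR definition or the baseline behind its percentages, so the standard reference-based formula and the helper names `snr_db` and `relative_gain_pct` are used here purely for illustration. The "enhanced" signal is a synthetic stand-in, not the output of any of the three models.

```python
import numpy as np


def snr_db(clean: np.ndarray, estimate: np.ndarray) -> float:
    """SNR in dB, treating (estimate - clean) as the residual noise."""
    noise = estimate - clean
    signal_power = np.sum(clean ** 2)
    noise_power = np.sum(noise ** 2)
    return float(10.0 * np.log10(signal_power / noise_power))


def relative_gain_pct(before: float, after: float) -> float:
    """Percentage change of `after` relative to `before` (assumed convention)."""
    return 100.0 * (after - before) / abs(before)


# Synthetic demo: a 220 Hz tone plus white noise, one second at 16 kHz.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 16000, endpoint=False)
clean = np.sin(2.0 * np.pi * 220.0 * t)
noise = 0.5 * rng.standard_normal(t.shape)
noisy = clean + noise

# Stand-in for a denoiser: the residual noise amplitude is simply halved.
enhanced = clean + 0.5 * noise

snr_before = snr_db(clean, noisy)
snr_after = snr_db(clean, enhanced)
gain_pct = relative_gain_pct(snr_before, snr_after)

if __name__ == "__main__":
    print(f"SNR before: {snr_before:.2f} dB")
    print(f"SNR after:  {snr_after:.2f} dB")
    print(f"Relative gain: {gain_pct:+.2f}%")
```

Halving the residual noise amplitude raises the SNR by 20·log10(2), about 6 dB, whatever the starting level. The percentage, by contrast, depends strongly on the baseline SNR, which is one reason percentage gains on different corpora are hard to compare directly.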

Cite This Study

Khondkar et al. studied this question.

synapsesocial.com/papers/69c37adcb34aaaeb1a67cbc0 https://doi.org/10.3390/bdcc10030098
