March 3, 2026
Open Access

Linear Estimation of Network Specificity (LENS) for Identification of Speech- vs. Noise-Selective Filters in Deep Speech Enhancement

Key Points

Speech-specific filters predominate across almost all layers of the deep enhancement model, while noise-specific filters are relatively rare.
The pseudo-SNR (pSNR), a log-scaled ratio of the speech and noise weights, serves as a compact proxy for filter selectivity and quantifies the relative contributions of speech and noise.
Across the U-Net layers, pSNR values in speech-specific filters tend to increase toward deeper layers.
Filters adapt on a per-utterance basis, challenging assumptions of static selectivity in deep speech enhancement models.

Abstract

Understanding functional processing within deep speech enhancement models remains a challenge, particularly in characterizing how specific filters respond to different signal compositions. We propose an interpretability analysis framework for a U-Net-based speech enhancement model, built on a linear decomposition of feature maps into clean speech and noise contributions. Using learned weight coefficients for speech and noise, respectively, the method directly quantifies the relative contributions of both signal components. To summarize filter behavior across the network, we introduce the pseudo-SNR (pSNR), a log-scaled ratio of the speech weight to the noise weight that serves as a compact proxy for signal composition within feature activations. Based on these weights, we classify filters across the U-Net's layers as speech-specific, noise-specific, or non-specific, and analyze their roles in processing individual utterances, revealing trends in signal selectivity throughout the encoding-decoding cycle. Results reveal that speech-specific filters predominate across almost all layers, while noise-specific filters are relatively rare. Moreover, pSNR values in speech-specific filters tend to increase toward deeper network layers. Importantly, this speech-specificity is not static: filters adapt dynamically on a per-utterance basis rather than exhibiting fixed selectivity averaged across the dataset. This finding indicates that, while our framework relies on a linearity assumption, it remains well-suited for analyzing non-linear deep networks by decomposing overall processing through layer- and utterance-specific linear approximations. Overall, our approach offers a principled method to interpret and compare deep speech enhancement models based on internal activation behavior, with potential to guide architectural improvements.
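
To make the idea of a linear decomposition and a pseudo-SNR concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a filter's feature map for the noisy mixture can be approximated as a weighted sum of the feature maps obtained for the clean speech and the noise alone, that the two weights are found by ordinary least squares, that the pSNR uses a 20*log10 amplitude-ratio convention, and that a 3 dB margin separates the three filter classes. The function names, the symbols w_s and w_n, the log convention, and the margin are all illustrative assumptions that do not come from the abstract.

```python
import numpy as np


def fit_speech_noise_weights(feat_mix, feat_speech, feat_noise):
    """Least-squares fit of feat_mix ~ w_s * feat_speech + w_n * feat_noise.

    All three arrays hold one filter's activations for one utterance and
    must share the same shape. (Hypothetical helper, assumed estimator.)
    """
    # Two-column design matrix: flattened speech and noise activations.
    design = np.stack([feat_speech.ravel(), feat_noise.ravel()], axis=1)
    coeffs, *_ = np.linalg.lstsq(design, feat_mix.ravel(), rcond=None)
    return float(coeffs[0]), float(coeffs[1])


def pseudo_snr_db(w_s, w_n, eps=1e-12):
    """Log-scaled ratio of speech weight to noise weight, in dB (assumed convention)."""
    return 20.0 * np.log10((abs(w_s) + eps) / (abs(w_n) + eps))


def classify_filter(psnr_db, margin_db=3.0):
    """Label a filter from its pSNR; the 3 dB margin is an arbitrary assumption."""
    if psnr_db > margin_db:
        return "speech-specific"
    if psnr_db < -margin_db:
        return "noise-specific"
    return "non-specific"


# Toy usage with synthetic activations standing in for one filter, one utterance.
rng = np.random.default_rng(0)
speech_act = rng.standard_normal((4, 16))
noise_act = rng.standard_normal((4, 16))
mix_act = 2.0 * speech_act + 0.5 * noise_act

w_s, w_n = fit_speech_noise_weights(mix_act, speech_act, noise_act)
label = classify_filter(pseudo_snr_db(w_s, w_n))
print(label)
```

Repeating the fit separately for every utterance, rather than once over pooled data, is one way to observe the per-utterance variation in selectivity that the abstract describes.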

Authors

Eike J. Nustede

Jörn Anemüller

Institutions

Carl von Ossietzky Universität Oldenburg
