Understanding functional processing within deep speech enhancement models remains a challenge, particularly in characterizing how specific filters respond to different signal compositions. We propose an interpretability analysis framework for a U-Net-based speech enhancement model, built on a linear decomposition of feature maps into clean speech and noise contributions. Using learned weight coefficients for the speech and noise components, the method directly quantifies the relative contribution of each. To summarize filter behavior across the network, we introduce the pseudo-SNR (pSNR), a log-scaled ratio of the speech weight to the noise weight that serves as a compact proxy for signal composition within feature activations. Based on these weights, we classify filters across the U-Net's layers as speech-specific, noise-specific, or non-specific, and analyze their roles in processing individual utterances, revealing trends in signal selectivity throughout the encoding-decoding cycle. Results show that speech-specific filters predominate across almost all layers, while noise-specific filters are relatively rare. Moreover, pSNR values in speech-specific filters tend to increase toward deeper network layers. Importantly, this speech-specificity is not static: filters adapt dynamically on a per-utterance basis rather than exhibiting fixed selectivity averaged across the dataset. This finding indicates that, while our framework relies on a linearity assumption, it remains well-suited for analyzing non-linear deep networks by decomposing overall processing through layer- and utterance-specific linear approximations. Overall, our approach offers a principled method to interpret and compare deep speech enhancement models based on internal activation behavior, with potential to guide architectural improvements.
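As a rough illustration of the decomposition and pSNR idea described in the abstract, the following minimal sketch fits two scalar weights per filter by ordinary least squares and converts them to a log-scaled ratio. It is not the authors' implementation: the least-squares fit, the use of absolute weight values, the factor of 10 in the log scaling, the 3 dB classification threshold, and all function and variable names (`fit_weights`, `pseudo_snr_db`, `classify_filter`) are assumptions made for illustration only.

```python
import numpy as np


def fit_weights(feat_mix, feat_speech, feat_noise):
    """Least-squares fit of feat_mix ~ w_s * feat_speech + w_n * feat_noise.

    Assumption: each argument is the activation of one filter for the noisy
    mixture, the clean speech, and the noise alone, with identical shapes.
    """
    design = np.stack([feat_speech.ravel(), feat_noise.ravel()], axis=1)
    coeffs, *_ = np.linalg.lstsq(design, feat_mix.ravel(), rcond=None)
    return float(coeffs[0]), float(coeffs[1])


def pseudo_snr_db(w_s, w_n, eps=1e-12):
    """Log-scaled ratio of speech weight to noise weight (assumed form, in dB)."""
    return 10.0 * np.log10((abs(w_s) + eps) / (abs(w_n) + eps))


def classify_filter(psnr_db, threshold_db=3.0):
    """Label a filter from its pSNR; the threshold is a hypothetical choice."""
    if psnr_db > threshold_db:
        return "speech-specific"
    if psnr_db < -threshold_db:
        return "noise-specific"
    return "non-specific"


# Synthetic stand-ins for one filter's activations on a single utterance.
rng = np.random.default_rng(0)
feat_speech = rng.standard_normal((16, 32))
feat_noise = rng.standard_normal((16, 32))
feat_mix = 0.9 * feat_speech + 0.1 * feat_noise  # exactly linear by construction

w_s, w_n = fit_weights(feat_mix, feat_speech, feat_noise)
psnr = pseudo_snr_db(w_s, w_n)
label = classify_filter(psnr)
print(w_s, w_n, psnr, label)
```

Because the fit is performed per filter and per utterance, repeating it over many utterances would yield a distribution of pSNR values for each filter, which is one way the per-utterance variation mentioned in the abstract could be examined.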
Eike J. Nustede
Jörn Anemüller
Carl von Ossietzky Universität Oldenburg
Nustede et al. studied this question.
www.synapsesocial.com/papers/69a75c6dc6e9836116a254f9 — DOI: https://doi.org/10.5281/zenodo.17880322