March 7, 2026

Open Access

Exploring efficient attention strategies in conformer-based sound event detection

Key Points

The study aims to improve sound event detection accuracy while reducing computational cost by using an Efficient Conformer model.
It proposes integrating the Efficient Conformer architecture, which subsamples along the input sequence length.
Performance is evaluated with the Polyphonic Sound Detection Score (PSDS) on the DCASE Challenge 2023 Task 4 benchmark.
Experiments on the DESED validation dataset compare the model against standard Conformer and CRNN baselines.
The Efficient Conformer improves localization accuracy without extending the output length.
It outperforms the standard Conformer and CRNN baselines in class robustness (PSDS2).
Computational cost falls by over 69% while performance stays comparable to heavier models such as FDY+Conformer.

Abstract

Sound Event Detection (SED) requires models that can accurately localize and classify overlapping audio events within complex acoustic environments. Conformer-based architectures have demonstrated promising performance by leveraging self-attention to capture long-range dependencies. However, the effect of this global attention accumulates across layers, which can blur local temporal boundaries and reduce detection accuracy, especially for short or closely spaced events. Increasing the input sequence length can help recover temporal detail, but the quadratic complexity of Conformer self-attention makes this significantly more expensive to compute. To address this, we propose integrating the Efficient Conformer architecture, which introduces subsampling along the input sequence length, effectively reducing the temporal dimension within blocks. This design enables processing longer input sequences at finer temporal resolution, enhancing localization accuracy without extending the output length. Using the DCASE Challenge 2023 Task 4 benchmark, we evaluate system performance via the threshold-independent Polyphonic Sound Detection Score (PSDS), which measures both localization precision (PSDS1) and class robustness (PSDS2). Experiments on the DESED validation dataset demonstrate that the Efficient Conformer not only improves temporal resolution and long-range dependency modeling, but also outperforms standard Conformer and Convolutional Recurrent Neural Network (CRNN) baselines in PSDS2. Additionally, we explore lightweight attention mechanisms that employ squeeze-and-excitation blocks to emulate the frequency-axis translation invariance of Frequency Dynamic Convolutions (FDY). Our approach achieves performance comparable to heavier models such as FDY+Conformer while reducing computational cost by over 69%, a promising result for Conformer-based systems in terms of both precision and model efficiency.
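
To make the trade-off between sequence length and attention cost concrete, the following is a minimal sketch of how the number of query-key pairs scored by full self-attention changes when the time axis is subsampled. The function names, the 1000-frame sequence length, and the stride values are illustrative assumptions, not figures from the study, and the sketch counts only attention pairs, so it does not reproduce the 69% reduction reported for the full model.

```python
def attention_pair_count(seq_len: int) -> int:
    """Query-key pairs scored by full self-attention over seq_len frames."""
    return seq_len * seq_len


def relative_attention_cost(seq_len: int, stride: int) -> float:
    """Attention cost after subsampling the time axis by `stride`,
    expressed as a fraction of the cost at the original length."""
    # Ceiling division: a partial final window still yields one frame.
    reduced_len = -(-seq_len // stride)
    return attention_pair_count(reduced_len) / attention_pair_count(seq_len)


if __name__ == "__main__":
    frames = 1000  # hypothetical input length, chosen only for illustration
    for stride in (1, 2, 4):
        fraction = relative_attention_cost(frames, stride)
        print(f"stride {stride}: {fraction:.4f} of full attention cost")
```

Under these assumptions, halving the temporal dimension inside a block cuts the attention pair count to a quarter, which is why subsampling can make longer, finer-resolution inputs affordable.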
