Sound Event Detection (SED) requires models that can accurately localize and classify overlapping audio events in complex acoustic environments. Conformer-based architectures have shown promising performance by using self-attention to capture long-range dependencies. However, global attention accumulated across layers can blur local temporal boundaries and reduce detection accuracy, especially for short or closely spaced events. Increasing the input sequence length can help recover temporal detail, but the quadratic complexity of Conformer self-attention makes this computationally expensive. To address this, we propose integrating the Efficient Conformer architecture, which subsamples along the input sequence and thereby reduces the temporal dimension within its blocks. This design allows longer input sequences to be processed at finer temporal resolution, improving localization accuracy without extending the output length. We evaluate system performance on the DCASE Challenge 2023 Task 4 benchmark using the threshold-independent Polyphonic Sound Detection Score (PSDS), which measures both localization precision (PSDS1) and class robustness (PSDS2). Experiments on the DESED validation dataset show that the Efficient Conformer not only improves temporal resolution and long-range dependency modeling, but also outperforms standard Conformer and Convolutional Recurrent Neural Network (CRNN) baselines in PSDS2. Additionally, we explore lightweight attention mechanisms that employ squeeze-and-excitation blocks to emulate the frequency-axis translation invariance of Frequency Dynamic Convolutions (FDY). Our approach achieves performance comparable to heavier models such as FDY+Conformer while reducing computational cost by over 69%, showing promising results for Conformer-based systems in terms of both precision and model efficiency.
Barahona et al. studied this question.
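The trade-off between sequence length and attention cost described above can be illustrated with a rough count of the query-key pairs that self-attention scores. The sketch below is only an illustration under stated assumptions: the frame count, the number of blocks per stage, and the strides are hypothetical values chosen for the arithmetic, not the configuration used in this work, and the count ignores every other cost (convolutions, feed-forward layers, attention heads).

```python
def attention_pair_count(seq_len: int) -> int:
    """Query-key pairs scored by one full self-attention layer (quadratic in length)."""
    return seq_len * seq_len


def subsampled_lengths(seq_len: int, strides: list[int]) -> list[int]:
    """Sequence length seen by each stage when the strides are applied in order."""
    lengths = []
    current = seq_len
    for stride in strides:
        current = -(-current // stride)  # ceiling division by the stage stride
        lengths.append(current)
    return lengths


def total_pairs(seq_len: int, blocks_per_stage: list[int], strides: list[int]) -> int:
    """Sum of attention pairs over all blocks, given per-stage block counts and strides."""
    lengths = subsampled_lengths(seq_len, strides)
    return sum(b * attention_pair_count(n) for b, n in zip(blocks_per_stage, lengths))


# Hypothetical setup: 1000 input frames and 6 attention blocks in total.
input_frames = 1000

# Assumed baseline: all 6 blocks attend over the full-length sequence.
baseline = total_pairs(input_frames, blocks_per_stage=[6], strides=[1])

# Assumed subsampled variant: 3 stages of 2 blocks, halving the length
# before the second and third stages.
efficient = total_pairs(input_frames, blocks_per_stage=[2, 2, 2], strides=[1, 2, 2])

print(f"baseline pairs:   {baseline}")
print(f"subsampled pairs: {efficient}")
print(f"ratio:            {efficient / baseline:.4f}")
```

Under these assumed numbers, halving the sequence length quarters the attention cost of every later block, which is why progressive subsampling leaves room to feed a longer, finer-grained input for a similar budget.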