Abstract
State-space models (SSMs), exemplified by S4, have introduced a novel context modeling method by integrating state-space techniques into deep learning. Despite their effectiveness, SSMs struggle with global context modeling due to data-independent matrices. The Mamba model addresses this with data-dependent variants enabled by the S6 selective-scan algorithm, enhancing context modeling, especially for long sequences. However, Mamba-based architectures face significant parameter scalability challenges, limiting their utility in vision applications. This paper tackles the scalability issue of large SSMs for image classification and action recognition without relying on additional techniques like knowledge distillation. We analyze the distinct characteristics of Mamba-based and Attention-based models, proposing a Mamba-Attention interleaved architecture that enhances scalability, robustness, and performance. We demonstrate that the stable and efficient interleaved architecture resolves the scalability issue of Mamba-based architectures and increases robustness to common corruption artifacts. Our thorough evaluation on the ImageNet-1K, Kinetics-400, and Something-Something-v2 benchmarks demonstrates that our approach improves the accuracy of state-of-the-art Mamba-based architectures by up to +1.7%.
Hamid Suleman
Syed Talal Wasim
Muzammal Naseer
International Journal of Computer Vision
University of Bonn
Khalifa University of Science and Technology
Lamarr Institute for Machine Learning and Artificial Intelligence
Suleman et al. (Fri,) studied this question.
www.synapsesocial.com/papers/69e473ff010ef96374d8fc37 — DOI: https://doi.org/10.1007/s11263-026-02824-0