April 19, 2026 | Open Access

Distillation-free Scaling of Large State-Space Models for Images and Videos

Key Points

The central aim is to address the scalability challenges of large state-space models in vision applications.
Analyzed characteristics of Mamba-based and Attention-based models.
Proposed an interleaved architecture combining Mamba and Attention mechanisms.
Evaluated the approach on the standard benchmarks ImageNet-1K, Kinetics-400, and Something-Something-v2.
The interleaved architecture improved scalability and robustness against corruption artifacts.
Accuracy increased by up to 1.7% compared to state-of-the-art Mamba-based architectures.

Abstract

State-space models (SSMs), exemplified by S4, have introduced a novel context modeling method by integrating state-space techniques into deep learning. Despite their effectiveness, SSMs struggle with global context modeling due to data-independent matrices. The Mamba model addresses this with data-dependent variants enabled by the S6 selective-scan algorithm, enhancing context modeling, especially for long sequences. However, Mamba-based architectures face significant parameter scalability challenges, limiting their utility in vision applications. This paper tackles the scalability issue of large SSMs for image classification and action recognition without relying on additional techniques like knowledge distillation. We analyze the distinct characteristics of Mamba-based and Attention-based models, proposing a Mamba-Attention interleaved architecture that enhances scalability, robustness, and performance. We demonstrate that the stable and efficient interleaved architecture resolves the scalability issue of Mamba-based architectures and increases robustness to common corruption artifacts. Our thorough evaluation on the ImageNet-1K, Kinetics-400, and Something-Something-v2 benchmarks demonstrates that our approach improves the accuracy of state-of-the-art Mamba-based architectures by up to +1.7%.
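The contrast the abstract draws between data-independent SSM matrices and Mamba's data-dependent variants can be illustrated with a toy recurrence. The sketch below is not the authors' implementation and not the optimized S6 kernel: it assumes a scalar input sequence, a diagonal state, and hypothetical parameter names (`a_log`, `w_delta`, `w_b`, `w_c`) chosen only for illustration.

```python
import math


def ssm_scan(xs, a, b, c):
    """Data-independent recurrence: the same (a, b, c) is applied at every step."""
    h = [0.0] * len(a)
    ys = []
    for x in xs:
        # h_t = a * h_{t-1} + b * x_t  (elementwise, diagonal state)
        h = [a_i * h_i + b_i * x for a_i, h_i, b_i in zip(a, h, b)]
        # y_t = c . h_t
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return ys


def selective_scan(xs, a_log, w_delta, w_b, w_c):
    """Data-dependent recurrence: step size, B, and C are recomputed from each input.

    Assumed toy parameterization (hypothetical): A = -exp(a_log), and the
    step size, B_t, and C_t are simple scalar projections of x_t.
    """
    h = [0.0] * len(a_log)
    ys = []
    for x in xs:
        delta = math.log1p(math.exp(w_delta * x))  # softplus keeps the step positive
        a_bar = [math.exp(-delta * math.exp(al)) for al in a_log]  # decay in (0, 1)
        b_t = [w * x for w in w_b]  # input-dependent B
        c_t = [w * x for w in w_c]  # input-dependent C
        h = [ab * h_i + delta * b_i * x for ab, h_i, b_i in zip(a_bar, h, b_t)]
        ys.append(sum(c_i * h_i for c_i, h_i in zip(c_t, h)))
    return ys
```

In `ssm_scan` the response to an input is the same wherever it occurs in the sequence, whereas in `selective_scan` each token changes how much past state is retained and how strongly it is written and read.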

Authors

Hamid Suleman

Syed Talal Wasim

Muzammal Naseer

Journals

International Journal of Computer Vision

Institutions

University of Bonn

Khalifa University of Science and Technology

Lamarr Institute for Machine Learning and Artificial Intelligence
