The combination of Spiking Neural Networks (SNNs) with Vision Transformer architectures has attracted significant attention for its potential to deliver energy-efficient, high-performance computing. However, a substantial performance gap remains between SNN-based and ANN-based Transformer architectures. Existing methods propose spiking self-attention mechanisms that integrate successfully with SNNs, but the overall architectures they propose struggle to extract features effectively across different image scales. In this paper, we address this bottleneck and propose MSVIT, a novel spike-driven Transformer architecture that is the first to use multi-scale spiking attention (MSSA) to enhance the capabilities of spiking attention blocks. We validate our approach on several mainstream datasets. The experimental results show that MSVIT outperforms existing SNN-based models, positioning it as a state-of-the-art solution among SNN-Transformer architectures. The code is available at https://github.com/Nanhu-AI-Lab/MSViT.
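To make the idea of attending over several scales with binary spikes more concrete, the following is a minimal, dependency-free sketch. It is not the MSVIT implementation and is not taken from the linked repository: the function names, the Heaviside firing rule, the token average-pooling used to form coarser scales, the firing thresholds, and the sum-based fusion are all assumptions chosen for illustration. The only property it borrows from the spiking self-attention literature in general is that queries, keys, and values are binary, so the softmax can be dropped and the products reduce to spike counts.

```python
# Illustrative sketch only. Every name and design choice here is hypothetical
# and is NOT taken from the MSVIT paper or its code release.

def spike(x, threshold=1.0):
    """Heaviside firing: emit 1 where the input reaches the threshold, else 0."""
    return [[1 if v >= threshold else 0 for v in row] for row in x]

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    """Plain matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def pool_tokens(x, window):
    """Average consecutive tokens (rows) in groups of `window` (assumed scale operator)."""
    pooled = []
    for i in range(0, len(x), window):
        group = x[i:i + window]
        pooled.append([sum(col) / len(group) for col in zip(*group)])
    return pooled

def spiking_attention(q, k, v, scale):
    """Softmax-free attention on binary spikes: spike(scale * (Q K^T) V)."""
    scores = matmul(q, transpose(k))   # overlap counts between spike patterns
    out = matmul(scores, v)            # accumulate value spikes
    return spike([[scale * val for val in row] for row in out])

def multi_scale_spiking_attention(x, windows=(1, 2), scale=0.5):
    """Attend from full-resolution queries to keys/values at several token scales."""
    q = spike(x)
    branches = []
    for w in windows:
        # A pooled token fires if at least half of its window fired (assumption).
        kv = spike(pool_tokens(q, w), threshold=0.5)
        branches.append(spiking_attention(q, kv, kv, scale))
    # Fuse scales by summing spike counts element-wise (assumption).
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*branches)]

# Four tokens with two features each, interpreted as membrane potentials.
x = [[1.2, 0.1], [0.3, 1.5], [1.1, 1.4], [0.0, 0.2]]
fused = multi_scale_spiking_attention(x)
```

In this toy setup each output entry counts how many scales produced a spike for that token and feature, so a token that never fires (the last row of `x`) stays silent at every scale.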
Hua et al. studied this question.
www.synapsesocial.com/papers/68f58f68ece7a5b64f471312 — DOI: https://doi.org/10.48550/arxiv.2505.14719
Wei Hua
Chenlin Zhou
Jibin Wu