Vision transformers (ViTs) excel at modeling global context through self-attention. However, standard self-attention has computational complexity quadratic in the number of tokens, which restricts its practical use in high-resolution or latency-sensitive tasks. Existing methods achieve linear complexity via local window constraints or additive approximations, but they often compromise long-range dependency modeling. To address this issue, we propose the channel-guided additive attention vision transformer (CGA-ViT), which jointly optimizes multi-scale feature extraction and efficient global context modeling. First, we propose multi-scale dilated feature embedding (MDFE), which combines multi-scale sampling with spatial feature embedding so that, in the early stages, the receptive field can be expanded and fine-grained features captured simply by adjusting the dilation rate. Second, we design channel-guided additive attention (CGA), which dynamically modulates key vectors using query-derived descriptors, enabling long-range semantic interactions while keeping complexity linear. We adopt a hierarchical structure: the shallow layers use CGA for local-global interaction, and the deep layers use efficient additive attention for global integration. Evaluations on ImageNet-1K show that CGA-ViT achieves 84.0% Top-1 accuracy with 4.7 GFLOPs, outperforming Swin-T (81.3%) and ConvNeXt-T (82.1%) by 2.7 and 1.9 percentage points, respectively, under comparable computational costs. Ablation experiments verify the effectiveness of MDFE and CGA, which together account for 65.0% of the performance gain, with the remainder coming from token-level supervision. Overall, CGA-ViT balances the intrinsic tradeoff between efficiency and global modeling capability, significantly boosts visual recognition performance without extra computational overhead, and provides an efficient solution for lightweight ViT design.
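To make the linear-complexity claim concrete, the following is a minimal sketch of additive attention with a query-derived channel gate on the keys, written in plain Python. It is not the authors' formulation: the abstract gives no equations, so the function name `additive_attention`, the scoring vector `w`, the softmax pooling, and in particular the sigmoid gate computed from the mean query are all assumptions chosen for illustration. The point it shows is structural: every step touches each of the N tokens once, so cost grows as O(N · d) and no N × N attention matrix is ever formed.

```python
import math
import random


def additive_attention(Q, K, w, gate_keys=True):
    """Illustrative linear-time additive attention (hypothetical sketch).

    Q, K: lists of N token vectors, each a list of d floats.
    w:    assumed learned scoring vector of length d.
    """
    n, d = len(Q), len(Q[0])

    # 1) One scalar score per token: O(N * d), no N x N matrix.
    scores = [sum(qc * wc for qc, wc in zip(q, w)) / math.sqrt(d) for q in Q]

    # 2) Softmax over tokens, shifted by the max for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]

    # 3) Pool all queries into a single global query vector: O(N * d).
    g = [sum(alpha[i] * Q[i][c] for i in range(n)) for c in range(d)]

    # 4) ASSUMPTION: "channel guidance" modeled as a per-channel sigmoid
    #    gate derived from the mean query, used to rescale every key.
    if gate_keys:
        mean_q = [sum(Q[i][c] for i in range(n)) / n for c in range(d)]
        gate = [1.0 / (1.0 + math.exp(-v)) for v in mean_q]
    else:
        gate = [1.0] * d

    # 5) Broadcast the global query onto the gated keys and add the
    #    query back as a residual: O(N * d).
    return [[g[c] * gate[c] * K[i][c] + Q[i][c] for c in range(d)]
            for i in range(n)]


# Usage with random toy data (16 tokens, 8 channels).
random.seed(0)
n_tokens, dim = 16, 8
Q = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
K = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
w = [random.gauss(0, 1) for _ in range(dim)]
out = additive_attention(Q, K, w)
print(len(out), len(out[0]))
```

Doubling the number of tokens doubles the work in each of the five steps, whereas standard self-attention would quadruple the size of its score matrix. Passing `gate_keys=False` reduces the sketch to ungated additive attention, which is one way to read the abstract's distinction between CGA in shallow layers and plain additive attention in deep layers.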
Zhao et al. studied this question.
www.synapsesocial.com/papers/698d6ebb5be6419ac0d54702 — DOI: https://doi.org/10.3390/app16041740
Yayue Zhao
Jingli Miao
Zhenping Li
Applied Sciences
University of Science and Technology Beijing
Academy of Military Medical Sciences
Hebei University of Engineering