What question did this study set out to answer?

The aim is to optimize multi-scale feature extraction and global context modeling in vision transformers while reducing computational complexity.

February 12, 2026

Open Access

CGA-ViT: Channel-Guided Additive Attention for Efficient Vision Recognition

Key points

Developed channel-guided additive attention (CGA) for long-range semantic interactions at linear complexity.
Introduced multi-scale dilated feature embedding (MDFE) to expand the receptive field and capture fine-grained features.
Adopted a hierarchical structure: CGA for local-global interactions in shallow layers, efficient additive attention in deep layers.
Evaluated on ImageNet-1K against existing models such as Swin-T and ConvNeXt-T.
CGA-ViT achieved 84.0% Top-1 accuracy with only 4.7 GFLOPs.
It outperformed Swin-T (81.3%) and ConvNeXt-T (82.1%) by 2.7 and 1.9 percentage points, respectively.
MDFE and CGA together accounted for 65.0% of the performance gain; the rest came from token-level supervision.
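
The reported margins can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in this summary. The dictionary and function names are illustrative, and the 35.0% remainder assumes, as the abstract states, that everything not attributed to MDFE and CGA comes from token-level supervision.

```python
# Top-1 accuracies on ImageNet-1K (percent), as quoted in this summary.
top1 = {"CGA-ViT": 84.0, "Swin-T": 81.3, "ConvNeXt-T": 82.1}


def gap_points(model, baseline):
    """Percentage-point gap, rounded to one decimal to hide float noise."""
    return round(top1[model] - top1[baseline], 1)


swin_gap = gap_points("CGA-ViT", "Swin-T")
convnext_gap = gap_points("CGA-ViT", "ConvNeXt-T")

# Share of the gain not attributed to MDFE and CGA (assumed to be
# token-level supervision, per the abstract).
remaining_share = round(100.0 - 65.0, 1)

print(swin_gap, convnext_gap, remaining_share)  # 2.7 1.9 35.0
```

These are differences in percentage points, not relative improvements. The summary does not give the baseline accuracy for the ablation, so the 65.0% share cannot be converted into absolute points here.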

Abstract

Vision transformers (ViTs) excel at global context modeling through self-attention. However, standard self-attention has quadratic computational complexity in the number of tokens, which restricts its practical use in high-resolution or latency-sensitive tasks. Existing methods achieve linear complexity via local window constraints or additive approximations, but they often compromise long-range dependency modeling. To address this issue, we propose the channel-guided additive attention vision transformer (CGA-ViT), which jointly optimizes multi-scale feature extraction and efficient global context modeling. First, we propose multi-scale dilated feature embedding (MDFE): by combining multi-scale sampling with spatial feature embedding, it expands the receptive field and captures fine-grained features in the early stages simply by adjusting the dilation rate. Second, we design channel-guided additive attention (CGA), which dynamically modulates key vectors using query-derived descriptors, enabling long-range semantic interactions while keeping complexity linear. We adopt a hierarchical structure: shallow layers use CGA for local-global interactions, and deep layers use efficient additive attention for global integration. Evaluations on ImageNet-1K show that CGA-ViT achieves 84.0% Top-1 accuracy with 4.7 GFLOPs, outperforming Swin-T (81.3%) and ConvNeXt-T (82.1%) by 2.7 and 1.9 percentage points, respectively, at comparable computational cost. Ablation experiments verify the effectiveness of MDFE and CGA, which together account for 65.0% of the performance gain, with the rest coming from token-level supervision. Overall, CGA-ViT balances the tradeoff between efficiency and global modeling capability, improves visual recognition performance without extra computational overhead, and provides an efficient solution for lightweight ViT design.
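
The abstract gives no equations, so the following is only a rough sketch of the two ideas under stated assumptions, not the authors' formulation. `effective_kernel_size` uses the standard relation for dilated convolutions. `channel_gated_additive_attention` is a hypothetical illustration of how a query-derived descriptor could modulate keys channel by channel without forming a token-by-token matrix. The scoring vector `w`, the sigmoid gate, and the residual connection are all assumptions made for this example.

```python
import math


def effective_kernel_size(kernel_size, dilation):
    """Span covered by one dilated convolution kernel along one axis."""
    return dilation * (kernel_size - 1) + 1


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def channel_gated_additive_attention(queries, keys, w):
    """Hypothetical linear-time attention sketch (not the paper's exact CGA).

    queries, keys: n token vectors, each a list of d floats.
    w: assumed learned scoring vector of length d.
    """
    n, d = len(queries), len(queries[0])
    # One scalar score per token: O(n * d) work, no n x n attention matrix.
    scores = [sum(q[c] * w[c] for c in range(d)) / math.sqrt(d) for q in queries]
    alpha = softmax(scores)
    # Pool all queries into a single descriptor with one value per channel.
    descriptor = [sum(alpha[i] * queries[i][c] for i in range(n)) for c in range(d)]
    # Assumed gate: squash the descriptor into (0, 1) per channel.
    gate = [1.0 / (1.0 + math.exp(-g)) for g in descriptor]
    # Modulate every key channel-wise and add the query as a residual.
    return [[keys[i][c] * gate[c] + queries[i][c] for c in range(d)] for i in range(n)]


# A 3-wide kernel covers 3, 5, and 7 positions at dilation rates 1, 2, and 3.
print([effective_kernel_size(3, d) for d in (1, 2, 3)])  # [3, 5, 7]
```

Every step in the attention sketch touches each token once, so the cost grows linearly with the number of tokens, whereas standard self-attention compares every token pair. The real model would also include learned projections and normalization, which are omitted here.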

Cite this study

Zhao et al. studied this question.

www.synapsesocial.com/papers/698d6ebb5be6419ac0d54702 — DOI: https://doi.org/10.3390/app16041740

Authors

Yayue Zhao

Jingli Miao

Zhenping Li

Journals

Applied Sciences

Institutions

University of Science and Technology Beijing

Academy of Military Medical Sciences

Hebei University of Engineering
