Hybrid Vision Transformers (HybridViTs), which integrate convolutional neural networks (CNNs) with Transformer blocks, offer both local and global feature extraction capabilities, achieving high performance across a range of computer vision tasks. However, the substantial computational asymmetry between lightweight CNN blocks and compute-intensive Transformer blocks makes it difficult to optimize and accelerate both within a single hardware architecture. To address these challenges, we propose FLASH, a power-efficient field-programmable gate array (FPGA)-based accelerator tailored for CNN-Transformer hybrid networks. FLASH reduces quantization overhead by consolidating redundant quantization-dequantization operations into a single requantization step, and it enables 8-bit integer-only computation for residual connections through proper scaling factor handling. To further optimize for hardware efficiency, FLASH introduces hardware-friendly linear approximations of nonlinear functions such as Swish and Softmax. By precomputing row-wise max values through offline calibration, we eliminate both max-value search logic and intermediate memory buffering overhead, while reusing shared integer-exponential units to minimize resource consumption. Architecturally, FLASH employs a two-stage pipeline: Stage 1 eliminates external DRAM access using a fully pipelined MobileNetV2 backbone, while Stage 2 accelerates Transformer and convolutional components through specialized compute units and dataflow optimizations. Experimental evaluation using MobileViT (MViT)-xxs on a Xilinx VCU118 FPGA demonstrates that FLASH incurs only a 0.84% accuracy drop on ImageNet-1K compared to the FP32 baseline, while achieving up to 16.8× lower power consumption and a 26.3× improvement in energy efficiency relative to CPU/GPU implementations. These results establish FLASH as an energy-efficient hardware accelerator for real-time inference of HybridViT models on edge devices.
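The requantization idea in the abstract can be illustrated with a minimal sketch. It is not the FLASH implementation: the function names, the symmetric int8 scheme, and the 16-bit fixed-point shift are all assumptions chosen for illustration. The sketch folds a dequantize-then-quantize pair into one integer multiply and shift, and applies the same trick to a residual addition whose two branches carry different scaling factors.

```python
def clamp_int8(v):
    """Saturate an integer to the signed 8-bit range."""
    return max(-128, min(127, v))


def quantize(x, scale):
    """Symmetric int8 quantization of a real value (illustrative baseline)."""
    return clamp_int8(round(x / scale))


def requant_two_step(q_in, s_in, s_out):
    """Baseline: dequantize to floating point, then quantize again."""
    return quantize(q_in * s_in, s_out)


def fixed_point_multiplier(s_in, s_out, shift=16):
    """Fold the scale ratio into one integer multiplier, computed offline.

    The shift width of 16 bits is an assumption, not a value from the paper.
    """
    return round((s_in / s_out) * (1 << shift))


def requant_single_step(q_in, multiplier, shift=16):
    """Integer-only requantization: multiply, add half an LSB, shift right."""
    acc = q_in * multiplier + (1 << (shift - 1))
    return clamp_int8(acc >> shift)


def residual_add_int8(q_a, m_a, q_b, m_b, shift=16):
    """Integer-only residual add.

    Each branch is rescaled to the shared output scale by its own
    multiplier; the sum is rounded and saturated once at the end.
    """
    acc = q_a * m_a + q_b * m_b + (1 << (shift - 1))
    return clamp_int8(acc >> shift)


if __name__ == "__main__":
    m = fixed_point_multiplier(0.02, 0.05)
    for q in (-128, -7, 0, 33, 127):
        print(q, requant_two_step(q, 0.02, 0.05), requant_single_step(q, m))
```

In this sketch the only floating-point work is the offline computation of the multipliers, so the runtime path uses integer multiplies, adds, and shifts. The single-step result can differ from the floating-point baseline by at most one least significant bit because of rounding and the finite multiplier precision.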
Naeun Kim
Beom Jin Kang
Hae In Lee
IEEE Transactions on Neural Networks and Learning Systems
University of Saskatchewan
Seoul National University of Science and Technology
Kim et al. studied this question.
www.synapsesocial.com/papers/69d0ae68659487ece0fa45fe — DOI: https://doi.org/10.1109/tnnls.2026.3677427