Hybrid CNN-Transformer models have emerged as a promising approach to industrial defect detection, aiming to combine the complementary strengths of Convolutional Neural Networks (CNNs) and Transformers. This systematic review proposes a dual-path taxonomy that classifies hybrid models by fusion strategy: structural fusion and modular fusion. Structural fusion strategies (parallel, sequential, and hierarchical) describe how information flows between architectural components. Modular fusion strategies integrate Transformer components into specific stages of an object detection architecture: the backbone, the neck, the head, or several stages at once (multi-stage embedding). The review systematically analyzes hybrid models across industrial sectors, including Printed Circuit Boards (PCBs), steel surfaces, fabric textiles, transmission lines, and railways. It also compares deployment feasibility in terms of inference latency, model size, and edge readiness. Finally, it identifies research gaps and outlines future directions, including lightweight design, synthetic data expansion, and domain-transfer techniques. The findings highlight the potential of hybrid CNN-Transformer models to improve defect detection accuracy on small, occluded, and irregular defects in complex industrial environments. However, further research is required to optimize model efficiency, generalization, and real-world deployment feasibility.
Assad et al. studied this question.
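The three structural fusion strategies named above differ only in how the convolutional and attention paths are wired together. The toy sketch below illustrates that wiring on a 1-D signal using only the Python standard library. It is not taken from the review: the function names (`conv_branch`, `attention_branch`, `parallel_fusion`, `sequential_fusion`, `hierarchical_fusion`), the fixed smoothing kernel, the identity attention projections, the additive merge, and the reading of "hierarchical" as repeated stages with downsampling are all assumptions made for illustration.

```python
import math


def conv_branch(x, kernel=(0.25, 0.5, 0.25)):
    """CNN stand-in: 1-D convolution with zero padding (local features)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(x):  # positions outside the signal count as zero
                acc += w * x[j]
        out.append(acc)
    return out


def attention_branch(x):
    """Transformer stand-in: single-head self-attention over scalar tokens.

    Queries, keys, and values are the raw inputs (identity projections),
    which is an assumption made to keep the sketch short.
    """
    out = []
    for q in x:
        scores = [q * k for k in x]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, x)) / total)
    return out


def parallel_fusion(x):
    """Both branches see the same input; outputs are merged by addition."""
    local, global_ctx = conv_branch(x), attention_branch(x)
    return [a + b for a, b in zip(local, global_ctx)]


def sequential_fusion(x):
    """Convolution first, then attention over the convolved features."""
    return attention_branch(conv_branch(x))


def hierarchical_fusion(x, stages=2):
    """Assumed reading of 'hierarchical': fuse at each scale, then downsample.

    Returns one feature list per stage, from fine to coarse.
    """
    features = []
    current = x
    for _ in range(stages):
        current = sequential_fusion(current)
        features.append(current)
        # Average-pool adjacent pairs to halve the resolution.
        current = [(current[i] + current[i + 1]) / 2
                   for i in range(0, len(current) - 1, 2)]
    return features


if __name__ == "__main__":
    signal = [0.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 0.0]  # one "defect" spike
    print("parallel:    ", parallel_fusion(signal))
    print("sequential:  ", sequential_fusion(signal))
    print("hierarchical:", hierarchical_fusion(signal))
```

In a real detector each branch would be a learned 2-D module, and the merge could be concatenation or a gated sum rather than plain addition. The sketch only shows the difference in information flow: the same input feeds both paths in parallel fusion, one path feeds the other in sequential fusion, and hierarchical fusion repeats the pattern across scales.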