To address key problems in packaging appearance defect detection, namely difficult cross-modal feature alignment, strong interference from printed textures, and poor handling of multi-scale defects, this paper proposes PackNet-MF, a multi-modal defect detection network that fuses RGB images with point cloud data (PCD). PackNet-MF uses a dual-branch Swin Transformer backbone and adds three modules, each targeting one of these problems. The Cross-modal Attention Fusion Mechanism (CMA) addresses deformation issues through dynamic feature alignment, and its adaptive weight adjustment enables more accurate matching of multi-modal data. The Defect Context Awareness Mechanism (DCA) suppresses texture interference by using semantic modeling to distinguish defects from interference. The Multi-scale Spatial Adaptive Aggregation Mechanism (MS-AA) adjusts its receptive field dynamically to detect defects across scales, from micro-scratches to large-area damage. Experimental results on the self-constructed Pack-Defect dataset show that PackNet-MF achieves an F1-score of 0.9012, a significant improvement over baseline methods such as U-Net, and a mean Intersection over Union (mIoU) of 0.8473, indicating accurate localization of tiny defects. Transfer experiments on the public NEU-Seg dataset further verify the model's strong generalization ability.
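For readers unfamiliar with the two reported metrics, the following minimal sketch shows one common way to compute a pixel-level F1-score and mIoU from flattened label masks. The exact evaluation protocol behind the reported figures is not stated here, so the pixel-level counting, the per-class averaging, and the function names `f1_score` and `mean_iou` are assumptions made for illustration only.

```python
def f1_score(pred, target):
    """Pixel-level F1 for flat binary masks (sequences of 0/1).

    Assumption: 1 marks a defect pixel, 0 marks background.
    """
    tp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, target) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, target) if p == 0 and t == 1)
    denom = 2 * tp + fp + fn
    # If neither mask contains a defect pixel, treat the match as perfect.
    return 2 * tp / denom if denom else 1.0


def mean_iou(pred, target, num_classes):
    """Mean Intersection over Union, averaged over classes present in either mask."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 1.0


# Toy example: six pixels, two classes (0 = background, 1 = defect).
pred = [1, 1, 0, 0, 1, 0]
target = [1, 0, 0, 0, 1, 1]
print(f1_score(pred, target))     # 2 TP, 1 FP, 1 FN
print(mean_iou(pred, target, 2))  # IoU is 2/4 for each class
```

In the toy example each class has an intersection of 2 pixels and a union of 4, so both per-class IoU values are 0.5, while the F1-score is 4/6.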
Wang et al. (Wed,) studied this question.