Infrared-visible multimodal object detection plays a vital role in perception under complex environmental conditions. However, existing approaches remain limited in suppressing modality redundancy and in aligning features across modalities. To address these issues, this paper proposes a multimodal fusion detection framework that integrates a Cross-modal Information Bottleneck (CIB) with a Minimum Redundancy Transformation (MRT). The CIB module uses a compress-decompose-reconstruct pathway to selectively preserve the semantics shared across modalities, thereby improving cross-modal consistency. The MRT module applies sparse structural transformations along both the channel and spatial dimensions, suppressing modality redundancy and strengthening boundary awareness for target regions. In addition, we design a dual-phase training strategy based on modality isolation and fusion to stabilize the cooperative representation process. Extensive experiments on two widely used datasets, KAIST and LLVIP, validate the effectiveness of the proposed approach. Specifically, our method improves mAP on the KAIST nighttime scenario from 42.8% (baseline) to 44.1% and achieves an AP@75 of 80.0% under low-light conditions on LLVIP, outperforming the previous state of the art by 2.4%. Our method also performs consistently in robustness evaluations under occlusion and illumination disturbances, highlighting its potential for multimodal perception applications.
Tan et al. studied this question.
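To make the idea of sparse transformations along the channel and spatial dimensions more concrete, the following is a minimal sketch of top-k gating on a toy feature map. It is not the paper's MRT module: the function names `top_k_mask` and `sparse_gate`, the scoring by mean absolute activation, and the hard top-k selection are all assumptions chosen for illustration, since the abstract does not specify how the transformation is built.

```python
from typing import List

# Hypothetical layout for illustration: features[channel][row][col].
FeatureMap = List[List[List[float]]]


def top_k_mask(scores: List[float], k: int) -> List[int]:
    """Return a 0/1 mask keeping the k largest scores (ties broken by index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    kept = set(order[:k])
    return [1 if i in kept else 0 for i in range(len(scores))]


def sparse_gate(x: FeatureMap, k_channels: int, k_positions: int) -> FeatureMap:
    """Zero all but the strongest channels, then all but the strongest positions.

    Assumed scoring rule (not taken from the paper): absolute activation.
    """
    n_ch, n_rows, n_cols = len(x), len(x[0]), len(x[0][0])

    # Channel score: mean absolute activation of each channel.
    ch_scores = [
        sum(abs(v) for row in ch for v in row) / (n_rows * n_cols) for ch in x
    ]
    ch_mask = top_k_mask(ch_scores, k_channels)

    # Spatial score: absolute activation summed over the kept channels only.
    pos_scores = [
        sum(abs(x[c][r][q]) for c in range(n_ch) if ch_mask[c])
        for r in range(n_rows)
        for q in range(n_cols)
    ]
    pos_mask = top_k_mask(pos_scores, k_positions)

    # Apply both masks; suppressed entries become zero.
    return [
        [
            [x[c][r][q] * ch_mask[c] * pos_mask[r * n_cols + q]
             for q in range(n_cols)]
            for r in range(n_rows)
        ]
        for c in range(n_ch)
    ]


if __name__ == "__main__":
    # Toy fused feature map: 3 channels of 2x2 activations.
    fused = [
        [[0.9, 0.1], [0.0, 0.8]],      # strong channel
        [[0.05, 0.02], [0.01, 0.03]],  # weak, largely redundant channel
        [[0.4, 0.0], [0.1, 0.7]],      # moderately strong channel
    ]
    print(sparse_gate(fused, k_channels=2, k_positions=2))
```

In this toy case the weak second channel is zeroed entirely and only the two highest-energy positions survive in the remaining channels. A trainable module would more plausibly use learned, differentiable gates instead of a hard top-k rule; the hard rule is used here only because it keeps the sketch short and easy to check by hand.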