Object detection in remote sensing imagery is fundamental to Earth observation, yet it remains challenging when it relies on a single sensing modality. Optical imagery provides rich spatial and textural detail but is highly sensitive to illumination and adverse weather; Synthetic Aperture Radar (SAR) offers robust all-weather acquisition but suffers from speckle noise and limited semantic interpretability. To address these limitations, we apply foundation models to optical–SAR object detection through a novel gated–guided fusion approach. Integrating the transferable, generalizable representations of foundation models into the detection pipeline enhances semantic expressiveness and cross-environment robustness. Specifically, we design a gated–guided fusion mechanism that selectively merges cross-modal features with foundation-model priors, enabling the network to prioritize informative cues while suppressing unreliable signals in complex scenes. Furthermore, we propose a dual-stream architecture that combines attention mechanisms and State Space Models (SSMs) to capture local and long-range dependencies simultaneously. Extensive experiments on the large-scale M4-SAR dataset demonstrate that our method achieves state-of-the-art performance, significantly improving detection accuracy and robustness under challenging sensing conditions.
Jiang et al. (Wed,) studied this question.
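
The abstract does not give the exact form of the gated–guided fusion mechanism, so the following is only a minimal sketch of a generic gated fusion, written under stated assumptions: a sigmoid gate computed from the optical feature, the SAR feature, and a foundation-model prior decides, element by element, how much of each modality to keep. The function names (`gated_fusion`, `stable_sigmoid`), the scalar weights (`w_opt`, `w_sar`, `w_prior`, `bias`), and the use of flat feature vectors are all hypothetical choices for illustration, not the authors' implementation.

```python
import math
from typing import List, Sequence


def stable_sigmoid(z: float) -> float:
    """Logistic function written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gated_fusion(
    optical: Sequence[float],
    sar: Sequence[float],
    prior: Sequence[float],
    w_opt: float,
    w_sar: float,
    w_prior: float,
    bias: float = 0.0,
) -> List[float]:
    """Blend optical and SAR features element-wise with a learned-style gate.

    Assumed (hypothetical) formulation:
        gate_i  = sigmoid(w_opt * o_i + w_sar * s_i + w_prior * p_i + bias)
        fused_i = gate_i * o_i + (1 - gate_i) * s_i
    A gate near 1 trusts the optical feature; a gate near 0 trusts SAR.
    """
    if not (len(optical) == len(sar) == len(prior)):
        raise ValueError("optical, sar, and prior must have the same length")

    fused = []
    for o, s, p in zip(optical, sar, prior):
        gate = stable_sigmoid(w_opt * o + w_sar * s + w_prior * p + bias)
        fused.append(gate * o + (1.0 - gate) * s)
    return fused


# Toy usage: the prior favors optical at position 0 and SAR at position 1.
optical_feat = [0.9, 0.1]
sar_feat = [0.2, 0.8]
prior_feat = [1.0, -1.0]
fused_feat = gated_fusion(optical_feat, sar_feat, prior_feat,
                          w_opt=0.0, w_sar=0.0, w_prior=4.0)
print(fused_feat)
```

In this toy setting the gate depends only on the prior, so the fused vector follows the optical feature where the prior is positive and the SAR feature where it is negative. In a real detector the weights would be learned, and the gate would typically be produced by convolutional or linear layers over feature maps rather than by scalar weights.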