This paper presents an adversarial defense framework that exploits the non-transferability of adversarial attacks across multi-modal foundation models. Although Contrastive Language–Image Pre-training (CLIP) models demonstrate remarkable zero-shot capabilities, they remain vulnerable to adversarial examples. Adversarial fine-tuning is widely adopted as a standard defense, yet the robustness it provides against sophisticated white-box attacks is often insufficient. To address this limitation, we boost the robustness of an adversarially fine-tuned model by pairing it with a pre-trained auxiliary model and leveraging attack non-transferability between the two. Specifically, we construct a common embedding space and introduce a detection scheme that identifies the attack target based on feature distances. By adaptively switching the prediction output according to the detection result, we mitigate the attack. Experimental results demonstrate that our approach outperforms state-of-the-art adversarial fine-tuning methods in terms of adversarial robustness.
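The abstract does not spell out the detection rule, so the following is only a minimal sketch of what distance-based switching between two models could look like, not the authors' method. It assumes both models' image features have already been mapped into a common embedding space, and it substitutes a simple stand-in rule: the model whose feature lies farther from its nearest class prototype is treated as the suspected attack target, and the other model's prediction is returned. The names `nearest_prototype`, `switched_prediction`, `prototypes`, and `margin` are hypothetical and chosen for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest_prototype(z, prototypes):
    """Return (label, distance) of the class prototype closest to z."""
    return min(
        ((label, euclidean(z, p)) for label, p in prototypes.items()),
        key=lambda pair: pair[1],
    )


def switched_prediction(z_robust, z_aux, prototypes, margin=0.1):
    """Pick a prediction from one of two models using feature distances.

    z_robust: feature from the adversarially fine-tuned model (common space).
    z_aux:    feature from the pre-trained auxiliary model (common space).
    Returns (predicted_label, suspected_target), where suspected_target is
    "robust", "auxiliary", or None when no attack is suspected.
    """
    label_r, dist_r = nearest_prototype(z_robust, prototypes)
    label_a, dist_a = nearest_prototype(z_aux, prototypes)

    # Assumed stand-in rule (not taken from the paper): a feature that sits
    # much farther from every prototype than its counterpart is suspicious.
    if dist_r - dist_a > margin:
        return label_a, "robust"      # switch to the auxiliary model
    if dist_a - dist_r > margin:
        return label_r, "auxiliary"   # keep the fine-tuned model
    return label_r, None              # default: fine-tuned model


# Toy two-class example with hand-picked 2-D features.
prototypes = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
clean = switched_prediction([0.9, 0.1], [0.95, 0.05], prototypes)
attacked = switched_prediction([0.5, 0.6], [0.9, 0.1], prototypes)
print(clean, attacked)
```

In the toy example, the first call leaves the default prediction untouched because the two features are about equally close to a prototype, while the second call switches to the auxiliary model because the fine-tuned model's feature has drifted away from every prototype. A real system would need a calibrated threshold and the paper's actual distance criterion.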
Koshiro Toishi
Keisuke Maeda
Ren Togo
Applied Sciences
Hokkaido University
Toishi et al. studied this question.
www.synapsesocial.com/papers/69e47440010ef96374d8ffeb — DOI: https://doi.org/10.3390/app16083894