April 30, 2024 · Open Access

Revisiting the Adversarial Robustness of Vision Language Models: a Multimodal Perspective

Abstract

Pretrained vision-language models (VLMs) like CLIP have shown impressive generalization performance across various downstream tasks, yet they remain vulnerable to adversarial attacks. While prior research has primarily concentrated on improving the adversarial robustness of image encoders to guard against attacks on images, text-based and multimodal attacks have largely been overlooked. In this work, we present what is, to our knowledge, the first comprehensive study of adapting vision-language models for adversarial robustness under multimodal attacks. First, we introduce a multimodal attack strategy and investigate the impact of different attacks. We then propose a multimodal contrastive adversarial training loss, which aligns the clean and adversarial text embeddings with the adversarial and clean visual features, to enhance the adversarial robustness of both the image and text encoders of CLIP. Extensive experiments on 15 datasets across two tasks demonstrate that our method significantly improves the adversarial robustness of CLIP. Interestingly, we find that the model fine-tuned against multimodal adversarial attacks is more robust than its counterpart fine-tuned solely against image-based attacks, even under image-only attacks, which may open up new possibilities for enhancing the security of VLMs.
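The abstract describes the training loss only in words. As a rough illustration of the idea of aligning adversarial visual features with clean text embeddings and clean visual features with adversarial text embeddings, the following minimal sketch applies a symmetric CLIP-style contrastive loss to both cross pairings. It is not the authors' implementation: the function names, the equal weighting of the two terms, and the temperature value are assumptions made for illustration, and the adversarial embeddings are taken as given rather than produced by an actual attack.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric CLIP-style contrastive loss for a batch of paired embeddings.

    image_embs[i] and text_embs[i] are treated as the matching pair.
    The temperature is an illustrative assumption.
    """
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    # logits[i][j]: cosine similarity of image i and text j, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    def diagonal_cross_entropy(rows):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)
            log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    image_to_text = diagonal_cross_entropy(logits)
    text_to_image = diagonal_cross_entropy(columns)
    return 0.5 * (image_to_text + text_to_image)


def multimodal_adversarial_loss(img_clean, img_adv, txt_clean, txt_adv):
    """Hypothetical cross-pairing: adversarial images with clean text,
    and clean images with adversarial text, weighted equally (an assumption)."""
    return contrastive_loss(img_adv, txt_clean) + contrastive_loss(img_clean, txt_adv)


# Toy batch of two pairs; the "adversarial" embeddings are slightly perturbed copies.
img_clean = [[1.0, 0.0], [0.0, 1.0]]
img_adv = [[0.9, 0.1], [0.1, 0.9]]
txt_clean = [[1.0, 0.0], [0.0, 1.0]]
txt_adv = [[0.8, 0.2], [0.2, 0.8]]
loss = multimodal_adversarial_loss(img_clean, img_adv, txt_clean, txt_adv)
```

In practice the embeddings would come from the image and text encoders of a model such as CLIP, and the loss would be minimized during fine-tuning; the paper should be consulted for the actual formulation.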

Cite this study

Zhou et al. (2024) studied this question.

www.synapsesocial.com/papers/68e6cdf2b6db64358764bd1b — DOI: https://doi.org/10.48550/arxiv.2404.19287

Authors

Wanqi Zhou

Shuanghao Bai

Qibin Zhao
