Vision-Language Models (VLMs) such as CLIP have demonstrated remarkable capabilities in understanding relationships between visual and textual data through joint embedding spaces. Despite their effectiveness, these models remain vulnerable to adversarial attacks, particularly in the image modality, posing significant security concerns. Building upon our previous work on Adversarial Prompt Tuning (AdvPT), which introduced learnable text prompts to enhance adversarial robustness in VLMs without extensive parameter training, we present a significant extension by introducing the Neural Augmentor framework for Multi-modal Adversarial Prompt Tuning (NAP-Tuning). As a significant extension, NAP-Tuning first establishes a comprehensive multi-modal (text and visual) and multi-layer prompting framework. The core of this framework is a targeted structural augmentation for feature-level purification, implemented through our Neural Augmentor approach. This framework implements feature purification by incorporating TokenRefiners-lightweight neural modules that learn to reconstruct purified features via residual connections-to directly address distortions in the feature space. This structural intervention is what enables the multi-modal and multi-layer system to effectively perform modality-specific and layer-specific feature rectification. Comprehensive experiments demonstrate that NAP-Tuning significantly outperforms existing methods across various datasets and attack types. Notably, our approach shows significant improvements over the strongest baselines under the challenging AutoAttack benchmark, outperforming them by 32.3% on ViT-B16 and 31.3% on ViT-B32 architectures while maintaining competitive clean accuracy. This work highlights the efficacy of internal feature-level intervention in prompt tuning for adversarial robustness, moving beyond input-side alignment approaches to create an adaptive defense mechanism that can identify and rectify adversarial perturbations across embedding spaces.
Building similarity graph...
Analyzing shared references across papers
Loading...
Jiaming Zhang
Xin Wang
Y. Ma
IEEE Transactions on Pattern Analysis and Machine Intelligence
Fudan University
University of Naples Federico II
Beijing Jiaotong University
Building similarity graph...
Analyzing shared references across papers
Loading...
Zhang et al. (Thu,) studied this question.
www.synapsesocial.com/papers/6980ff19c1c9540dea811d46 — DOI: https://doi.org/10.1109/tpami.2026.3659598
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: