In recent years, with the rapid development of cross-modal learning, pretrained models such as CLIP have demonstrated powerful zero-shot capabilities in image-text alignment tasks, making them central to multimodal research. However, a key challenge remains: how to transfer these capabilities to downstream tasks effectively while preserving the strengths of CLIP. To address this, we propose a parameter-efficient multi-task fine-tuning framework, Multi-Task CLIP-Adapter. By inserting lightweight Adapter modules after the frozen CLIP encoder, our method enables unified adaptation across multiple tasks, including classification, image-text retrieval, and regression. Experimental results show that our approach achieves an 8% to 12% performance improvement with less than 0.2% additional parameters, while maintaining the original model's zero-shot capability. Compared to the original CLIP and conventional transfer strategies, the Multi-Task CLIP-Adapter offers significant advantages in parameter efficiency and task generalization, opening a new path for scalable applications of large multimodal models.
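The adapter-after-frozen-encoder design described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the abstract does not specify the adapter's internals, so the bottleneck layout, the residual blend weight `alpha`, the zero initialization of the up-projection, the linear task heads, and all dimensions (`EMBED_DIM`, `BOTTLENECK`, `NUM_CLASSES`) are assumptions chosen for illustration. Random arrays stand in for the outputs of the frozen CLIP image and text encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are hypothetical; the abstract does not state them.
EMBED_DIM = 512    # assumed width of the frozen encoder's embeddings
BOTTLENECK = 64    # assumed adapter bottleneck width
NUM_CLASSES = 10   # assumed number of classes for the classification task


def init_adapter(dim, bottleneck):
    """Create the trainable weights of one bottleneck adapter (assumed design)."""
    return {
        "w_down": rng.normal(0.0, 0.02, size=(dim, bottleneck)),
        # Zero init makes the adapter an identity map before any training.
        "w_up": np.zeros((bottleneck, dim)),
    }


def adapter_forward(features, params, alpha=0.2):
    """Add a scaled bottleneck residual to the frozen features."""
    hidden = np.maximum(features @ params["w_down"], 0.0)  # ReLU
    return features + alpha * (hidden @ params["w_up"])


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def init_heads(dim, num_classes):
    """Create one linear head per supervised task (assumed head design)."""
    return {
        "cls": rng.normal(0.0, 0.02, size=(dim, num_classes)),
        "reg": rng.normal(0.0, 0.02, size=(dim, 1)),
    }


def multi_task_forward(image_feats, text_feats, adapter, heads):
    """Run one shared adapter, then produce outputs for all three tasks."""
    z = adapter_forward(image_feats, adapter)
    return {
        "classification": z @ heads["cls"],                         # (batch, classes)
        "retrieval": l2_normalize(z) @ l2_normalize(text_feats).T,  # (batch, texts)
        "regression": (z @ heads["reg"]).squeeze(-1),               # (batch,)
    }


# Random stand-ins for frozen encoder outputs (no real CLIP model is loaded).
image_feats = rng.normal(size=(4, EMBED_DIM))
text_feats = rng.normal(size=(6, EMBED_DIM))

adapter = init_adapter(EMBED_DIM, BOTTLENECK)
heads = init_heads(EMBED_DIM, NUM_CLASSES)
outputs = multi_task_forward(image_feats, text_feats, adapter, heads)

# Only the adapter (and heads) would be trained; the encoder stays frozen.
adapter_params = sum(w.size for w in adapter.values())
```

In this sketch the zero-initialized up-projection leaves the frozen features unchanged before training, which is one plausible way to keep the original zero-shot behavior as a starting point. Whether the paper uses this initialization or a different blending scheme is not stated in the abstract.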
Ji Han studied this question.
www.synapsesocial.com/papers/68c183f09b7b07f3a060f830 — DOI: https://doi.org/10.54254/2755-2721/2025.bj26532
Ji Han
Applied and Computational Engineering
Harbin University of Science and Technology