In recent years, with the rapid development of cross-modal learning, pretrained models such as CLIP have demonstrated powerful zero-shot capabilities in image-text alignment tasks, making them central to multimodal research. However, a key challenge remains: how to effectively transfer these capabilities while preserving the strengths of CLIP. To address this, we propose a parameter-efficient multi-task fine-tuning framework, Multi-Task CLIP-Adapter. By inserting lightweight Adapter modules after the frozen CLIP encoder, our method enables unified adaptation across multiple tasks, including classification, image-text retrieval, and regression. Experimental results show that our approach achieves an 8% to 12% performance improvement with less than 0.2% additional parameters, while maintaining the original model's zero-shot capability. Compared to the original CLIP and conventional transfer strategies, the Multi-Task CLIP-Adapter offers significant advantages in parameter efficiency and task generalization, paving a new path for scalable applications of large multimodal models.
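To give a rough sense of how a sub-0.2% parameter budget can arise, the sketch below counts the trainable parameters of one possible configuration. It is not the paper's actual design: the bottleneck-style adapter, the embedding width of 512, the bottleneck width of 64, the head sizes, and the backbone size of roughly 151 million parameters (an approximate figure for CLIP ViT-B/32) are all assumptions chosen for illustration, and retrieval is assumed to reuse the adapted embeddings without a head of its own.

```python
# Illustrative parameter budget for adapters placed after a frozen CLIP encoder.
# Every size below is an assumption; the abstract does not state the adapter
# design, the backbone variant, or the task head shapes.

EMBED_DIM = 512                 # assumed CLIP joint embedding width
BOTTLENECK_DIM = 64             # assumed adapter bottleneck width
BACKBONE_PARAMS = 151_000_000   # assumed approximate size of the frozen backbone
NUM_CLASSES = 100               # hypothetical classification label count


def bottleneck_adapter_params(embed_dim: int, bottleneck_dim: int) -> int:
    """Weights and biases of a down-projection followed by an up-projection."""
    down = embed_dim * bottleneck_dim + bottleneck_dim
    up = bottleneck_dim * embed_dim + embed_dim
    return down + up


def linear_head_params(embed_dim: int, out_dim: int) -> int:
    """Weights and biases of a single linear task head."""
    return embed_dim * out_dim + out_dim


def overhead_fraction(trainable: int, frozen: int) -> float:
    """Trainable parameters as a fraction of the frozen backbone."""
    return trainable / frozen


# One adapter per modality (image and text), both assumed to share one shape.
adapters = 2 * bottleneck_adapter_params(EMBED_DIM, BOTTLENECK_DIM)

# Classification and regression get linear heads; retrieval is assumed to
# compare adapted embeddings directly and adds no parameters.
heads = linear_head_params(EMBED_DIM, NUM_CLASSES) + linear_head_params(EMBED_DIM, 1)

trainable = adapters + heads
fraction = overhead_fraction(trainable, BACKBONE_PARAMS)
print(f"trainable parameters: {trainable}")
print(f"overhead relative to frozen backbone: {100 * fraction:.3f}%")
```

Under these assumed sizes the overhead stays below the 0.2% reported in the abstract; a wider bottleneck or larger heads would raise it proportionally.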
Ji Han
Applied and Computational Engineering
Harbin University of Science and Technology
www.synapsesocial.com/papers/68c183f09b7b07f3a060f830 — DOI: https://doi.org/10.54254/2755-2721/2025.bj26532