Instruction tuning has become a widely adopted approach for aligning large multimodal models (LMMs) with human intent. It enables multi-task joint training through unified data formats. However, as new vision-language tasks constantly emerge, exhaustive joint training on all tasks becomes impractical. Continual learning offers a more flexible and resource-efficient alternative, enabling incremental training of LMMs on emerging tasks. This study investigates two fundamental questions that arise when applying continual learning to instruction tuning of LMMs: (1) Do LMMs suffer from catastrophic forgetting during continual instruction tuning? (2) Can existing continual learning methods be effectively applied to continual instruction tuning of LMMs? We conduct a comprehensive study to answer these questions. First, we establish the first benchmark for continual instruction tuning of LMMs and reveal the phenomenon of catastrophic forgetting in this setup. Second, we integrate and adapt traditional continual learning approaches to this setting, showing that these strategies are effective to varying degrees across scenarios. Third, we explore task-similarity dynamics between pairs of vision-language tasks and propose task-similarity-informed regularization and model expansion methods. Experimental results show that our approach consistently improves the model's performance.
Jinghan He
Haiyun Guo
Kuan Zhu
IEEE Transactions on Image Processing
Chinese Academy of Sciences
Institute of Automation
He et al. studied these questions.
www.synapsesocial.com/papers/69b64ccdb42794e3e660dfc3 — DOI: https://doi.org/10.1109/tip.2026.3671616