March 3, 2026

Adaptive Progressive Fine-Tuning of VLMs for Long-Tailed Multimodal Retrieval

Key Points

APFT significantly improves text-to-image retrieval on challenging long-tailed labels, achieving a 19.9% relative improvement in mAP@10 over the strongest baseline.
Adaptive layer unfreezing is driven by real-time training-stability metrics such as loss volatility and performance plateaus, so phase transitions are triggered dynamically rather than on a fixed schedule.
At each phase transition, the cosine annealing scheduler is re-initialized and weight decay is increased to regularize the newly unfrozen parameters.
The study suggests that APFT adapts large models to specialized domains while preserving their pretrained knowledge.
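
As an illustration of the third point, the following is a minimal sketch of a cosine-annealed learning rate that restarts at each phase, paired with a weight decay that grows per phase. The source does not give the actual schedule or constants, so the function names, the geometric growth rule, and the values `base_wd=0.01` and `growth=1.5` are assumptions chosen only for this example.

```python
import math


def cosine_lr(step_in_phase, phase_steps, lr_max, lr_min=0.0):
    """Cosine-annealed learning rate within a single phase.

    step_in_phase is counted from the most recent phase transition, so
    resetting it to 0 "re-initializes" the schedule at lr_max.
    """
    progress = min(step_in_phase, phase_steps) / phase_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))


def weight_decay_for_phase(phase, base_wd=0.01, growth=1.5):
    """Weight decay that increases with the phase index.

    Geometric growth is an assumption; the source only says that weight
    decay is adaptively increased after each transition.
    """
    return base_wd * growth ** phase
```

Under these assumptions, the learning rate falls from `lr_max` to `lr_min` over `phase_steps` steps, then jumps back to `lr_max` when a new group of layers is unfrozen, while the weight decay applied in that new phase is larger than in the previous one.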

Abstract

Adapting large vision-language models (VLMs) to specialized, long-tailed domains requires a careful balance between performance and the preservation of pretrained knowledge. Although full-parameter fine-tuning is powerful, it is resource-intensive and can easily overfit on imbalanced data. We propose Adaptive Progressive Fine-Tuning (APFT), a strategy that automates this complex process. APFT employs a staged layer-unfreezing process guided by an event-triggered mechanism: instead of relying on a fixed schedule, phase transitions are initiated automatically based on real-time training-stability metrics such as loss volatility and performance plateaus. Upon each transition, a cosine annealing scheduler is re-initialized and weight decay is adaptively increased to regularize the newly trainable parameters. Experiments on the long-tailed HISTORY-X4 archival dataset indicate that APFT significantly outperforms all baselines, including full fine-tuning and LoRA. The advantage is most pronounced on tail labels, where APFT achieves a 19.9% relative improvement in text-to-image mAP@10 over the strongest baseline, demonstrating its ability to adapt to new domains while preserving foundational knowledge.
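
The event-triggered mechanism described in the abstract can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the class `PhaseTrigger`, the helper `run_schedule`, the use of a rolling standard deviation as the volatility measure, the rule that a transition needs both a stable loss and a plateaued validation metric, and every threshold value are assumptions made for this example.

```python
from collections import deque
from statistics import pstdev


class PhaseTrigger:
    """Signals a phase transition when training looks stable and stalled.

    Hypothetical rule: loss volatility (rolling std. dev.) is below a
    threshold AND the validation metric has not improved for `patience`
    consecutive evaluations.
    """

    def __init__(self, window=5, vol_threshold=0.02, patience=3, min_delta=1e-3):
        self.losses = deque(maxlen=window)
        self.vol_threshold = vol_threshold
        self.patience = patience
        self.min_delta = min_delta
        self.best_metric = float("-inf")
        self.stale_evals = 0

    def update(self, train_loss, val_metric):
        """Record one evaluation; return True if a transition should fire."""
        self.losses.append(train_loss)
        if val_metric > self.best_metric + self.min_delta:
            self.best_metric = val_metric
            self.stale_evals = 0
        else:
            self.stale_evals += 1
        window_full = len(self.losses) == self.losses.maxlen
        stable = window_full and pstdev(self.losses) < self.vol_threshold
        plateaued = self.stale_evals >= self.patience
        return stable and plateaued

    def reset(self):
        """Clear per-phase state after a transition (best metric is kept)."""
        self.losses.clear()
        self.stale_evals = 0


def run_schedule(history, layer_groups, trigger):
    """Replay (loss, metric) pairs and unfreeze one layer group per trigger."""
    unfrozen = [layer_groups[0]]  # the first group is trainable from the start
    next_group = 1
    transitions = []
    for step, (loss, metric) in enumerate(history):
        if trigger.update(loss, metric) and next_group < len(layer_groups):
            unfrozen.append(layer_groups[next_group])
            next_group += 1
            transitions.append(step)
            trigger.reset()
    return unfrozen, transitions


# Usage with made-up numbers: loss settles and the metric stalls at 0.35.
history = [(1.0, 0.10), (0.6, 0.20), (0.4, 0.30), (0.3, 0.35), (0.3, 0.35),
           (0.3, 0.35), (0.3, 0.35), (0.3, 0.35), (0.3, 0.35)]
groups = ["projection_head", "top_encoder_blocks", "lower_encoder_blocks"]
unfrozen, transitions = run_schedule(history, groups, PhaseTrigger(window=3))
```

In a real training loop, the point where `run_schedule` appends a new group is where the corresponding parameters would be set trainable, the cosine schedule restarted, and the weight decay raised. The group names above are placeholders rather than layers named in the paper.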
