While pre-trained models, such as large language models, can achieve high performance with minimal fine-tuning, the source datasets used for pre-training often contain irrelevant or redundant data, which can degrade performance on target tasks. Domain Adaptation Information Gain (DAIG)-based source data selection improves performance by pre-training on source data selected based on rough prior knowledge obtained from target data in advance. However, DAIG’s key component, the transition matrix, lacks flexibility and is limited to handling only single-label classification tasks. To address this limitation, we propose the Generalized DAIG (GDAIG)-guided selection process, a novel framework that extends DAIG to support multi-label classification. GDAIG introduces a soft transition matrix to capture inter-label dependencies and employs binary cross-entropy loss to enable adaptation to multi-label data. By leveraging “rough prior knowledge” from initial training on target data, GDAIG actively selects informative and task-relevant source data for pre-training. Experiments on medical image and general object classification datasets demonstrate that GDAIG consistently outperforms baseline approaches, with particularly significant improvements in scenarios involving label mismatch between source and target domains (partial or no label overlap), where conventional transfer learning methods suffer from noise caused by irrelevant source labels. These results highlight GDAIG’s ability to enhance the effectiveness of pre-trained models through strategic source data selection, thereby optimizing performance for specific target tasks. Our framework goes beyond existing approaches that rely solely on pre-trained models, emphasizing the direct utilization of task-relevant source data. Furthermore, GDAIG provides a practical and effective solution for domains with scarce labeled data, such as medical image analysis.
• A GDAIG-guided data selection strategy for multi-label classification is proposed.
• GDAIG improves target model performance through task-relevant multi-label data selection.
• A probabilistic transition matrix captures inter-label dependencies.
• “Rough prior” from target data effectively guides source data pre-training.
• GDAIG outperforms conventional baselines across diverse multi-label scenarios.
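The two ingredients named above, a soft transition matrix and binary cross-entropy loss, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the matrix is estimated, so the sketch assumes one plausible reading in which a model first trained on target data (the "rough prior") scores each source sample, and entry T[s][t] is the mean predicted probability of target label t over source samples carrying source label s. The function names and the toy data are hypothetical.

```python
import math

def soft_transition_matrix(source_labels, target_probs):
    """Assumed estimator (not from the paper): T[s][t] is the mean predicted
    probability of target label t over source samples with source label s."""
    n_src = len(source_labels[0])
    n_tgt = len(target_probs[0])
    sums = [[0.0] * n_tgt for _ in range(n_src)]
    counts = [0] * n_src
    for labels, probs in zip(source_labels, target_probs):
        for s, active in enumerate(labels):
            if active:
                counts[s] += 1
                for t, p in enumerate(probs):
                    sums[s][t] += p
    # Rows for source labels that never occur are left at zero.
    return [
        [v / counts[s] if counts[s] else 0.0 for v in row]
        for s, row in enumerate(sums)
    ]

def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean per-label binary cross-entropy for one multi-label sample."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)

# Toy data (hypothetical): three source samples, two source labels,
# and the prior model's predicted probabilities for two target labels.
source_labels = [[1, 0], [1, 1], [0, 1]]
target_probs = [[0.9, 0.1], [0.7, 0.3], [0.2, 0.8]]
T = soft_transition_matrix(source_labels, target_probs)
```

Because each row of T holds independent per-label probabilities rather than a distribution that sums to one, a source label can map to several target labels at once, which is what makes the matrix usable in a multi-label setting.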
Kanyu Miyoshi
Ryotaro Shimizu
Linxin Song
Neurocomputing
University of Southern California
Waseda University
Miyoshi et al. studied this question.
www.synapsesocial.com/papers/69bf8692f665edcd009e8ecf — DOI: https://doi.org/10.1016/j.neucom.2026.133405