What question did this study set out to answer?

The central aim is to enhance pre-training methods for multi-label classification by selecting more relevant source data.

March 22, 2026 · Open Access

Optimizing pre-training for multi-label classification via generalized target-aware source data selection

Key Points

The central aim is to enhance pre-training methods for multi-label classification by selecting more relevant source data.
Proposed Generalized DAIG (GDAIG) framework for source data selection
Developed a soft transition matrix to consider inter-label dependencies
Utilized binary cross-entropy loss for multi-label adaptation
Conducted experiments on medical image and general object datasets
GDAIG consistently outperforms baseline methods in multi-label classification
Significant improvements observed in cases with label mismatches
Enhanced performance due to strategic selection of informative source data

Abstract

While pre-trained models, such as large language models, can achieve high performance with minimal fine-tuning, the source datasets used for pre-training often contain irrelevant or redundant data, which can degrade performance on target tasks. Domain Adaptation Information Gain (DAIG)-based source data selection improves performance by pre-training on source data selected based on rough prior knowledge obtained from target data in advance. However, DAIG’s key component, the transition matrix, lacks flexibility and is limited to handling only single-label classification tasks. To address this limitation, we propose the Generalized DAIG (GDAIG)-guided selection process, a novel framework that extends DAIG to support multi-label classification. GDAIG introduces a soft transition matrix to capture inter-label dependencies and employs binary cross-entropy loss to enable adaptation to multi-label data. By leveraging “rough prior knowledge” from initial training on target data, GDAIG actively selects informative and task-relevant source data for pre-training. Experiments on medical image and general object classification datasets demonstrate that GDAIG consistently outperforms baseline approaches, with particularly significant improvements in scenarios involving label mismatch between source and target domains (partial or no label overlap), where conventional transfer learning methods suffer from noise caused by irrelevant source labels. These results highlight GDAIG’s ability to enhance the effectiveness of pre-trained models through strategic source data selection, thereby optimizing performance for specific target tasks. Our framework goes beyond existing approaches that rely solely on pre-trained models, emphasizing the direct utilization of task-relevant source data. Furthermore, GDAIG provides a practical and effective solution for domains with scarce labeled data, such as medical image analysis.
• A GDAIG-guided data selection strategy for multi-label classification is proposed.
• GDAIG improves target model performance through task-relevant multi-label data selection.
• A probabilistic transition matrix captures inter-label dependencies.
• “Rough prior” from target data effectively guides source data pre-training.
• GDAIG outperforms conventional baselines across diverse multi-label scenarios.
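The abstract names two ingredients without defining them: a soft (probabilistic) transition matrix between source and target labels, and a binary cross-entropy loss for multi-label data. The sketch below illustrates one plausible reading and is not the paper's formulation. The function names `bce_loss` and `soft_transition_matrix` are hypothetical, and the estimator used here (averaging a target-trained model's predicted probabilities over the source samples that carry each source label) is an assumption made purely for illustration.

```python
import math


def bce_loss(probs, targets, eps=1e-12):
    """Mean binary cross-entropy over the labels of one sample.

    probs:   predicted probability for each label, each in [0, 1]
    targets: multi-hot ground truth (0 or 1 per label)
    """
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)


def soft_transition_matrix(source_labels, target_probs, n_source):
    """Illustrative (assumed) estimator of a soft transition matrix.

    source_labels: multi-hot source labels, one list per source sample
    target_probs:  per-sample target-label probabilities, e.g. produced by
                   a model first trained on the target data
    Row i holds the average target-label probabilities over all source
    samples that carry source label i (all zeros if no sample carries it).
    """
    n_target = len(target_probs[0])
    sums = [[0.0] * n_target for _ in range(n_source)]
    counts = [0] * n_source
    for labels, probs in zip(source_labels, target_probs):
        for i, present in enumerate(labels):
            if present:
                counts[i] += 1
                for j, p in enumerate(probs):
                    sums[i][j] += p
    return [
        [s / counts[i] if counts[i] else 0.0 for s in sums[i]]
        for i in range(n_source)
    ]


# Toy data: 3 source samples, 2 source labels, 3 target labels.
source_labels = [[1, 0], [1, 1], [0, 1]]
target_probs = [[0.9, 0.1, 0.2], [0.7, 0.3, 0.4], [0.1, 0.8, 0.6]]
matrix = soft_transition_matrix(source_labels, target_probs, n_source=2)
for row in matrix:
    print([round(v, 3) for v in row])
```

Because a source sample may carry several labels at once, it contributes to several rows, which is one simple way a matrix of this kind could reflect co-occurring labels. Each entry is an independent per-label probability, consistent with a binary cross-entropy objective, so rows need not sum to 1.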

Authors

Kanyu Miyoshi

Ryotaro Shimizu

Linxin Song

Journals

Neurocomputing

Institutions

University of Southern California

Waseda University
