Key points are not available for this paper at this time.
Data selection has emerged as a core issue for large-scale visual-language model pretaining (e. g. , CLIP), particularly with noisy web-curated datasets. Three main data selection approaches are: (1) leveraging external non-CLIP models to aid data selection, (2) training new CLIP-style embedding models that are more effective at selecting high-quality data than the original OpenAI CLIP model, and (3) designing better metrics or strategies universally applicable to any CLIP embedding without requiring specific model properties (e. g. , CLIPScore is one popular metric). While the first two approaches have been extensively studied, the third remains under-explored. In this paper, we advance the third approach by proposing two new methods. Firstly, instead of classical CLIP scores that only consider the alignment between two modalities from a single sample, we introduce negCLIPLoss, a CLIP loss-inspired method that adds the alignment between one sample and its contrastive pairs as an extra normalization term for better quality measurement. Secondly, when downstream tasks are known, we propose a new norm-based metric, NormSim, to measure the similarity between pretraining data and target data. We test our methods on the data selection benchmark, DataComp~gadre2023datacomp. Compared to the best baseline using only OpenAI's CLIP-L/14, our methods achieve a 5. 3\% improvement on ImageNet-1k and a 2. 8\% improvement on 38 downstream evaluation tasks. Moreover, both negCLIPLoss and NormSim are compatible with existing techniques. By combining our methods with the current best methods DFN~fang2023data and HYPE~kim2024hype, we can boost average performance on downstream tasks by 0. 9\%, achieving a new state-of-the-art.
Building similarity graph...
Analyzing shared references across papers
Loading...
Wang et al. (Wed,) studied this question.
www.synapsesocial.com/papers/68e67f72b6db64358760922a — DOI: https://doi.org/10.48550/arxiv.2405.19547
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context:
Yiping Wang
Yifang Chen
Wendan Yan
Building similarity graph...
Analyzing shared references across papers
Loading...