May 29, 2024 · Open Access

CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning

Abstract

Data selection has emerged as a core issue for large-scale visual-language model pretraining (e.g., CLIP), particularly with noisy web-curated datasets. Three main data selection approaches are: (1) leveraging external non-CLIP models to aid data selection, (2) training new CLIP-style embedding models that are more effective at selecting high-quality data than the original OpenAI CLIP model, and (3) designing better metrics or strategies universally applicable to any CLIP embedding without requiring specific model properties (e.g., CLIPScore is one popular metric). While the first two approaches have been extensively studied, the third remains under-explored. In this paper, we advance the third approach by proposing two new methods. First, instead of classical CLIP scores that only consider the alignment between the two modalities of a single sample, we introduce negCLIPLoss, a CLIP loss-inspired method that adds the alignment between one sample and its contrastive pairs as an extra normalization term for better quality measurement. Second, when downstream tasks are known, we propose a new norm-based metric, NormSim, to measure the similarity between pretraining data and target data. We test our methods on the data selection benchmark DataComp (Gadre et al., 2023). Compared to the best baseline using only OpenAI's CLIP-L/14, our methods achieve a 5.3% improvement on ImageNet-1k and a 2.8% improvement on 38 downstream evaluation tasks. Moreover, both negCLIPLoss and NormSim are compatible with existing techniques. By combining our methods with the current best methods, DFN (Fang et al., 2023) and HYPE (Kim et al., 2024), we can boost average performance on downstream tasks by 0.9%, achieving a new state of the art.
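The two scores described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the single-batch normalization, the temperature value of 0.01, the use of absolute cosine similarities, and the choice of p are all assumptions made for illustration, and the abstract does not specify them. The code uses only the standard library and treats embeddings as plain lists of floats. `neg_clip_loss` scores each image-text pair by its own alignment minus a normalization term computed from its similarity to the other pairs in the batch; `norm_sim` scores one candidate embedding by the p-norm of its similarities to a set of target embeddings.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def logsumexp(values):
    """Numerically stable log(sum(exp(values)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def neg_clip_loss(image_embs, text_embs, temperature=0.01):
    """Per-sample negative CLIP loss over one batch, rescaled by the temperature.

    Assumption: the score is the pair's own similarity minus a
    normalization term averaged over the image-to-text and
    text-to-image directions. Higher means better aligned.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = similarity of image i and text j, divided by temperature
    logits = [[dot(imgs[i], txts[j]) / temperature for j in range(n)]
              for i in range(n)]
    scores = []
    for i in range(n):
        row = logits[i]                          # image i against every text
        col = [logits[j][i] for j in range(n)]   # text i against every image
        normalizer = 0.5 * (logsumexp(row) + logsumexp(col))
        scores.append(temperature * (logits[i][i] - normalizer))
    return scores

def norm_sim(candidate_emb, target_embs, p=2.0):
    """p-norm of the absolute cosine similarities between a candidate and all targets."""
    c = normalize(candidate_emb)
    sims = [abs(dot(c, normalize(t))) for t in target_embs]
    if math.isinf(p):
        return max(sims)
    return sum(s ** p for s in sims) ** (1.0 / p)

# Toy batch: pair 0 is matched; pairs 1 and 2 have their captions swapped.
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
texts = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
scores = neg_clip_loss(images, texts)
```

In this toy batch the matched pair receives a higher score than the two mismatched pairs, so ranking by the score and keeping the top fraction would drop the mismatched samples. How the scores are thresholded or combined with NormSim in practice is described in the paper rather than in this abstract.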

Cite this study

Wang et al. (2024) studied this question.

www.synapsesocial.com/papers/68e67f72b6db64358760922a — DOI: https://doi.org/10.48550/arxiv.2405.19547

Also consider

Synapse has enriched 5 closely related papers on similar research questions. Consider them for comparative context:

Data-Efficient Contrastive Language-Image Pretraining: Prioritizing Data Quality over Quantity · 2024 · 3 citations
SoftCLIP: Softer Cross-Modal Alignment Makes CLIP Stronger · 2024 · 37 citations
CLIP the Bias: How Useful is Balancing Data in Multimodal Learning? · 2024 · 2 citations
CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment · 2025
Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment

Authors

Yiping Wang

Yifang Chen

Wendan Yan
