February 26, 2024 · Open Access

Vision-Language Models for Vision Tasks: A Survey

Abstract

Most visual recognition studies rely heavily on crowd-labelled data for training deep neural networks (DNNs), and they usually train a separate DNN for each visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address these two challenges, Vision-Language Models (VLMs) have been intensively investigated recently. VLMs learn rich vision-language correlations from web-scale image-text pairs, which are almost infinitely available on the Internet, and enable zero-shot predictions on various visual recognition tasks with a single model. This paper provides a systematic review of vision-language models for various visual recognition tasks, including: (1) the background, which introduces the development of visual recognition paradigms; (2) the foundations of VLMs, which summarize the widely adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely adopted datasets used in VLM pre-training and evaluation; (4) a review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis, and discussion of the reviewed methods; and (6) several research challenges and potential research directions that could be pursued in future VLM studies for visual recognition.

Cite this study

Zhang et al. (2024) studied this question.

www.synapsesocial.com/papers/68e777acb6db6435876ec960 — DOI: https://doi.org/10.1109/tpami.2024.3369699

Also consider

Synapse has enriched 2 closely related papers on similar research questions. Consider them for comparative context:

PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining · 2022 · 42 citations
ProTeCt: Prompt Tuning for Taxonomic Open Set Classification · 2023 · 1 citation

Authors

J Zhang

Jiaxing Huang

Sheng Jin

Journals

IEEE Transactions on Pattern Analysis and Machine Intelligence

Institutions

Nanyang Technological University
