Multimodal Vision Language Models (VLMs) have emerged as a transformative topic at the intersection of computer vision and natural language processing, enabling machines to perceive and reason about the world through both visual and textual modalities. For example, models such as CLIP [180], Claude [10], and GPT-4V [228] demonstrate strong reasoning and understanding abilities on visual and textual data and outperform classical single-modality vision models on zero-shot classification [87]. Given their rapid research progress and growing popularity in applications, we provide a comprehensive survey of VLMs. Specifically, we give a systematic overview of VLMs in the following aspects: (1) model information for the major VLMs developed up to 2025; (2) the evolution of VLM architectures and the newest VLM alignment methods; (3) a summary and categorization of popular VLM benchmarks and evaluation metrics; (4) the challenges and issues faced by current VLMs, such as hallucination, alignment, and safety.
Li et al. studied this question.
www.synapsesocial.com/papers/6a0898611e0fcf4a43e8db89 — DOI: https://doi.org/10.1109/cvprw67362.2025.00147
Zongxia Li
Xiyang Wu
Hongyang Du