June 11, 2025

A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges

Abstract

Multimodal Vision Language Models (VLMs) have emerged as a transformative topic at the intersection of computer vision and natural language processing, enabling machines to perceive and reason about the world through both visual and textual modalities. For example, models such as CLIP [180], Claude [10], and GPT-4V [228] demonstrate strong reasoning and understanding abilities on visual and textual data and outperform classical single-modality vision models on zero-shot classification [87]. Given the rapid advances in VLM research and their growing popularity across applications, we provide a comprehensive survey of VLMs. Specifically, we give a systematic overview of VLMs in the following aspects: (1) model information for the major VLMs developed up to 2025; (2) the transition of VLM architectures and the newest VLM alignment methods; (3) a summary and categorization of the popular benchmarks and evaluation metrics for VLMs; (4) the challenges and issues faced by current VLMs, such as hallucination, alignment, and safety.

Cite this study

Li et al. (2025). A Survey of State of the Art Large Vision Language Models: Alignment, Benchmark, Evaluations and Challenges.

www.synapsesocial.com/papers/6a0898611e0fcf4a43e8db89 — DOI: https://doi.org/10.1109/cvprw67362.2025.00147

Authors

Zongxia Li

Xiyang Wu

Hongyang Du
