May 30, 2024 · Open Access

Pixels to Phrases: Evolution of Vision Language Models

Abstract

Vision language models (VLMs) bridge natural language understanding and visual perception, changing how we perceive and interact with visual data. This paper provides a comprehensive overview of VLMs and their applications in text-based video retrieval and manipulation. It examines how these models use transformer architectures and self-attention mechanisms to learn joint representations of text and visual inputs. The paper traces the evolution of VLM pretraining techniques such as ViLBERT, LXMERT, and VisualBERT, and reviews approaches to key tasks such as cross-modal retrieval, text-driven video manipulation, and zero-shot learning for generalizing to unseen domains. It explores how VLMs can be integrated with other technologies, such as GANs and reinforcement learning, to enhance their capabilities. The survey also covers emerging areas: multimodal architectures that combine several modalities, video-language pretraining on paired video and text data, and generative VLMs for applications such as text-to-image synthesis. Additionally, it discusses challenges and future directions related to multimodal fusion, interpretability, reasoning, and the ethical implications of developing and deploying VLMs. Overall, the paper offers a timely perspective on the rapidly evolving field of VLMs and their role in enabling multimodal AI systems.

Cite this study

Oza et al. (2024). Pixels to Phrases: Evolution of Vision Language Models.

www.synapsesocial.com/papers/68e67a9ab6db6435876048df — DOI: https://doi.org/10.36227/techrxiv.171078045.57266373/v2

Authors

Jay Oza

Gitesh Kambli

Abhijit Patil
