March 29, 2024 · Open Access

LLaVA-Gemma: Accelerating Multimodal Foundation Models with a Compact Language Model

Abstract

We train a suite of multimodal foundation models (MMFMs) using the popular LLaVA framework with the recently released Gemma family of large language models (LLMs). Of particular interest is the 2B parameter Gemma model, which provides opportunities to construct capable small-scale MMFMs. In line with findings from other papers in this space, we test the effect of ablating three design features: pretraining the connector, utilizing a more powerful image backbone, and increasing the size of the language backbone. The resulting models, which we call LLaVA-Gemma, exhibit moderate performance on an array of evaluations, but fail to improve on current comparably sized state-of-the-art (SOTA) models. Closer analysis of performance shows mixed effects: skipping pretraining tends to reduce performance, larger vision models sometimes improve performance, and increasing language model size has inconsistent effects. We publicly release training recipes, code, and weights for the LLaVA-Gemma models.
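The three ablated design features can be read as a small grid of training configurations. The following is a minimal sketch of how such a grid might be enumerated, not the authors' released recipe. The class `AblationConfig`, the function `ablation_grid`, and the vision backbone labels are hypothetical placeholders. The language backbone labels assume the two sizes in the Gemma family (2B and 7B). The abstract does not state whether every combination was actually trained.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class AblationConfig:
    """One hypothetical training configuration in the ablation grid."""
    pretrain_connector: bool  # whether the connector is pretrained first
    vision_backbone: str      # placeholder label for the image backbone
    language_backbone: str    # placeholder label for the language model


def ablation_grid():
    """Enumerate every combination of the three design features."""
    connector_options = (True, False)
    # Placeholder labels: the abstract only says "a more powerful image backbone".
    vision_options = ("baseline-vision", "larger-vision")
    # Assumed labels for the two Gemma sizes.
    language_options = ("gemma-2b", "gemma-7b")
    return [
        AblationConfig(connector, vision, language)
        for connector, vision, language in product(
            connector_options, vision_options, language_options
        )
    ]


configs = ablation_grid()
for cfg in configs:
    print(cfg)
```

With two options per feature, the grid has eight configurations. A real study might train only a subset of them.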

Cite this study

Hinck et al. (2024) studied this question.

www.synapsesocial.com/papers/68e71cc2b6db6435876969df — DOI: https://doi.org/10.48550/arxiv.2404.01331

Also consider

Synapse has enriched 5 closely related papers on similar research questions. Consider them for comparative context:

LLaVA-GM: lightweight LLaVA multimodal architecture · 2025
TinyLLaVA: A Framework of Small-scale Large Multimodal Models · 2024 · 15 citations
Amharic LLaMA and LLaVA: Multimodal LLMs for Low Resource Languages · 2024 · 2 citations
Efficient Multimodal Learning from Data-centric Perspective · 2024 · 12 citations
Scientific Reasoning: Assessment of Multimodal Generative LLMs · 2025

Authors

Musashi Hinck

Matthew Olson

David Cobbley
