March 7, 2024 | Open Access

Self-Adapting Large Visual-Language Models to Edge Devices across Visual Modalities

Abstract

Recent advancements in Vision-Language (VL) models have sparked interest in deploying them on edge devices, yet three challenges remain: handling diverse visual modalities, the cost of manual annotation, and tight computational constraints. We introduce EdgeVL, a novel framework that addresses these challenges by integrating dual-modality knowledge distillation with quantization-aware contrastive learning. This approach adapts large VL models, such as CLIP, for efficient use with both RGB and non-RGB images on resource-limited devices, without requiring manual annotations. EdgeVL not only transfers visual-language alignment capabilities to compact models but also preserves feature quality after quantization, significantly improving open-vocabulary classification performance across visual modalities. Our work represents the first systematic effort to adapt large VL models for edge deployment, showing accuracy improvements of up to 15.4% on multiple datasets and a reduction in model size of up to 93-fold.
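The ideas named in the abstract can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. It assumes paired RGB and non-RGB images of the same scene, a frozen teacher that embeds only the RGB image, and a symmetric 8-bit "fake quantization" of the student features. All function names (`fake_quantize`, `distillation_loss`, `dual_modality_loss`, `classify`) and the toy vectors are hypothetical, and a plain feature-matching loss stands in for the contrastive objective the abstract mentions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fake_quantize(vec, num_bits=8):
    """Symmetric uniform quantization followed by dequantization.

    Assumption: one scale per vector, chosen from its largest magnitude.
    """
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = max(abs(x) for x in vec)
    if max_abs == 0:
        return list(vec)
    scale = max_abs / qmax
    return [round(x / scale) * scale for x in vec]


def distillation_loss(teacher_feat, student_feat):
    """Mean squared distance between unit-normalized feature vectors."""
    t = l2_normalize(teacher_feat)
    s = l2_normalize(student_feat)
    return sum((a - b) ** 2 for a, b in zip(t, s)) / len(t)


def dual_modality_loss(teacher_rgb_feat, student_rgb_feat, student_other_feat):
    """Pull both student features toward the teacher's RGB feature.

    Assumption: the paired images show the same scene, so the teacher's
    RGB embedding serves as the label-free target for both modalities.
    """
    return (distillation_loss(teacher_rgb_feat, student_rgb_feat)
            + distillation_loss(teacher_rgb_feat, student_other_feat))


def classify(image_feat, text_feats):
    """Open-vocabulary classification: pick the label whose text embedding
    has the highest cosine similarity with the image embedding."""
    img = l2_normalize(image_feat)
    scores = {
        label: sum(a * b for a, b in zip(img, l2_normalize(feat)))
        for label, feat in text_feats.items()
    }
    return max(scores, key=scores.get)


if __name__ == "__main__":
    # Toy 2-D embeddings, purely illustrative.
    teacher_rgb = [0.9, 0.1]
    student_rgb = fake_quantize([0.8, 0.2])
    student_depth = fake_quantize([0.7, 0.3])
    print(dual_modality_loss(teacher_rgb, student_rgb, student_depth))
    print(classify(student_depth, {"chair": [1.0, 0.0], "table": [0.0, 1.0]}))
```

In this sketch no class labels appear in the loss, which mirrors the annotation-free claim. Quantizing the student features before computing the loss shows, in the simplest form, how training can account for quantization error.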

Cite this study

Cai et al. (2024). Self-Adapting Large Visual-Language Models to Edge Devices across Visual Modalities.

www.synapsesocial.com/papers/68e7567db6db6435876cdfce — DOI: https://doi.org/10.48550/arxiv.2403.04908

Authors

Kaiwen Cai

Zhekai Duan

Gaowen Liu
