June 21, 2024 · Open Access

video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models

Abstract

Speech understanding, as an element of the more generic video understanding using audio-visual large language models (av-LLMs), is a crucial yet understudied aspect. This paper proposes video-SALMONN, a single end-to-end av-LLM for video processing that can understand not only visual frame sequences, audio events, and music, but speech as well. To obtain the fine-grained temporal information required by speech understanding while remaining efficient for other video elements, this paper proposes a novel multi-resolution causal Q-Former (MRC Q-Former) structure to connect pre-trained audio-visual encoders and the backbone large language model. Moreover, dedicated training approaches, including a diversity loss and an unpaired audio-visual mixed training scheme, are proposed to avoid frame or modality dominance. On the introduced speech-audio-visual evaluation benchmark, video-SALMONN achieves more than 25% absolute accuracy improvements on the video-QA task and over 30% absolute accuracy improvements on audio-visual QA tasks with human speech. In addition, video-SALMONN demonstrates remarkable video comprehension and reasoning abilities on tasks that are unprecedented for other av-LLMs. Our training code and model checkpoints are available at https://github.com/bytedance/SALMONN/.

Cite this study

Sun et al. (2024). video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models.

www.synapsesocial.com/papers/68e63e20b6db6435875cfb8b — DOI: https://doi.org/10.48550/arxiv.2406.15704

Authors

Guangzhi Sun

Wenyi Yu

Changli Tang
