July 20, 2024 · Open Access

Audio-visual training for improved grounding in video-text LLMs

Abstract

Recent advances in multimodal LLMs have led to several video-text models being proposed for critical video-related tasks. However, most previous works support visual input only, essentially muting the audio signal in the video. The few models that support both audio and visual input are not explicitly trained on audio data. Hence, the effect of audio on video understanding is largely unexplored. To this end, we propose a model architecture that handles audio-visual inputs explicitly. We train our model with both audio and visual data from a video instruction-tuning dataset. Comparisons with vision-only baselines and other audio-visual models show that training on audio data indeed leads to improved grounding of responses. For better evaluation of audio-visual models, we also release a human-annotated benchmark dataset with audio-aware question-answer pairs.

Cite this study

Sagare et al. (2024) studied this question.

www.synapsesocial.com/papers/68e5fa6bb6db64358758ee9a — DOI: https://doi.org/10.48550/arxiv.2407.15046

Authors

Shivprasad Rajendra Sagare

S Hemachandran

Kinshuk Sarabhai
