February 17, 2024 · Open Access

Momentor: Advancing Video Large Language Models with Fine-Grained Temporal Reasoning

Abstract

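Large language models (LLMs) have shown remarkable ability to understand and handle text-based tasks. Many efforts have sought to transfer these capabilities to the video modality; the resulting models are known as Video-LLMs. However, existing Video-LLMs capture only coarse-grained semantics and cannot effectively handle tasks that require understanding or localizing specific video segments. To address these challenges, we propose Momentor, a Video-LLM capable of fine-grained temporal understanding tasks. To support the training of Momentor, we design an automatic data generation engine and use it to construct Moment-10M, a large-scale video instruction dataset with segment-level instruction data. Training Momentor on Moment-10M enables it to perform segment-level reasoning and localization. Zero-shot evaluations on several tasks show that Momentor excels at fine-grained, temporally grounded comprehension and localization.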

Authors

Qian Long

Juncheng Li

Yu Wu
