March 18, 2024 · Open Access

VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding

Abstract

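We explore how combining several foundation models (large language models and vision-language models) with a novel unified memory mechanism can address the challenging problem of video understanding, in particular capturing long-term temporal relations in lengthy videos. Specifically, the proposed multimodal agent VideoAgent 1) constructs a structured memory that stores both generic temporal event descriptions and object-centric tracking states of the video, and 2) given an input task query, uses tools including video segment localization and object memory querying, along with other visual foundation models, to solve the task interactively by drawing on the zero-shot tool-use ability of LLMs. VideoAgent performs strongly on several long-horizon video understanding benchmarks, with an average improvement over baselines of 6.6% on NExT-QA and 26.0% on EgoSchema, narrowing the gap between open-source models and private counterparts including Gemini 1.5 Pro.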
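The two-part memory and the two named tools can be pictured with a small sketch. This is a toy illustration under stated assumptions, not the authors' implementation: the class names (`Segment`, `TrackedObject`, `VideoMemory`), the function names, and the sample captions are all hypothetical, and plain word overlap stands in for whatever retrieval method the paper actually uses, which the abstract does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One video segment with a generic event description (temporal memory)."""
    index: int
    start_s: float
    end_s: float
    caption: str


@dataclass
class TrackedObject:
    """One tracked object and the segments where it appears (object memory)."""
    object_id: int
    category: str
    segment_indices: list[int] = field(default_factory=list)


@dataclass
class VideoMemory:
    """Hypothetical unified memory holding both parts."""
    segments: list[Segment] = field(default_factory=list)
    objects: list[TrackedObject] = field(default_factory=list)


def _tokens(text: str) -> set[str]:
    return set(text.lower().split())


def localize_segments(memory: VideoMemory, query: str, top_k: int = 1) -> list[Segment]:
    """Toy segment localization: rank captions by word overlap with the query.

    Assumption: word overlap is only a stand-in for a real similarity model.
    Ties are broken by segment order.
    """
    q = _tokens(query)
    ranked = sorted(
        memory.segments,
        key=lambda s: (-len(q & _tokens(s.caption)), s.index),
    )
    return ranked[:top_k]


def query_objects(memory: VideoMemory, category: str) -> list[TrackedObject]:
    """Toy object memory query: return tracked objects of a given category."""
    return [o for o in memory.objects if o.category == category]


# Made-up sample memory for a three-segment video.
memory = VideoMemory(
    segments=[
        Segment(0, 0.0, 2.0, "a person opens the fridge"),
        Segment(1, 2.0, 4.0, "a person pours milk into a cup"),
        Segment(2, 4.0, 6.0, "a person washes the cup in the sink"),
    ],
    objects=[
        TrackedObject(0, "cup", [1, 2]),
        TrackedObject(1, "fridge", [0]),
    ],
)

best = localize_segments(memory, "who pours milk")[0]
cups = query_objects(memory, "cup")
```

In an agent setting, an LLM would decide which of these tools to call for a given question and combine the results; that control loop is omitted here because the abstract does not describe its details.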

Authors

Yue Fan

Xiaojian Ma

Rujie Wu
