March 18, 2024 | Open Access

VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding

Abstract

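We explore how combining several foundation models (large language models and vision-language models) with a novel unified memory mechanism can tackle the challenging problem of video understanding, in particular capturing long-term temporal relations in lengthy videos. Specifically, the proposed multimodal agent VideoAgent: 1) constructs a structured memory that stores both generic temporal event descriptions and object-centric tracking states of the video; 2) given an input task query, solves the task interactively by using tools such as video segment localization and object memory querying, along with other visual foundation models, drawing on the zero-shot tool-use ability of LLMs. VideoAgent demonstrates impressive performance on several long-horizon video understanding benchmarks, with an average improvement over baselines of 6.6% on NExT-QA and 26.0% on EgoSchema, narrowing the gap between open-source models and private counterparts, including Gemini 1.5 Pro.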
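To make the two-part memory and the two named tools more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the class and method names (`VideoMemory`, `localize_segments`, `query_objects`) are hypothetical, captions and object tracks are entered by hand rather than produced by vision models, and plain word overlap stands in for whatever similarity measure the actual system uses.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SegmentEntry:
    """Temporal-memory record: a text description of one video segment."""
    segment_id: int
    start_s: float
    end_s: float
    caption: str


@dataclass
class ObjectEntry:
    """Object-memory record: a tracked object and the segments it appears in."""
    object_id: int
    category: str
    segment_ids: List[int] = field(default_factory=list)


class VideoMemory:
    """Hypothetical structured memory holding event descriptions and object states."""

    def __init__(self) -> None:
        self.segments: List[SegmentEntry] = []
        self.objects: List[ObjectEntry] = []

    def add_segment(self, start_s: float, end_s: float, caption: str) -> int:
        entry = SegmentEntry(len(self.segments), start_s, end_s, caption)
        self.segments.append(entry)
        return entry.segment_id

    def add_object(self, category: str, segment_ids: List[int]) -> int:
        entry = ObjectEntry(len(self.objects), category, sorted(segment_ids))
        self.objects.append(entry)
        return entry.object_id

    def localize_segments(self, query: str, top_k: int = 1) -> List[int]:
        """Tool 1 (assumed form): rank segments by word overlap with the query."""
        query_words = set(query.lower().split())

        def score(seg: SegmentEntry) -> int:
            return len(query_words & set(seg.caption.lower().split()))

        ranked = sorted(self.segments, key=score, reverse=True)
        return [seg.segment_id for seg in ranked[:top_k]]

    def query_objects(self, category: str) -> List[ObjectEntry]:
        """Tool 2 (assumed form): look up tracked objects by category."""
        return [obj for obj in self.objects if obj.category == category]


# Toy usage: in the described system an LLM would choose which tool to call.
memory = VideoMemory()
memory.add_segment(0.0, 2.0, "a person opens the fridge")
memory.add_segment(2.0, 4.0, "a person pours milk into a cup")
memory.add_object("cup", [1])
memory.add_object("cup", [0, 1])

best_segments = memory.localize_segments("who pours the milk")
cup_count = len(memory.query_objects("cup"))
```

In this toy setup a question such as "how many cups appear?" would be answered from the object memory, while "when is the milk poured?" would be answered from the temporal memory. Keeping the two stores separate is what lets a tool-calling LLM pick the right lookup for each query.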

Authors

Yue Fan

Xiaojian Ma

Rujie Wu
