What question did this study set out to answer?

The aim is to create an automated system that effectively manages video generation and editing tasks.

March 22, 2026

SPAgent: Adaptive Task Decomposition and Model Selection for General Video Generation and Editing

Key points

Developed SPAgent to coordinate multiple video models.
Used a three-step framework: intent recognition, route planning, and model selection.
Curated a multi-task generative video dataset for training.
Integrated a video quality evaluation module.
SPAgent successfully generates and edits high-quality videos.
The system shows increased versatility compared to manual orchestration.
Demonstrated adaptability across various video tasks.

Abstract

Although video generation and editing models have advanced significantly, individual models remain restricted to specific tasks, often failing to meet diverse user needs. Effectively coordinating these models in pipelines can unlock a wide range of video generation and editing capabilities. However, manual orchestration is complex, time-consuming, and requires deep expertise in model performance and limitations. To address these challenges, we propose the Semantic Planning Agent (SPAgent), a novel system that automatically coordinates state-of-the-art open-source models to fulfill complex user intents. To equip SPAgent with robust orchestration capabilities, we introduce a three-step framework: (1) decoupled intent recognition to accurately parse multi-modal inputs; (2) principle-guided route planning to design effective execution chains; and (3) capability-based model selection to identify the optimal tools for each sub-task. To facilitate training, we curate a comprehensive multi-task generative video dataset. Furthermore, we enhance SPAgent with a video quality evaluation module, enabling it to autonomously assess and incorporate new models into its tool library without human intervention. Experimental results demonstrate that SPAgent effectively coordinates models to generate and edit high-quality videos, exhibiting superior versatility and adaptability across various tasks.
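The three-step framework and the quality-gated tool library described above can be illustrated with a minimal sketch. This is not the authors' implementation: SPAgent uses a trained agent, whereas the sketch below substitutes toy keyword rules and a single scalar quality score. Every name here (`Tool`, `ToolLibrary`, `recognize_intent`, `plan_route`, `orchestrate`, the task labels, and the `min_quality` threshold) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Tool:
    name: str       # hypothetical model identifier
    task: str       # sub-task type this model handles
    quality: float  # assumed score from a quality evaluator, in [0, 1]


@dataclass
class ToolLibrary:
    min_quality: float = 0.5  # assumed admission threshold
    tools: list = field(default_factory=list)

    def register(self, tool: Tool) -> bool:
        """Admit a new model only if its evaluated quality clears the bar."""
        if tool.quality < self.min_quality:
            return False
        self.tools.append(tool)
        return True

    def select(self, task: str) -> Tool:
        """Step 3: pick the highest-scoring registered tool for a sub-task."""
        candidates = [t for t in self.tools if t.task == task]
        if not candidates:
            raise LookupError(f"no tool registered for task {task!r}")
        return max(candidates, key=lambda t: t.quality)


def recognize_intent(text: str, has_video: bool) -> dict:
    """Step 1: parse the inputs into a structured intent (toy keyword rules)."""
    lowered = text.lower()
    return {
        "edit": has_video,  # toy assumption: an input video means editing
        "stylize": "style" in lowered,
        "upscale": "upscale" in lowered or "4k" in lowered,
    }


def plan_route(intent: dict) -> list:
    """Step 2: order the sub-tasks into an execution chain."""
    route = ["video_editing" if intent["edit"] else "text_to_video"]
    if intent["stylize"]:
        route.append("style_transfer")
    if intent["upscale"]:
        route.append("super_resolution")  # upscale last, on the final frames
    return route


def orchestrate(text: str, has_video: bool, library: ToolLibrary) -> list:
    """Run steps 1-3 and return (sub-task, chosen model name) pairs."""
    intent = recognize_intent(text, has_video)
    return [(task, library.select(task).name) for task in plan_route(intent)]
```

Keeping registration separate from selection mirrors the abstract's claim that new models can join the tool library after an automatic quality check, without changing the planning logic.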

Cite This Study

Tu et al. studied this question.

synapsesocial.com/papers/69bf8692f665edcd009e8f31 https://doi.org/10.1109/tip.2026.3673949
