Although video generation and editing models have advanced significantly, individual models remain restricted to specific tasks, often failing to meet diverse user needs. Effectively coordinating these models in pipelines can unlock a wide range of video generation and editing capabilities. However, manual orchestration is complex, time-consuming, and requires deep expertise in model performance and limitations. To address these challenges, we propose the Semantic Planning Agent (SPAgent), a novel system that automatically coordinates state-of-the-art open-source models to fulfill complex user intents. To equip SPAgent with robust orchestration capabilities, we introduce a three-step framework: (1) decoupled intent recognition to accurately parse multi-modal inputs; (2) principle-guided route planning to design effective execution chains; and (3) capability-based model selection to identify the optimal tools for each sub-task. To facilitate training, we curate a comprehensive multi-task generative video dataset. Furthermore, we enhance SPAgent with a video quality evaluation module, enabling it to autonomously assess and incorporate new models into its tool library without human intervention. Experimental results demonstrate that SPAgent effectively coordinates models to generate and edit high-quality videos, exhibiting superior versatility and adaptability across various tasks.
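The three-step framework described above can be illustrated with a minimal sketch. This is not SPAgent's implementation: the `Tool` record, the `ROUTES` table, the modality-based rule in `recognize_intent`, and the numeric `score` field are all hypothetical stand-ins, chosen only to show how intent recognition, route planning, and model selection could be chained. In the actual system these decisions are made by a trained agent rather than by fixed rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    """Hypothetical entry in the tool library."""
    name: str
    task: str      # sub-task the tool handles, e.g. "text-to-image"
    score: float   # assumed capability score for that sub-task


# Hypothetical candidate routes per intent, listed in order of preference.
ROUTES = {
    "text-to-video": [["text-to-video"], ["text-to-image", "image-to-video"]],
    "image-to-video": [["image-to-video"]],
    "video-editing": [["video-editing"]],
}


def recognize_intent(text: str, has_image: bool, has_video: bool) -> str:
    """Step 1 (toy stand-in): infer the intent from the input modalities."""
    if has_video:
        return "video-editing"
    if has_image:
        return "image-to-video"
    return "text-to-video"


def plan_route(intent: str, tools: list[Tool]) -> list[str]:
    """Step 2: pick the first candidate route whose sub-tasks are all covered."""
    available = {tool.task for tool in tools}
    for route in ROUTES.get(intent, []):
        if all(task in available for task in route):
            return route
    raise ValueError(f"no feasible route for intent {intent!r}")


def select_models(route: list[str], tools: list[Tool]) -> list[Tool]:
    """Step 3: for each sub-task, choose the highest-scoring matching tool."""
    chain = []
    for task in route:
        candidates = [tool for tool in tools if tool.task == task]
        chain.append(max(candidates, key=lambda tool: tool.score))
    return chain


# Example library with no direct text-to-video tool, forcing a two-step route.
tools = [
    Tool("t2i-a", "text-to-image", 0.8),
    Tool("t2i-b", "text-to-image", 0.6),
    Tool("i2v-a", "image-to-video", 0.7),
]
intent = recognize_intent("a cat surfing", has_image=False, has_video=False)
route = plan_route(intent, tools)
chain = select_models(route, tools)
```

In this toy setting, a text-only request falls back to the text-to-image then image-to-video route because the library holds no direct text-to-video tool. Registering a newly evaluated model would amount to appending another `Tool` with its measured score.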
Tu et al. (Thu,) studied this question.