Video Diffusion Transformers (VDiTs) have achieved remarkable progress in high-quality video generation, but remain computationally expensive due to the quadratic complexity of attention over high-dimensional video sequences. Recent attention acceleration methods leverage the sparsity of attention patterns to improve efficiency; however, they often overlook inefficiencies of redundant long-range interactions. To address this problem, we propose VORTA, an acceleration framework with two novel components: 1) a sparse attention mechanism that efficiently captures long-range dependencies, and 2) a routing strategy that adaptively replaces full 3D attention with specialized sparse attention variants throughout the sampling process. It achieves a 1.76× end-to-end speedup without quality loss on VBench. Furthermore, VORTA can seamlessly integrate with various other acceleration methods, such as caching and step distillation, reaching up to 14.41× speedup with negligible performance degradation. VORTA demonstrates its efficiency and enhances the practicality of VDiTs in real-world settings.
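To make the quadratic-cost claim concrete, the sketch below counts query-key pairs for full 3D attention over a video token grid and compares it with a generic pattern in which each token attends to a fixed budget of keys. The function name `attention_pairs`, the grid dimensions, and the `window` budget are illustrative assumptions; this is not VORTA's sparse attention or routing, only a rough picture of why restricting the attended set reduces attention cost.

```python
def attention_pairs(frames, height, width, window=None):
    """Count query-key pairs for one attention head over a video token grid.

    window=None models full 3D attention: every token attends to every token.
    Otherwise each token attends to at most `window` keys (a generic sparse
    budget assumed for illustration, not the method from the abstract).
    """
    n = frames * height * width  # total number of spatio-temporal tokens
    keys_per_query = n if window is None else min(window, n)
    return n * keys_per_query


# Hypothetical latent grid: 16 frames of 30 x 52 tokens.
full = attention_pairs(16, 30, 52)
sparse = attention_pairs(16, 30, 52, window=4096)
print(f"full: {full:,} pairs, sparse: {sparse:,} pairs, ratio: {full / sparse:.2f}")
```

Doubling the number of frames quadruples the full-attention count but only doubles the fixed-budget count. The ratio here covers attention pairs only, so it should not be read as an end-to-end speedup, which also depends on the non-attention layers and kernel efficiency.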
Sun et al. studied this question.
www.synapsesocial.com/papers/68da58d8c1728099cfd111e7 — DOI: https://doi.org/10.48550/arxiv.2505.18809
Wenhao Sun
Rong-Cheng Tu
Yifu Ding