What question did this study set out to answer?

The aim is to improve the efficiency of self-attention mechanisms in video diffusion transformers by combining sparse and linear attention techniques.

April 25, 2026 | Open Access

SparseFlow-Attention: Denoising-Adaptive Sparse-Linear Hybrid Attention with Fused Kernel Co-Design for Efficient Video Diffusion Transformers

Key Points

The study aims to make self-attention in video diffusion transformers more efficient by combining sparse and linear attention.
The proposed SparseFlow-Attention hybrid design reduces computational cost while aiming to preserve generative quality.
A denoising-adaptive sparsity schedule varies temporal sparsity with the diffusion timestep: more aggressive sparsity in early steps, denser connectivity in later steps.
A fused CUDA kernel executes sparse temporal attention and linear spatial aggregation in a single pipelined launch.
Simulation-based results indicate improved memory efficiency and throughput relative to existing hybrid methods such as SALAD and SLA.
The reported gains are in FLOPs, memory traffic, and kernel overhead.
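
To make the denoising-adaptive schedule in the key points concrete, the following is a minimal sketch, not the report's actual schedule. It assumes a linear ramp from sparse to dense over the denoising steps and a frame-level mask built from a local window plus periodic anchor frames. The names keep_ratio, temporal_mask, and density, and the parameters min_keep, max_keep, and anchor_stride, are hypothetical.

```python
def keep_ratio(step, num_steps, min_keep=0.25, max_keep=1.0):
    """Fraction of temporal connectivity kept at a denoising step.

    step 0 is the first (noisiest) step, num_steps - 1 the last.
    The linear ramp and the default bounds are illustrative assumptions.
    """
    if num_steps < 2:
        return max_keep
    progress = step / (num_steps - 1)
    return min_keep + (max_keep - min_keep) * progress


def temporal_mask(num_frames, step, num_steps, anchor_stride=4):
    """Frame-level mask: mask[i][j] is True if frame i attends to frame j."""
    ratio = keep_ratio(step, num_steps)
    # The local window widens as the schedule moves toward dense attention.
    half_window = max(1, round(ratio * (num_frames - 1)))
    mask = []
    for i in range(num_frames):
        row = []
        for j in range(num_frames):
            is_local = abs(i - j) <= half_window   # temporal locality
            is_anchor = j % anchor_stride == 0     # assumed anchor frames
            row.append(is_local or is_anchor)
        mask.append(row)
    return mask


def density(mask):
    """Share of frame pairs that remain connected."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


early = temporal_mask(num_frames=16, step=0, num_steps=50)
late = temporal_mask(num_frames=16, step=49, num_steps=50)
print(density(early), density(late))
```

Under these assumptions the early-step mask keeps only nearby frames and anchors, while the final step falls back to fully dense temporal attention.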

Abstract

SparseFlow technical report / preprint. Video diffusion transformers (DiTs) achieve strong generative performance, but self-attention cost scales poorly with video length and resolution. Prior efficient-attention methods optimize either the attention algorithm or the execution kernel, but not both. Sparse and linear variants reduce nominal complexity yet yield limited end-to-end speedup under generic kernels, while FlashAttention-style methods accelerate dense attention without exploiting video-specific sparse structure. SparseFlow-Attention is proposed as a hybrid attention design for video DiTs that co-designs approximation structure and GPU execution. SparseFlow decomposes attention into two complementary components: block-sparse temporal attention across frames, motivated by temporal locality and anchor-frame structure, and linear spatial attention within frames, motivated by spatial redundancy and approximate low-rank structure. A denoising-adaptive sparsity schedule varies temporal sparsity with diffusion timestep, applying more aggressive sparsity in early steps and denser connectivity in later steps to preserve fine details. A fused CUDA kernel executes sparse temporal attention and linear spatial aggregation in a single pipelined launch, reducing memory traffic and kernel overhead relative to sequential implementations. A lightweight head-wise routing module selects the sparse or linear path per head via differentiable routing during training and hard routing at inference. Relative to hybrid attention methods such as SALAD, SLA, and VMonarch, SparseFlow's contribution lies in explicitly coupling a video-diffusion-aware hybrid operator with a fused kernel specialized for that operator. The work provides method formulation, complexity analysis, approximation and stability discussion, and simulation-based results for FLOPs, memory, and throughput, alongside analysis of observed trade-offs. 
Existing OSF archival DOI: 10.17605/OSF.IO/6EMFW; Existing OSF archival page: https://osf.io/6emfw/. Files include the technical report PDF and the LaTeX source tarball when available.
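
As a rough illustration of why the decomposition in the abstract lowers nominal cost, the sketch below counts matrix-multiply FLOPs per attention head. It is a back-of-the-envelope estimate under stated assumptions, not the report's complexity analysis: each multiply-add counts as 2 FLOPs, the sparse path attends to a fixed number of whole frames, the linear path uses the standard linear-attention ordering (K^T V first, then Q times the result) within each frame, and softmax, feature maps, routing, and memory traffic are ignored. All function and parameter names are hypothetical.

```python
def dense_flops(frames, tokens_per_frame, head_dim):
    """Dense attention over all N = frames * tokens_per_frame tokens."""
    n = frames * tokens_per_frame
    # QK^T and AV each cost 2 * N * N * head_dim FLOPs.
    return 4 * n * n * head_dim


def sparse_temporal_flops(frames, tokens_per_frame, head_dim, kept_frames):
    """Block-sparse path: every token attends to kept_frames whole frames."""
    n = frames * tokens_per_frame
    keys_per_query = kept_frames * tokens_per_frame
    return 4 * n * keys_per_query * head_dim


def linear_spatial_flops(frames, tokens_per_frame, head_dim):
    """Linear path within each frame: K^T V, then Q (K^T V)."""
    per_frame = 4 * tokens_per_frame * head_dim * head_dim
    return frames * per_frame


def hybrid_flops(frames, tokens_per_frame, head_dim,
                 heads, sparse_heads, kept_frames):
    """Each head is hard-routed to exactly one of the two paths."""
    linear_heads = heads - sparse_heads
    sparse = sparse_temporal_flops(frames, tokens_per_frame,
                                   head_dim, kept_frames)
    linear = linear_spatial_flops(frames, tokens_per_frame, head_dim)
    return sparse_heads * sparse + linear_heads * linear


# Illustrative configuration, not taken from the report.
cfg = dict(frames=16, tokens_per_frame=1024, head_dim=64)
dense_total = 16 * dense_flops(**cfg)
hybrid_total = hybrid_flops(heads=16, sparse_heads=8, kept_frames=4, **cfg)
print(f"hybrid / dense FLOP ratio: {hybrid_total / dense_total:.3f}")
```

A lower FLOP count does not by itself give an end-to-end speedup, which is the abstract's motivation for pairing the operator with a fused kernel.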

Cite this study

Haopeng Jin studied this question.

www.synapsesocial.com/papers/69ec5b2388ba6daa22dacbc0 — DOI: https://doi.org/10.5281/zenodo.19712508