We propose DiM-2, a diffusion architecture that directly exploits the Structured State Space Duality (SSD) of Mamba-2 for unified image and video generation. The design introduces (1) a Dual-Axis SSD Scanner (DASS), which decouples spatial and temporal modeling using independent SSD kernels, and (2) Semi-Separable Conditioning (SSC), which injects the diffusion timestep and conditioning signals through SSD-structured matrices. This technical report presents the architecture design, its theoretical motivation, and a proposed experimental protocol on ImageNet-256, UCF-101, and SkyTimelapse.
Hiroki Abe (Thu,) studied this question.
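
To make the duality that the abstract relies on concrete, the following is a minimal NumPy sketch of SSD for a single channel with a scalar per-step decay. It is not the DiM-2 implementation: the function names `ssd_recurrent` and `ssd_matrix`, the shapes, and the random inputs are assumptions chosen for illustration, and neither DASS nor SSC is modeled. It only checks that the linear-time recurrence and the lower-triangular semiseparable matrix form give the same output.

```python
import numpy as np

def ssd_recurrent(a, B, C, x):
    """Linear-time form: h_t = a_t * h_{t-1} + B_t * x_t,  y_t = C_t . h_t."""
    T, N = B.shape
    h = np.zeros(N)
    y = np.empty(T)
    for t in range(T):
        h = a[t] * h + B[t] * x[t]   # state update with scalar decay a_t
        y[t] = C[t] @ h              # read-out
    return y

def ssd_matrix(a, B, C, x):
    """Quadratic form: y = M x, where M is lower-triangular semiseparable."""
    log_cum = np.cumsum(np.log(a))   # assumes a_t > 0
    # decay[i, j] = a_{j+1} * ... * a_i for i >= j (equals 1 on the diagonal)
    decay = np.exp(log_cum[:, None] - log_cum[None, :])
    # M[i, j] = (C_i . B_j) * decay[i, j] for i >= j, and 0 above the diagonal
    M = np.tril(decay * (C @ B.T))
    return M @ x

# Illustrative sizes and inputs (hypothetical, not taken from the report).
rng = np.random.default_rng(0)
T, N = 16, 4
a = rng.uniform(0.5, 1.0, size=T)    # per-step scalar decay in (0, 1]
B = rng.standard_normal((T, N))
C = rng.standard_normal((T, N))
x = rng.standard_normal(T)

assert np.allclose(ssd_recurrent(a, B, C, x), ssd_matrix(a, B, C, x))
```

The recurrent form costs O(T) sequential steps, while the matrix form is a single masked matrix product. Under the assumptions above, a scanner that treats spatial and temporal axes independently would apply such a kernel along each axis separately, but how DASS does this is specified by the report, not by this sketch.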