Optimizing Inference in Large Language Diffusion Mixture-of-Experts via Hardware-Aware Kernels

This work addresses the critical performance bottlenecks in diffusion-based Mixture-of-Experts (MoE) models, specifically focusing on the Large Language Diffusion with Masking (LLaDA) architecture. Due to the iterative nature of the denoising process, standard MoE implementations suffer from significant host-device synchronization overhead and fragmented memory access. We propose FastLLaDAMoE, an optimized framework that utilizes a Sort-Compute-Scatter pipeline and expert weight stacking to ensure contiguous GPU memory access.

Experimental evaluations on NVIDIA A100 hardware demonstrate a 1.89x reduction in CUDA execution time and a 1.93x improvement in memory bandwidth utilization while maintaining full numerical parity with the baseline. By transitioning the MoE forward pass from a memory-bound, CPU-bottlenecked state to a hardware-saturated regime, this work makes large-scale iterative alignment (e.g., GRPO) computationally feasible for diffusion-based language models.
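
The Sort-Compute-Scatter idea and expert weight stacking mentioned in the abstract can be illustrated with a small CPU sketch. This is not the FastLLaDAMoE implementation: it assumes top-1 routing, a single linear layer per expert, and NumPy arrays in place of GPU tensors. The function names `moe_loop_baseline` and `moe_sort_compute_scatter` are hypothetical and chosen only for this example.

```python
import numpy as np


def moe_loop_baseline(x, expert_ids, weights):
    """Reference path: one boolean-masked gather and matmul per expert."""
    out = np.zeros((x.shape[0], weights.shape[2]), dtype=x.dtype)
    for e in range(weights.shape[0]):
        mask = expert_ids == e
        if mask.any():
            out[mask] = x[mask] @ weights[e]
    return out


def moe_sort_compute_scatter(x, expert_ids, weights):
    """Sketch of a Sort-Compute-Scatter forward pass (top-1 routing assumed).

    x          : [num_tokens, d_in] token activations
    expert_ids : [num_tokens] integer expert index per token
    weights    : [num_experts, d_in, d_out] expert weights stacked in one array
    """
    num_experts = weights.shape[0]

    # Sort: a stable argsort places tokens routed to the same expert
    # next to each other, so each expert reads one contiguous slice.
    order = np.argsort(expert_ids, kind="stable")
    x_sorted = x[order]
    counts = np.bincount(expert_ids, minlength=num_experts)
    offsets = np.concatenate(([0], np.cumsum(counts)))

    # Compute: each expert multiplies its contiguous slice by its slab
    # of the stacked weight array.
    y_sorted = np.empty((x.shape[0], weights.shape[2]), dtype=x.dtype)
    for e in range(num_experts):
        start, end = offsets[e], offsets[e + 1]
        if end > start:
            y_sorted[start:end] = x_sorted[start:end] @ weights[e]

    # Scatter: write results back to the original token positions.
    out = np.empty_like(y_sorted)
    out[order] = y_sorted
    return out


rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
expert_ids = rng.integers(0, 4, size=16)
weights = rng.standard_normal((4, 8, 5))

# Both paths should agree, mirroring the numerical-parity check in the abstract.
print(np.allclose(moe_sort_compute_scatter(x, expert_ids, weights),
                  moe_loop_baseline(x, expert_ids, weights)))
```

On a GPU, the per-expert loop in the compute step would typically be replaced by a batched or grouped matrix multiply over the stacked weights, and the offsets would stay on the device to avoid host-device synchronization. Those details are outside what this sketch shows.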
Alexey Manakonov
Alexey Manakonov (Thu,) studied this question.
www.synapsesocial.com/papers/69994cd2873532290d021a1a — DOI: https://doi.org/10.5281/zenodo.18704883