What question did this study set out to answer?

The study asks how to speed up inference in the LLaDA-MoE architecture by removing two bottlenecks in the MoE forward pass: host-device synchronization overhead and fragmented memory access.

February 21, 2026 | Open Access

Hardware-Saturated Denoising: Accelerating LLaDA-MoE via Permuted Expert Dispatch with benchmark data for gsm8k

Key Points

Proposed the FastLLaDAMoE framework to optimize the MoE forward pass during inference
Used a Sort-Compute-Scatter pipeline for contiguous GPU memory access
Implemented expert weight stacking for contiguous memory access
Conducted experiments on NVIDIA A100 hardware
Achieved a 1.89x reduction in CUDA execution time
Improved memory bandwidth utilization by 1.93x
Maintained full numerical parity with the baseline

Abstract

Optimizing Inference in Large Language Diffusion Mixture-of-Experts via Hardware-Aware Kernels

This work addresses the critical performance bottlenecks in diffusion-based Mixture-of-Experts (MoE) models, specifically focusing on the Large Language Diffusion with Masking (LLaDA) architecture. Due to the iterative nature of the denoising process, standard MoE implementations suffer from significant host-device synchronization overhead and fragmented memory access. We propose FastLLaDAMoE, an optimized framework that utilizes a Sort-Compute-Scatter pipeline and expert weight stacking to ensure contiguous GPU memory access.

Experimental evaluations on NVIDIA A100 hardware demonstrate a 1.89x reduction in CUDA execution time and a 1.93x improvement in memory bandwidth utilization while maintaining full numerical parity with the baseline. By transitioning the MoE forward pass from a memory-bound, CPU-bottlenecked state to a hardware-saturated regime, this work makes large-scale iterative alignment (e.g., GRPO) computationally feasible for diffusion-based language models.
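The abstract names the Sort-Compute-Scatter pipeline and expert weight stacking without showing them. The following is a minimal CPU sketch in NumPy of how such a dispatch could look; it is not the FastLLaDAMoE implementation. It assumes top-1 routing and a single linear layer per expert for brevity, and the function names `moe_naive` and `moe_sort_compute_scatter` are hypothetical. The per-expert masked loop stands in for a baseline, and the final check illustrates what numerical parity between the two paths means.

```python
import numpy as np

def moe_naive(x, expert_ids, weights):
    """Baseline sketch: visit each expert and gather its tokens with a boolean mask."""
    out = np.zeros((x.shape[0], weights[0].shape[1]), dtype=x.dtype)
    for e, w in enumerate(weights):
        mask = expert_ids == e
        if mask.any():
            out[mask] = x[mask] @ w  # scattered reads and writes per expert
    return out

def moe_sort_compute_scatter(x, expert_ids, stacked_w):
    """Sort tokens by expert, compute on contiguous slices, scatter results back."""
    num_experts = stacked_w.shape[0]

    # Sort: a stable permutation that groups tokens routed to the same expert.
    perm = np.argsort(expert_ids, kind="stable")
    x_sorted = x[perm]

    # Segment boundaries: expert e owns rows offsets[e]:offsets[e + 1].
    counts = np.bincount(expert_ids, minlength=num_experts)
    offsets = np.concatenate(([0], np.cumsum(counts)))

    # Compute: each expert reads one contiguous slice of tokens and one
    # slice of the stacked weight tensor.
    y_sorted = np.empty((x.shape[0], stacked_w.shape[2]), dtype=x.dtype)
    for e in range(num_experts):
        lo, hi = offsets[e], offsets[e + 1]
        y_sorted[lo:hi] = x_sorted[lo:hi] @ stacked_w[e]

    # Scatter: undo the permutation so row i again belongs to token i.
    out = np.empty_like(y_sorted)
    out[perm] = y_sorted
    return out

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
num_tokens, d_in, d_out, num_experts = 16, 8, 4, 4
x = rng.standard_normal((num_tokens, d_in))
expert_ids = rng.integers(0, num_experts, size=num_tokens)

# Expert weight stacking: one [num_experts, d_in, d_out] array instead of
# a Python list of separately allocated matrices.
weights = [rng.standard_normal((d_in, d_out)) for _ in range(num_experts)]
stacked_w = np.stack(weights)

baseline = moe_naive(x, expert_ids, weights)
fast = moe_sort_compute_scatter(x, expert_ids, stacked_w)
parity_ok = np.allclose(baseline, fast)
```

On a GPU, the per-expert loop in the compute step would more plausibly be a grouped or batched matrix multiply over the sorted buffer, with the sort and offsets kept on the device so that no host-device synchronization is needed; that part is an assumption about the design rather than something this page confirms.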

Authors

Alexey Manakonov
