Despite the significant breakthroughs of Mixture-of-Experts (MoE) architectures, the increasing scale of MoE models presents substantial memory and storage challenges. Existing MoE pruning methods reduce parameter count by applying a uniform sparsity across all layers, which often leads to suboptimal results and performance degradation because expert redundancy varies from one MoE layer to another. To address this, we propose a non-uniform pruning strategy, dubbed Differentiable Expert Pruning (DiEP), which adaptively adjusts pruning rates at the layer level while jointly learning inter-layer importance, thereby capturing the varying redundancy across MoE layers. By transforming the global discrete search space into a continuous one, our method handles the exponentially growing number of non-uniform expert combinations and enables adaptive gradient-based pruning. Extensive experiments on five advanced MoE models demonstrate the efficacy of our method across various NLP tasks. Notably, DiEP retains around 92% of the original performance on Mixtral 8x7B with only half the experts, outperforming other pruning methods by up to 7.1% on the challenging MMLU dataset.
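To make the idea of non-uniform, layer-level pruning concrete, here is a minimal sketch in plain Python. It is not the DiEP algorithm itself: the abstract does not give the objective or the update rule, so the sketch simply assumes that some continuous relaxation has already produced per-expert logits (`alpha`) and per-layer importance logits (`beta`), both hypothetical names, and shows how a single global budget can then yield different pruning rates in different layers. The scoring rule (layer weight times within-layer softmax) and the `min_per_layer` floor are illustrative assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_experts(alpha, beta, keep_total, min_per_layer=1):
    """Pick which experts to keep under one global budget.

    alpha: per-layer lists of expert logits (assumed already learned).
    beta: per-layer importance logits (assumed already learned).
    keep_total: total number of experts to keep across all layers.
    min_per_layer: floor so that no layer loses every expert (an assumption).
    Returns a dict mapping layer index -> sorted list of kept expert indices.
    """
    num_layers = len(alpha)
    if keep_total < num_layers * min_per_layer:
        raise ValueError("budget is too small for the per-layer floor")

    layer_w = softmax(beta)
    # Global ranking: layer weight times within-layer expert probability.
    # Ties are broken by (layer, expert) index so the result is deterministic.
    ranked = sorted(
        (
            (layer_w[l] * p, l, e)
            for l, logits in enumerate(alpha)
            for e, p in enumerate(softmax(logits))
        ),
        key=lambda t: (-t[0], t[1], t[2]),
    )

    kept = {l: [] for l in range(num_layers)}
    # Pass 1: guarantee the per-layer floor using each layer's best experts.
    for _, l, e in ranked:
        if len(kept[l]) < min_per_layer:
            kept[l].append(e)

    # Pass 2: spend the remaining budget on the globally best experts.
    budget = keep_total - sum(len(v) for v in kept.values())
    for _, l, e in ranked:
        if budget == 0:
            break
        if e not in kept[l]:
            kept[l].append(e)
            budget -= 1

    return {l: sorted(v) for l, v in kept.items()}


# Toy example: layer 0 has one dominant expert, layer 1 has four equal ones.
alpha = [[3.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
beta = [0.0, 0.0]
kept = select_experts(alpha, beta, keep_total=4)  # keep half of 8 experts
print(kept)
```

In this toy setting the global budget of four experts is split unevenly: the layer with one dominant expert keeps a single expert, while the layer whose experts are equally useful keeps three. A uniform 50% rule would instead force two experts per layer regardless of redundancy.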
Sikai Bai
H. J. Li
Jie Zhang
Bai et al. (Fri,) studied this question.
www.synapsesocial.com/papers/68e040eda99c246f578b3452 — DOI: https://doi.org/10.48550/arxiv.2509.16105