What question did this study set out to answer?

This research aims to optimize Mixture of Experts (MoE) models by addressing knowledge redundancy and token overflow during training.

March 14, 2026 · Open Access

Consensus-Expert DynamicMoE: ARIMA-based Capacity Prediction with Adaptive Load Balancing for Sparse Models

Key Points

Proposed a dynamic capacity allocation strategy in MoE structures.
Utilized the ARIMA time series prediction algorithm to forecast future load distribution.
Reconstructed the loss function to improve load balancing without strict constraints.
Conducted experiments on the Switch Transformer-base-8 model across 9 GLUE benchmark tasks.
Achieved an average accuracy of 81.55% across the 9 GLUE benchmark tasks, an improvement of 1.07% over the baseline.
Observed notable improvements of up to 1.8% on imbalanced datasets such as MRPC.
Demonstrated the method’s effectiveness in further experiments with 16 and 32 experts.
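
To make the token overflow problem concrete, the sketch below counts overflowed tokens and idle slots under a fixed per-expert capacity, using the usual Switch Transformer definition of capacity as (tokens per batch / number of experts) × capacity factor. It is only an illustration: the function names, the rounding with `ceil`, and the skewed routing example are assumptions, not taken from the paper.

```python
import math
from collections import Counter


def static_capacity(num_tokens, num_experts, capacity_factor=1.0):
    """Per-expert token budget under a fixed capacity factor (rounding up is an assumption)."""
    return math.ceil(num_tokens / num_experts * capacity_factor)


def count_overflow(assignments, num_experts, capacity):
    """Return (overflowed tokens, idle slots) for top-1 routing with a fixed capacity."""
    load = Counter(assignments)
    overflow = sum(max(load[e] - capacity, 0) for e in range(num_experts))
    idle = sum(max(capacity - load[e], 0) for e in range(num_experts))
    return overflow, idle


# Hypothetical skewed routing: 16 tokens, 4 experts, expert 0 receives most tokens.
assignments = [0] * 9 + [1] * 4 + [2] * 2 + [3] * 1

capacity = static_capacity(len(assignments), num_experts=4, capacity_factor=1.0)
overflow, idle = count_overflow(assignments, 4, capacity)

# Doubling the capacity factor reduces overflow but leaves many slots unused.
big_capacity = static_capacity(len(assignments), num_experts=4, capacity_factor=2.0)
big_overflow, big_idle = count_overflow(assignments, 4, big_capacity)
```

In this toy case a capacity factor of 1.0 overflows five tokens while five slots sit idle on other experts; a factor of 2.0 cuts overflow to one token at the cost of seventeen idle slots, which is the trade-off that motivates allocating capacity per expert rather than uniformly.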

Abstract

Mixture of Experts (MoE) models reduce computational costs through a sparse activation strategy, which has made them widely adopted in large-scale model training. However, MoE structures still face issues such as knowledge redundancy and token overflow caused by static routing. Existing solutions typically increase expert capacity or apply stricter loss constraints to balance expert loads. These approaches can waste computational resources, either through complex loss functions or through expert capacity that sits idle. This paper proposes an optimization strategy that combines consensus experts with dynamic capacity allocation. First, within the consensus-expert MoE structure, the ARIMA time series prediction algorithm is used to predict the future load distribution, and expert capacity is allocated dynamically to alleviate routing fluctuations during the initial training phase. Second, by comparing the predicted load with the actual load, the loss function is reconstructed to avoid pursuing absolute balance. Experiments on the Switch Transformer-base-8 model show that our method achieves an average accuracy of 81.55% across 9 GLUE benchmark tasks, an improvement of 1.07% over the baseline. The gains are notable on imbalanced datasets (e.g., MRPC), with improvements of up to 1.8%. Further experiments demonstrate the method’s effectiveness with 16 and 32 experts.
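
The idea of forecasting each expert's load and sizing its capacity accordingly can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not give the ARIMA order, so the sketch uses a hand-rolled ARIMA(1,1,0) without a constant (an AR(1) model fitted by least squares on first differences), and the proportional allocation rule, the function names, and the load histories are all hypothetical. A real setup would more likely rely on a time series library.

```python
def fit_ar1_on_differences(series):
    """Least-squares AR(1) coefficient on first differences, i.e. ARIMA(1,1,0) with no constant."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    num = sum(x * y for x, y in zip(diffs, diffs[1:]))
    den = sum(x * x for x in diffs[:-1])
    phi = num / den if den else 0.0  # flat history: no trend to extrapolate
    return phi, diffs


def forecast_next_load(series):
    """One-step-ahead forecast: last observed load plus the predicted next difference."""
    if len(series) < 3:
        return float(series[-1])  # too little history, fall back to persistence
    phi, diffs = fit_ar1_on_differences(series)
    return series[-1] + phi * diffs[-1]


def allocate_capacity(load_history, total_slots, min_slots=1):
    """Split a slot budget across experts in proportion to forecast load (assumed rule).

    Because of rounding and the minimum, the result may not sum exactly to total_slots.
    """
    forecasts = [max(forecast_next_load(h), 0.0) for h in load_history]
    total = sum(forecasts) or 1.0
    return [max(min_slots, round(total_slots * f / total)) for f in forecasts]


# Hypothetical token counts per expert over the last four training steps.
load_history = [
    [10, 12, 14, 16],  # load rising
    [10, 10, 10, 10],  # load stable
    [12, 10, 8, 6],    # load falling
]

capacities = allocate_capacity(load_history, total_slots=32)
```

With these histories the forecasts are 18, 10, and 4 tokens, so the 32 available slots are split as 18, 10, and 4 instead of a uniform share for each expert. The loss reconstruction described in the abstract, which compares predicted and actual load, is not covered by this sketch.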

Authors

Jia-Lin Wen

Xiao-Jun Li

Junping Yao

Journals

International Journal of Computational Intelligence Systems
