Mixture of Experts (MoE) models reduce computational cost through sparse activation, which has led to their wide adoption in large-scale model training. However, MoE structures still suffer from knowledge redundancy and from token overflow caused by static routing. Existing solutions typically increase expert capacity or apply stricter loss constraints to balance expert loads. These approaches can waste computational resources, both through complex loss functions and through idle expert capacity. This paper proposes an optimization strategy that combines consensus experts with dynamic capacity allocation. First, within an MoE structure that includes consensus experts, the ARIMA time-series forecasting algorithm predicts the future load distribution, and expert capacity is allocated dynamically to alleviate routing fluctuations during the initial training phase. Second, the loss function is reconstructed around the gap between predicted and actual load, so that training no longer pursues absolute balance. Experiments on the Switch Transformer-base-8 model show that the method achieves an average accuracy of 81.55% across nine GLUE benchmark tasks, an improvement of 1.07% over the baseline. The gains are most notable on imbalanced datasets (e.g., MRPC), with improvements of up to 1.8%. Further experiments demonstrate the method's effectiveness with 16 and 32 experts.
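The two steps described in the abstract, forecasting per-expert load and then sizing capacity and the loss from that forecast, can be illustrated with a minimal sketch. It is not the authors' implementation. The forecaster below is a simple drift extrapolation standing in for ARIMA, and the function names (`forecast_next_load`, `allocate_capacity`, `load_deviation_loss`), the proportional allocation rule, and the squared-error form of the loss are all assumptions made for illustration.

```python
from typing import List


def forecast_next_load(history: List[float], window: int = 3) -> float:
    """Stand-in forecaster (NOT ARIMA): last value plus the mean recent change."""
    if len(history) < 2:
        return history[-1]
    diffs = [b - a for a, b in zip(history[:-1], history[1:])][-window:]
    return max(0.0, history[-1] + sum(diffs) / len(diffs))


def allocate_capacity(predicted: List[float], total_capacity: int,
                      min_capacity: int = 1) -> List[int]:
    """Split a fixed token budget across experts in proportion to predicted load.

    Assumed rule: every expert keeps `min_capacity` slots; the rest is shared
    proportionally, and rounding leftovers go to the busiest experts.
    """
    n = len(predicted)
    total = sum(predicted)
    shares = [p / total for p in predicted] if total > 0 else [1.0 / n] * n
    spare = total_capacity - min_capacity * n
    caps = [min_capacity + int(spare * s) for s in shares]
    leftover = max(0, total_capacity - sum(caps))
    busiest_first = sorted(range(n), key=lambda i: predicted[i], reverse=True)
    for i in busiest_first[:leftover]:
        caps[i] += 1
    return caps


def load_deviation_loss(predicted: List[float], actual: List[float]) -> float:
    """Hypothetical auxiliary loss: mean squared gap between the actual and
    predicted load fractions, rather than the gap to a uniform distribution."""
    p_tot, a_tot = sum(predicted), sum(actual)
    gaps = [(a / a_tot - p / p_tot) ** 2 for p, a in zip(predicted, actual)]
    return sum(gaps) / len(gaps)


# Token counts routed to three experts over the last three steps (made-up data).
histories = [[10, 12, 14], [30, 28, 26], [20, 20, 20]]
predicted = [forecast_next_load(h) for h in histories]
capacities = allocate_capacity(predicted, total_capacity=64)
loss = load_deviation_loss(predicted, actual=[15, 25, 20])
```

In this sketch the loss is zero when the observed routing matches the forecast, even if the forecast itself is uneven, which is one way to read "avoid pursuing absolute balance". A real ARIMA forecaster would replace `forecast_next_load` without changing the other two functions.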
Jia-Lin Wen
Xiao-Jun Li
Junping Yao
International Journal of Computational Intelligence Systems
Wen et al. studied this question.
www.synapsesocial.com/papers/69b4fc0eb39f7826a300cb5f — DOI: https://doi.org/10.1007/s44196-026-01236-9