CARL-MoE: technical report / preprint.

Sparse Mixture-of-Experts (MoE) models increase parameter capacity without proportional per-token computation by activating only a subset of experts for each token (Shazeer et al., 2017; Fedus et al., 2022; Du et al., 2022). In practice, however, training efficiency is often limited by three coupled issues: topology-oblivious routing, uneven expert utilization, and expensive expert-parallel communication (Lepikhin et al., 2021; Rajbhandari et al., 2022; Gale et al., 2023). Prior work has frequently improved routing, balancing, or distributed execution in isolation; this separation can create a mismatch between token-expert affinity and the actual cost of dispatching tokens across a heterogeneous cluster.

We present a unified framework for efficient MoE training that combines: (i) communication-aware routing, which adjusts router utilities using estimated dispatch cost; (ii) adaptive dual-level load balancing, which regularizes both expert-level and group-level load and adapts the balancing strength to observed skew; and (iii) communication-aware expert parallelism, including locality-biased hierarchical routing, a short Sinkhorn-based warm start, and periodic expert-placement refresh driven by accumulated routing statistics. The contribution is primarily integrative rather than a claim of first invention of any single mechanism. We formulate the method precisely, analyze its computational trade-offs, and report simulation-based experiments with exact computed values under a transparent communication model. Across the studied settings, the integrated method reduces simulated communication cost and load skew relative to topology-oblivious baselines while preserving routing selectivity. These results support the broader systems-ML thesis that MoE routing should be co-designed with cluster topology rather than optimized independently.

Archival record (OSF): DOI 10.17605/OSF.IO/3MF56; page: https://osf.io/3mf56/. Files include the technical report PDF and, when available, the LaTeX source tarball.
Haopeng Jin
Beijing University of Posts and Telecommunications
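To make the three mechanisms concrete, the sketches below show one plausible reading of each under stated assumptions; the report's exact formulation may differ. All function and parameter names (comm_aware_topk_route, lambda_comm, and so on) are illustrative rather than the paper's notation, and PyTorch is assumed only for convenience.

A minimal sketch of mechanism (i), communication-aware routing: router logits are penalized by an estimated per-expert dispatch cost before top-k selection, so near-ties resolve toward cheaper experts while strong token-expert affinities still win.

```python
import torch
import torch.nn.functional as F

def comm_aware_topk_route(logits, dispatch_cost, k=2, lambda_comm=0.1):
    """Top-k routing with a dispatch-cost penalty (illustrative).

    logits:        [tokens, experts] raw router scores
    dispatch_cost: [experts] estimated cost of sending a token from this
                   worker to each expert (e.g., normalized link latency)
    lambda_comm:   strength of the communication penalty
    """
    # Subtract the cost from the utilities; broadcasting applies the same
    # per-expert penalty to every token on this worker.
    adjusted = logits - lambda_comm * dispatch_cost
    topk_vals, topk_idx = torch.topk(adjusted, k, dim=-1)
    # Gate weights come from the adjusted utilities so the combine step
    # stays consistent with the dispatch decision.
    gates = F.softmax(topk_vals, dim=-1)
    return topk_idx, gates
```

For example, with logits = torch.randn(8, 16) and dispatch_cost = torch.rand(16), comm_aware_topk_route(logits, dispatch_cost) returns expert indices of shape [8, 2] and matching gate weights.

Mechanism (ii), adaptive dual-level load balancing, can be read as a Switch-style auxiliary loss applied at two granularities: per expert and per expert group (for example, experts co-located on one node). The skew-adaptive coefficient below is one plausible choice, not the report's stated rule.

```python
def dual_level_balance_loss(router_probs, assign_frac, group_index,
                            num_groups, base_alpha=1e-2):
    """Expert- and group-level balancing loss (illustrative).

    router_probs: [tokens, experts] softmax router probabilities
    assign_frac:  [experts] fraction of tokens dispatched to each expert
    group_index:  [experts] long tensor mapping each expert to its group
    """
    num_experts = router_probs.shape[-1]
    mean_prob = router_probs.mean(dim=0)  # [experts]
    expert_loss = num_experts * (assign_frac * mean_prob).sum()

    # Aggregate both statistics to the group level and reuse the same form.
    group_frac = torch.zeros(num_groups).scatter_add_(0, group_index, assign_frac)
    group_prob = torch.zeros(num_groups).scatter_add_(0, group_index, mean_prob)
    group_loss = num_groups * (group_frac * group_prob).sum()

    # One way to adapt strength to observed skew: scale by the coefficient
    # of variation of per-expert load (an assumption, not the paper's rule).
    skew = assign_frac.std() / assign_frac.mean().clamp_min(1e-8)
    alpha = base_alpha * (1.0 + skew)
    return alpha * (expert_loss + group_loss)
```

For mechanism (iii), the short Sinkhorn-based warm start can be sketched as a few alternating row/column normalizations of the token-expert score matrix in log space, yielding an approximately balanced soft assignment that can seed the hard router early in training.

```python
def sinkhorn_warm_start(logits, iters=3, tau=1.0):
    """A few Sinkhorn iterations over token-expert scores (illustrative).

    Alternately normalizes token rows and expert columns in the log
    domain; a handful of iterations suffices because only a soft,
    roughly balanced prior is needed, not an exact doubly stochastic
    assignment.
    """
    log_p = logits / tau
    for _ in range(iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # tokens
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # experts
    return log_p.exp()
```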