CARL-MoE: technical report / preprint.

Sparse Mixture-of-Experts (MoE) models increase parameter capacity without proportional per-token computation by activating only a subset of experts for each token (Shazeer et al., 2017; Fedus et al., 2022; Du et al., 2022). In practice, however, training efficiency is often limited by three coupled issues: topology-oblivious routing, uneven expert utilization, and expensive expert-parallel communication (Lepikhin et al., 2021; Rajbhandari et al., 2022; Gale et al., 2023). Prior work has frequently improved routing, balancing, or distributed execution in isolation; this separation can create a mismatch between token-expert affinity and the actual cost of dispatching tokens across a heterogeneous cluster.

We present a unified framework for efficient MoE training that combines: (i) communication-aware routing, which adjusts router utilities using estimated dispatch cost; (ii) adaptive dual-level load balancing, which regularizes both expert-level and group-level load and adapts the balancing strength to observed skew; and (iii) communication-aware expert parallelism, including locality-biased hierarchical routing, a short Sinkhorn-based warm start, and periodic expert-placement refresh driven by accumulated routing statistics. The contribution is primarily integrative rather than a claim of first invention of any single mechanism. We formulate the method precisely, analyze its computational trade-offs, and report simulation-based experiments with exact computed values under a transparent communication model. Across the studied settings, the integrated method reduces simulated communication cost and load skew relative to topology-oblivious baselines while preserving routing selectivity. These results support the broader systems-ML thesis that MoE routing should be co-designed with cluster topology rather than optimized independently.

Archival record (OSF): DOI 10.17605/OSF.IO/3MF56; page: https://osf.io/3mf56/. Files include the technical report PDF and, when available, the LaTeX source tarball.
Haopeng Jin
Beijing University of Posts and Telecommunications
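To make the three mechanisms concrete, the sketches below show one plausible reading of each under stated assumptions; the report's exact formulation may differ. All function and parameter names (comm_aware_topk_route, lambda_comm, and so on) are illustrative rather than the paper's notation, and PyTorch is assumed only for convenience.

A minimal sketch of mechanism (i), communication-aware routing: router logits are penalized by an estimated per-expert dispatch cost before top-k selection, so near-ties resolve toward cheaper experts while strong token-expert affinities still win.

```python
import torch
import torch.nn.functional as F

def comm_aware_topk_route(logits, dispatch_cost, k=2, lambda_comm=0.1):
    """Top-k routing with a dispatch-cost penalty (illustrative).

    logits:        [tokens, experts] raw router scores
    dispatch_cost: [experts] estimated cost of sending a token from this
                   worker to each expert (e.g., normalized link latency)
    lambda_comm:   strength of the communication penalty
    """
    # Subtract the cost from the utilities; broadcasting applies the same
    # per-expert penalty to every token on this worker.
    adjusted = logits - lambda_comm * dispatch_cost
    topk_vals, topk_idx = torch.topk(adjusted, k, dim=-1)
    # Gate weights come from the adjusted utilities so the combine step
    # stays consistent with the dispatch decision.
    gates = F.softmax(topk_vals, dim=-1)
    return topk_idx, gates
```

For example, with logits = torch.randn(8, 16) and dispatch_cost = torch.rand(16), comm_aware_topk_route(logits, dispatch_cost) returns expert indices of shape [8, 2] and matching gate weights.

Mechanism (ii), adaptive dual-level load balancing, can be read as a Switch-style auxiliary loss applied at two granularities: per expert and per expert group (for example, experts co-located on one node). The skew-adaptive coefficient below is one plausible choice, not the report's stated rule.

```python
def dual_level_balance_loss(router_probs, assign_frac, group_index,
                            num_groups, base_alpha=1e-2):
    """Expert- and group-level balancing loss (illustrative).

    router_probs: [tokens, experts] softmax router probabilities
    assign_frac:  [experts] fraction of tokens dispatched to each expert
    group_index:  [experts] long tensor mapping each expert to its group
    """
    num_experts = router_probs.shape[-1]
    mean_prob = router_probs.mean(dim=0)  # [experts]
    expert_loss = num_experts * (assign_frac * mean_prob).sum()

    # Aggregate both statistics to the group level and reuse the same form.
    group_frac = torch.zeros(num_groups).scatter_add_(0, group_index, assign_frac)
    group_prob = torch.zeros(num_groups).scatter_add_(0, group_index, mean_prob)
    group_loss = num_groups * (group_frac * group_prob).sum()

    # One way to adapt strength to observed skew: scale by the coefficient
    # of variation of per-expert load (an assumption, not the paper's rule).
    skew = assign_frac.std() / assign_frac.mean().clamp_min(1e-8)
    alpha = base_alpha * (1.0 + skew)
    return alpha * (expert_loss + group_loss)
```

For mechanism (iii), the short Sinkhorn-based warm start can be sketched as a few alternating row/column normalizations of the token-expert score matrix in log space, yielding an approximately balanced soft assignment that can seed the hard router early in training.

```python
def sinkhorn_warm_start(logits, iters=3, tau=1.0):
    """A few Sinkhorn iterations over token-expert scores (illustrative).

    Alternately normalizes token rows and expert columns in the log
    domain; a handful of iterations suffices because only a soft,
    roughly balanced prior is needed, not an exact doubly stochastic
    assignment.
    """
    log_p = logits / tau
    for _ in range(iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # tokens
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # experts
    return log_p.exp()
```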