Mixture of Experts (MoE) architectures have emerged as a fundamental framework in contemporary deep learning, enabling scalable conditional computation through the dynamic activation of a sparse subset of expert subnetworks. By decoupling capacity from computational cost, MoEs achieve unprecedented parameter efficiency while maintaining or exceeding the predictive performance of dense models. This survey presents an in-depth theoretical and empirical analysis of MoE models, with particular emphasis on their structural properties, functional capacity, and training dynamics. We formally define the general MoE function class as $f(x) = \sum_{m=1}^{M} G_m(x)\, E_m(x)$, where the $E_m$ are expert networks and the $G_m$ are gating coefficients satisfying the sparsity constraint $\|G(x)\|_0 \le k \ll M$. We explore the approximation capabilities of MoEs, proving that under mild assumptions on the gating and expert classes, such models form a universal approximator family. Furthermore, we investigate the effective capacity scaling of MoEs, showing that their VC-dimension and Rademacher complexity grow with the number of experts $M$, while per-example compute remains bounded by $k$. The survey categorizes MoE designs into hard vs. soft gating, static vs. dynamic routing, and shallow vs. hierarchical expert arrangements, and evaluates their impact on optimization and generalization. We analyze challenges unique to MoEs, including expert collapse, routing instability, and irregular communication overheads. Recent advances such as Switch Transformers, GShard, V-MoE, and Token Routing are reviewed in the context of these challenges. Finally, we articulate open problems and research frontiers, including optimal gating function design, continual learning via expert expansion, modular interpretability, and the theoretical limits of sparse mixture modeling. This survey aims to provide a unified mathematical foundation and future outlook for Mixture of Experts as a scalable, modular paradigm for efficient and adaptive artificial intelligence.
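As a minimal sketch of the function class $f(x) = \sum_{m=1}^{M} G_m(x)\, E_m(x)$ with $\|G(x)\|_0 \le k \ll M$, the following Python snippet implements hard top-$k$ gating with a renormalized softmax over the selected experts. All names, shapes, and the choice of linear experts are illustrative assumptions, not the survey's reference implementation.

```python
# Illustrative sketch of a sparse MoE layer: f(x) = sum_m G_m(x) E_m(x),
# with hard top-k gating so that ||G(x)||_0 <= k << M.
# Names and shapes are hypothetical; experts are plain linear maps here.
import numpy as np

rng = np.random.default_rng(0)
d, M, k = 8, 16, 2                               # input dim, experts, active experts

W_gate = rng.normal(size=(d, M))                 # router (gating) parameters
W_experts = rng.normal(size=(M, d, d))           # one linear expert per slot

def moe_forward(x):
    """Compute f(x) with top-k gating, renormalized over the chosen experts."""
    logits = x @ W_gate                          # (M,) routing scores
    topk = np.argsort(logits)[-k:]               # indices of the k largest scores
    gates = np.zeros(M)
    gates[topk] = np.exp(logits[topk] - logits[topk].max())
    gates[topk] /= gates[topk].sum()             # softmax restricted to top-k
    # Only the k selected experts are evaluated, so per-example compute
    # scales with k rather than with the total expert count M.
    return sum(gates[m] * (W_experts[m] @ x) for m in topk)

x = rng.normal(size=d)
print(moe_forward(x).shape)                      # (8,)
```

In practice the router and experts are trained jointly, often with an auxiliary load-balancing loss to counteract the expert-collapse and routing-instability issues discussed in the survey; this sketch omits training entirely and only illustrates the sparse forward pass.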
Yusuf Midha, Harnani Husni, Fawzi Gamal
DOI: https://doi.org/10.20944/preprints202508.0288.v1