For Mixture-of-Experts (MoE) models, an unbalanced expert load leads to routing collapse or increased computational overhead. Existing methods commonly employ an auxiliary loss to encourage load balance, but a large auxiliary loss introduces non-negligible interference gradients into training and thus compromises model performance. To control load balance while avoiding undesired gradients during training, we propose Loss-Free Balancing, an auxiliary-loss-free load balancing strategy. Specifically, before the top-K routing decision, Loss-Free Balancing first applies an expert-wise bias to the routing score of each expert. By dynamically updating each expert's bias according to its recent load, Loss-Free Balancing consistently maintains a balanced distribution of expert load. Moreover, since Loss-Free Balancing produces no interference gradients, it also raises the upper bound of model performance attainable from MoE training. We validate Loss-Free Balancing on MoE models with up to 3B parameters trained on up to 200B tokens. Experimental results show that Loss-Free Balancing achieves both better performance and better load balance than traditional auxiliary-loss-based load balancing strategies.
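The mechanism described above (an expert-wise bias added to routing scores before the top-K selection, then updated from recent load) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the NumPy setting, and the sign-based update rule with an assumed `update_rate` hyperparameter are simplifications for exposition.

```python
import numpy as np

def loss_free_balancing_step(scores, bias, k, update_rate=0.001):
    """One routing step of an auxiliary-loss-free balancing sketch.

    scores: (num_tokens, num_experts) raw gating scores
    bias:   (num_experts,) expert-wise bias, used only for expert selection
    k:      number of experts chosen per token
    Returns the chosen expert indices and the updated bias.
    """
    # Apply the expert-wise bias before the top-K routing decision.
    # Note: the bias affects which experts are selected, not the gating
    # weights, so it injects no gradient into training.
    biased = scores + bias

    # Select the top-k experts per token based on the biased scores.
    topk = np.argpartition(-biased, k, axis=1)[:, :k]

    # Measure each expert's load (number of tokens assigned to it).
    num_experts = scores.shape[1]
    load = np.bincount(topk.ravel(), minlength=num_experts)

    # Update the bias toward balance: lower it for overloaded experts,
    # raise it for underloaded ones (sign of the load error).
    mean_load = load.mean()
    bias = bias + update_rate * np.sign(mean_load - load)
    return topk, bias
```

Because the bias only reorders the selection and never enters the loss, the gradient path through the gating weights stays untouched, which is the point of the method.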
Lean Wang
Huazuo Gao
Chenggang Zhao
Wang et al. studied this problem.
www.synapsesocial.com/papers/68e5a81fb6db643587542c9a — DOI: https://doi.org/10.48550/arxiv.2408.15664