February 12, 2024 · Open Access

Scaling Laws for Fine-Grained Mixture of Experts

Abstract

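Mixture of Experts (MoE) models have emerged as a primary solution for reducing the computational cost of large language models. In this work, we analyze their scaling properties across an expanded range of variables. Specifically, we introduce a new hyperparameter, granularity, whose adjustment enables precise control over the size of the experts. Building on this, we establish scaling laws for fine-grained MoE that account for the number of training tokens, the model size, and the granularity. Using these laws, we derive the optimal training configuration for a given compute budget. Our findings show not only that MoE models consistently outperform dense Transformers, but also that the efficiency gap between dense and MoE models widens as model size and training budget grow. Furthermore, we demonstrate that the common practice of setting the expert size in MoE equal to the size of the feed-forward layer is suboptimal at almost any compute budget.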
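As a rough illustration of how a granularity hyperparameter can control expert size, the sketch below assumes that granularity G is the ratio of the feed-forward hidden size to the expert hidden size, and that the number of experts and the number of experts routed per token are both scaled by G so that total and active expert parameters stay roughly constant. The function and parameter names (`expert_hidden_size`, `fine_grained_config`, `base_experts`, `base_top_k`) are hypothetical and are not taken from this page.

```python
def expert_hidden_size(d_ff: int, granularity: int) -> int:
    """Hidden size of one expert, ASSUMING granularity G = d_ff / d_expert."""
    if granularity < 1 or d_ff % granularity != 0:
        raise ValueError("granularity must be a positive divisor of d_ff")
    return d_ff // granularity


def fine_grained_config(d_ff: int, base_experts: int, base_top_k: int,
                        granularity: int) -> dict:
    """Illustrative MoE layer configuration for a given granularity.

    ASSUMPTION: experts shrink by a factor of G while the expert count and
    the number of experts routed per token both grow by G, so the total
    and active expert hidden units are unchanged.
    """
    return {
        "d_expert": expert_hidden_size(d_ff, granularity),
        "num_experts": base_experts * granularity,
        "top_k": base_top_k * granularity,
    }


# G = 1 corresponds to the common practice of expert size == feed-forward size.
standard = fine_grained_config(d_ff=4096, base_experts=8, base_top_k=1, granularity=1)
fine = fine_grained_config(d_ff=4096, base_experts=8, base_top_k=1, granularity=8)
```

Under these assumptions, `standard` describes experts as large as the feed-forward layer, while `fine` describes eight times as many experts that are each one eighth of that size, with the same number of hidden units active per token.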

Authors

Jakub Krajewski

Jan Ludziejewski

Kamil Adamczewski