April 8, 2024 | Open Access

Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models

Abstract

Mixture-of-Experts (MoE) language models can reduce computational costs by 2-4× compared to dense models without sacrificing performance, making them more efficient in computation-bounded scenarios. However, MoE models generally require 2-4 times more parameters to achieve comparable performance to a dense model, which incurs larger GPU memory requirements and makes MoE models less efficient in I/O-bounded scenarios like autoregressive generation. In this work, we propose a hybrid dense training and sparse inference framework for MoE models (DS-MoE) which achieves strong computation and parameter efficiency by employing dense computation across all experts during training and sparse computation during inference. Our experiments on training LLMs demonstrate that our DS-MoE models are more parameter-efficient than standard sparse MoEs and are on par with dense models in terms of total parameter size and performance while being computationally cheaper (activating 30-40% of the model's parameters). Performance tests using vLLM show that our DS-MoE-6B model runs up to 1.86× faster than similar dense models like Mistral-7B, and between 1.50× and 1.71× faster than comparable MoEs, such as DeepSeekMoE-16B and Qwen1.5-MoE-A2.7B.
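The contrast between dense computation across all experts and sparse computation at inference can be illustrated with a toy layer. The following is a minimal NumPy sketch, not the authors' implementation: the function name `moe_layer`, the softmax router, the `tanh` experts, the top-k selection rule, and the choice not to renormalize the kept gates are all assumptions made for illustration. With `top_k=None` every expert processes every token; with `top_k=k` only the k highest-scoring experts per token are evaluated.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(x, router_w, expert_ws, top_k=None):
    """Toy MoE layer (hypothetical, for illustration only).

    x:         (tokens, d) input activations
    router_w:  (d, n_experts) router weights
    expert_ws: (n_experts, d, d) one weight matrix per expert
    top_k:     None -> dense mode, all experts run on every token
               k    -> sparse mode, only the k top-gated experts run per token
    Returns the layer output and the number of (token, expert) evaluations.
    """
    gates = softmax(x @ router_w)              # (tokens, n_experts)
    n_experts = gates.shape[1]

    if top_k is not None:
        # Assumption: keep the top_k gates per token and zero the rest,
        # without renormalizing the surviving gate values.
        idx = np.argsort(gates, axis=1)[:, -top_k:]
        mask = np.zeros_like(gates)
        np.put_along_axis(mask, idx, 1.0, axis=1)
        gates = gates * mask

    out = np.zeros_like(x)
    expert_calls = 0
    for e in range(n_experts):
        active = gates[:, e] > 0               # tokens routed to expert e
        if not active.any():
            continue
        out[active] += gates[active, e:e + 1] * np.tanh(x[active] @ expert_ws[e])
        expert_calls += int(active.sum())
    return out, expert_calls

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                    # 4 tokens, hidden size 8
router_w = rng.normal(size=(8, 4))             # 4 experts
expert_ws = rng.normal(size=(4, 8, 8))

dense_out, dense_calls = moe_layer(x, router_w, expert_ws)             # training-style pass
sparse_out, sparse_calls = moe_layer(x, router_w, expert_ws, top_k=2)  # inference-style pass
print(dense_calls, sparse_calls)
```

In this sketch the parameter count is identical in both modes; only the number of expert evaluations per token changes, which is the sense in which a model can match a dense model in total size while activating a fraction of its parameters at inference.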

Cite this study

Pan et al. (2024). Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models.

www.synapsesocial.com/papers/68e700dcb6db64358767a675 — DOI: https://doi.org/10.48550/arxiv.2404.05567

Authors

Bowen Pan

Yikang Shen

Haokun Liu
