Mixture-of-Experts (MoE) has gained increasing popularity as a promising framework for scaling up large language models (LLMs). However, training MoE from scratch at large scale still suffers from data-hunger and instability problems. Motivated by this limitation, we investigate building MoE models from existing dense large language models. Specifically, based on the well-known LLaMA-2 7B model, we obtain an MoE model through: (1) Expert Construction, which partitions the parameters of the original Feed-Forward Networks (FFNs) into multiple experts; and (2) Continual Pre-training, which further trains the transformed MoE model together with the additional gate networks. In this paper, we comprehensively explore different methods for expert construction and various data sampling strategies for continual pre-training. After these stages, our LLaMA-MoE models maintain language abilities and route input tokens to specific experts, activating only part of the parameters. Empirically, after training on 200B tokens, LLaMA-MoE-3.5B models significantly outperform dense models with a similar number of activated parameters. The source code and models are available at https://github.com/pjlab-sys4nlp/llama-moe .
Tong Zhu, Xiaoye Qu, Daize Dong
DOI: https://doi.org/10.48550/arxiv.2406.16554
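To make the two-stage recipe in the abstract concrete, below is a minimal PyTorch sketch of step (1), Expert Construction, plus top-k gating. It assumes a LLaMA-style SwiGLU FFN with `gate_proj`, `up_proj`, and `down_proj` linear layers; the class name, the random non-overlapping neuron split, and the routing details are illustrative assumptions, not the repository's actual implementation (the paper compares several partitioning methods).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFromDenseFFN(nn.Module):
    """Hypothetical sketch: split a dense SwiGLU FFN into experts plus a gate."""

    def __init__(self, dense_ffn: nn.Module, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        d_ff, d_model = dense_ffn.up_proj.weight.shape  # (intermediate, hidden)
        assert d_ff % num_experts == 0, "neurons must split evenly across experts"
        self.top_k = top_k
        # Neuron-independent split: each intermediate neuron goes to one expert.
        shards = torch.randperm(d_ff).chunk(num_experts)
        self.experts = nn.ModuleList()
        for idx in shards:
            expert = nn.ModuleDict({
                "gate_proj": nn.Linear(d_model, len(idx), bias=False),
                "up_proj": nn.Linear(d_model, len(idx), bias=False),
                "down_proj": nn.Linear(len(idx), d_model, bias=False),
            })
            # Copy the matching rows/columns of the dense FFN weights.
            expert["gate_proj"].weight.data.copy_(dense_ffn.gate_proj.weight.data[idx])
            expert["up_proj"].weight.data.copy_(dense_ffn.up_proj.weight.data[idx])
            expert["down_proj"].weight.data.copy_(dense_ffn.down_proj.weight.data[:, idx])
            self.experts.append(expert)
        # Newly initialized gate network, trained during continual pre-training.
        self.router = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model). Route each token to its top-k experts.
        probs = F.softmax(self.router(x), dim=-1)
        weights, chosen = probs.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            # Gather all tokens whose top-k choices include expert e.
            token_idx, slot = (chosen == e).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue
            xe = x[token_idx]
            h = expert["down_proj"](
                F.silu(expert["gate_proj"](xe)) * expert["up_proj"](xe)
            )
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * h
        return out
```

Because each intermediate neuron is assigned to exactly one expert, the transformed model keeps the dense FFN's parameters intact, adding only a small router that is trained during continual pre-training, while each token activates only about top_k / num_experts of the FFN parameters.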