We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.
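The abstract describes a sparse Mixture-of-Experts design in which each token activates only a small subset of the model's expert MLPs, which is why only about 1B of the 7B parameters are used per input token. The sketch below is a minimal, illustrative top-k MoE layer in PyTorch; the hidden sizes, expert count, and top-k value are placeholder assumptions for demonstration, not OLMoE's actual configuration, which is documented in the paper and the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    """Sparse Mixture-of-Experts feed-forward layer with top-k routing.

    Illustrative values only; the real OLMoE hyperparameters are given in
    the paper and open-sourced training code.
    """

    def __init__(self, d_model=1024, d_ff=2048, n_experts=64, top_k=8):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):
        # x: (n_tokens, d_model). Each token is routed to its top-k experts,
        # so only a fraction of the layer's parameters is active per token.
        probs = F.softmax(self.router(x), dim=-1)              # (n_tokens, n_experts)
        weights, expert_ids = probs.topk(self.top_k, dim=-1)   # (n_tokens, top_k)
        # Renormalize over the chosen experts (one common convention).
        weights = weights / weights.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = expert_ids[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out


# Usage: 16 tokens of width 1024; each token runs through only top_k of the
# n_experts expert MLPs.
tokens = torch.randn(16, 1024)
layer = MoELayer()
print(layer(tokens).shape)  # torch.Size([16, 1024])
```

The per-expert loop keeps the routing logic explicit; production MoE implementations instead batch tokens per expert for efficiency, but the computation being sketched is the same.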
Niklas Muennighoff
Luca Soldaini
Dirk Groeneveld
Muennighoff et al. (2024) — www.synapsesocial.com/papers/68e597d2b6db6435875323ba — DOI: https://doi.org/10.48550/arxiv.2409.02060