What question did this study set out to answer?

This research aims to develop a scalable local learning framework for large language models by eliminating global dependencies in backpropagation.

April 15, 2026 · Open Access

Product of Experts as Scalable Local Learning: From Theory to Production LLM Training

Key Points

The paper uses the Product of Experts (PoE) framework as a principled foundation for local learning.
It validates the approach theoretically and empirically across four architecture families: MLPs, CNNs, ResNets, and Transformers.
Evaluation covers MNIST, CIFAR-10, WikiText-2, and ClimbMix.
PoE local learning reaches 98.00% accuracy on MNIST, compared with 97.80% for backpropagation.
On continual learning tasks, PoE achieves 2.1x the performance of EWC.
Clustered local learning reaches a 6.6% BPB gap on an 897M-parameter GPT trained on 5.2B tokens of ClimbMix-400B.

Abstract

Backpropagation requires global state: a single loss propagated through all layers, activations retained across the entire network, and gradient synchronization across all parameters. We show that the Product of Experts (PoE) framework provides a principled foundation for local learning that eliminates this global dependency while maintaining competitive quality across four architecture families (MLPs, CNNs, ResNets, Transformers) and scales (30M to 897M parameters). Three results structure the paper:

1. Theory and small-scale validation: PoE local learning matches or exceeds backpropagation on MNIST (98.00% vs 97.80%), closes to within 1.25% on CIFAR-10, and achieves a 12% gap on WikiText-2. On continual learning, PoE achieves 2.1x the performance of EWC.
2. Systems advantages: Eliminating global state enables lossless layer pruning, bubble-free pipeline parallelism (eliminating up to 99% of pipeline bubbles), elastic depth scaling, and parallel stage branching (stage-level Mixture of Experts).
3. Production scaling via clustered local learning: Grouping layers into multi-layer stages with intra-stage gradient flow and inter-stage detachment achieves a 6.6% BPB gap on an 897M-parameter GPT trained on 5.2B tokens of ClimbMix-400B, with 1.33x training overhead.

Despite the BPB gap, the PoE model demonstrates superior factual recall (4 wins vs 3 losses on 26 QA prompts) and enables adaptive compute via stage prefix pruning (25% compute answers 62.5% of queries correctly).
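The abstract does not spell out how the per-stage experts are combined, so the following is only a minimal sketch of the general Product of Experts idea, not the paper's implementation. It assumes each stage emits logits over the same set of classes and that the combined prediction is the renormalized product of the stage distributions. The function names and the `num_stages` argument are hypothetical; `num_stages` illustrates how keeping only a prefix of stages could trade compute for accuracy.

```python
import math


def log_softmax(logits):
    """Return log-probabilities for one list of logits (numerically stable)."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def product_of_experts(expert_logits, num_stages=None):
    """Combine per-stage logits as a normalized product of distributions.

    Multiplying probabilities is the same as summing log-probabilities,
    so the product is formed in log space and renormalized at the end.
    `num_stages` (hypothetical) keeps only the first few experts.
    """
    experts = expert_logits if num_stages is None else expert_logits[:num_stages]
    if not experts:
        raise ValueError("at least one expert is required")

    summed = [0.0] * len(experts[0])
    for logits in experts:
        for i, log_prob in enumerate(log_softmax(logits)):
            summed[i] += log_prob

    # Renormalize the summed log-probabilities into a distribution.
    return [math.exp(v) for v in log_softmax(summed)]


# Illustrative logits from three assumed stages over three classes.
stage_logits = [[2.0, 0.5, 0.1], [1.5, 1.0, 0.2], [0.3, 2.2, 0.1]]
full_prediction = product_of_experts(stage_logits)
prefix_prediction = product_of_experts(stage_logits, num_stages=1)
```

Because each expert contributes an additive term in log space, a stage can be dropped or added without changing how the remaining terms are computed, which is one way to read the abstract's claims about layer pruning and elastic depth.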

Authors

Jaepil Jeong

Institutions

Cognizant (United States)
