Backpropagation requires global state: a single loss propagated through all layers, activations retained across the entire network, and gradient synchronization across all parameters. We show that the Product of Experts (PoE) framework provides a principled foundation for local learning that eliminates this global dependency while maintaining competitive quality across four architecture families (MLPs, CNNs, ResNets, Transformers) and a range of scales (30M to 897M parameters). Three results structure the paper:

1. Theory and small-scale validation: PoE local learning matches or exceeds backpropagation on MNIST (98.00% vs 97.80%), closes to within 1.25% of backpropagation on CIFAR-10, and reaches a 12% gap on WikiText-2. On continual learning, PoE achieves 2.1x the performance of EWC.
2. Systems advantages: Eliminating global state enables lossless layer pruning, bubble-free pipeline parallelism (eliminating up to 99% of pipeline bubbles), elastic depth scaling, and parallel stage branching (stage-level Mixture of Experts).
3. Production scaling via clustered local learning: Grouping layers into multi-layer stages, with gradients flowing within each stage and detached between stages, achieves a 6.6% BPB gap on an 897M-parameter GPT trained on 5.2B tokens of ClimbMix-400B, with 1.33x training overhead. Despite the BPB gap, the PoE model demonstrates superior factual recall (4 wins vs 3 losses on 26 QA prompts) and enables adaptive compute via stage prefix pruning (25% of the compute answers 62.5% of queries correctly).
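The abstract does not spell out how per-stage predictions are combined, so the following is only a minimal sketch under stated assumptions: each stage is assumed to emit its own logits over the same classes, the PoE prediction is taken to be the normalized product of the per-stage distributions (equivalently, a softmax over summed logits), and stage prefix pruning is modeled as using only the first few stages. The names `poe_predict`, `stage_logits`, and `num_stages` are hypothetical and do not come from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def poe_predict(stage_logits, num_stages=None):
    """Combine per-stage logits as a product of experts.

    stage_logits: list of per-stage logit lists (assumed layout, one list per stage).
    num_stages:   if given, use only the first `num_stages` stages
                  (an assumed model of stage prefix pruning).
    """
    used = stage_logits if num_stages is None else stage_logits[:num_stages]
    if not used:
        raise ValueError("at least one stage is required")
    # A product of softmax distributions is proportional to exp(sum of logits),
    # so summing logits per class and renormalizing gives the PoE distribution.
    summed = [sum(per_class) for per_class in zip(*used)]
    return softmax(summed)


# Illustrative, made-up logits for three stages over three classes.
stage_logits = [
    [2.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
]

full = poe_predict(stage_logits)                   # all stages
prefix = poe_predict(stage_logits, num_stages=1)   # first stage only
```

Under these assumptions, dropping trailing stages leaves a valid (if less sharp) distribution, which is one way to read the claim that a prefix of the stages can answer a subset of queries with a fraction of the compute.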
Jaepil Jeong
Cognizant (United States)
Jaepil Jeong (Mon,) studied this question.
www.synapsesocial.com/papers/69df2c9ee4eeef8a2a6b1d68 — DOI: https://doi.org/10.5281/zenodo.19547653