What question did this study set out to answer?

This research aims to optimize memory usage during the training of large language models by using hybrid methods.

April 15, 2026 · Open Access

Noise Over Gradients: Hybrid Backpropagation and Forward-Only Zeroth-Order Optimization for Memory-Efficient LLM Pretraining

Key Points

Implemented hybrid pretraining: standard backpropagation for the front 25% of layers and zeroth-order (ZO) estimation for the remaining 75%.
Used the AdamW optimizer for the front layers and plain SGD with single-perturbation ZO estimation for the remaining layers.
Conducted experiments across 6 conditions spanning 4 model architectures and 3 datasets.
Matched or improved validation perplexity relative to full backpropagation in the early-training regime studied (~20M tokens, 20K steps).
Reduced VRAM usage by 28-43% during training.
Maintained similar wall-clock training speed (~0.9x) with the hybrid method.
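
The layer split described in these points can be sketched on a toy problem. The following is a minimal illustration, not the author's implementation: every function name is hypothetical, a separable quadratic stands in for the language-model loss, the analytic gradient stands in for backpropagation, and plain gradient descent replaces AdamW on the front layers for brevity. The two-sided finite difference, the rounding rule for the split, and the learning rates are also assumptions; the summary does not specify them.

```python
import random

def split_layers(num_layers, bp_fraction=0.25):
    """Return (front, rest) layer indices; only the front block gets exact gradients."""
    n_front = max(1, int(num_layers * bp_fraction))  # rounding rule is an assumption
    return list(range(n_front)), list(range(n_front, num_layers))

def loss(layers, targets):
    """Toy separable quadratic, a stand-in for the language-model loss."""
    return 0.5 * sum((w - t) ** 2
                     for layer, target in zip(layers, targets)
                     for w, t in zip(layer, target))

def hybrid_step(layers, targets, rest, rng, lr_front=0.1, lr_zo=0.01, eps=1e-3):
    """One update: exact-gradient step on front layers, k=1 ZO-SGD step on the rest."""
    # Draw one random direction z covering every ZO layer (single perturbation).
    z = {i: [rng.gauss(0.0, 1.0) for _ in layers[i]] for i in rest}

    def perturbed(sign):
        return [[w + sign * eps * zi for w, zi in zip(layer, z[i])] if i in z else layer
                for i, layer in enumerate(layers)]

    # Two forward evaluations give the directional derivative along z.
    proj = (loss(perturbed(+1), targets) - loss(perturbed(-1), targets)) / (2 * eps)

    updated = []
    for i, layer in enumerate(layers):
        if i in z:
            # ZO layers: plain SGD on the estimate proj * z, no optimizer state.
            updated.append([w - lr_zo * proj * zi for w, zi in zip(layer, z[i])])
        else:
            # Front layers: analytic gradient (w - t) stands in for backprop.
            updated.append([w - lr_front * (w - t) for w, t in zip(layer, targets[i])])
    return updated

def run_demo(num_layers=8, width=4, steps=300, seed=0):
    """Train the toy model and return (initial_loss, final_loss)."""
    rng = random.Random(seed)
    _, rest = split_layers(num_layers)
    layers = [[0.0] * width for _ in range(num_layers)]
    targets = [[1.0] * width for _ in range(num_layers)]
    initial = loss(layers, targets)
    for _ in range(steps):
        layers = hybrid_step(layers, targets, rest, rng)
    return initial, loss(layers, targets)
```

Calling `run_demo()` trains 2 front layers with exact gradients and 6 layers with the ZO estimate. The ZO layers keep no gradient buffer and no optimizer state, which is the source of the memory saving the key points describe; the toy says nothing about the reported 28-43% figure itself.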

Abstract

Training large language models is memory-bound: backpropagation stores all intermediate activations and maintains full-precision optimizer states, forcing reliance on expensive high-bandwidth memory (HBM). We propose Hybrid Pretraining: apply standard backpropagation with AdamW only to the front 25% of layers, and use single-perturbation zeroth-order (ZO) estimation with plain SGD on the remaining 75%. This removes optimizer state and parameter gradients for the majority of the model while remaining simple to implement in standard training loops. Across 6 distinct conditions spanning 4 architectures (Llama, GPT-2, Mamba, Qwen3-Next) and 3 datasets (WikiText-103, C4, FineWeb-Edu), Hybrid Pretraining matches or improves validation perplexity relative to full backpropagation in the early-training regime we study (~20M tokens, 20K steps), and saves 28-43% VRAM while running at ~0.9x wall-clock speed. A cosine-similarity analysis at initialization shows that the ZO gradient estimate is essentially orthogonal to the true backpropagation gradient (mean cos=0.000, max 0.001), consistent with a high-variance single-sample SPSA estimator; training nonetheless converges. We find that a perturbation count of k=1 is optimal, that plain SGD matches Adam on the ZO layers, and that results are insensitive to the perturbation magnitude ε across two orders of magnitude. This is a Zenodo preprint (v1) intended as a priority record. Code: https://github.com/2264K/hybrid-zo-pretrain (Apache-2.0). Paper: CC BY 4.0.
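
The near-zero cosine similarity reported in the abstract is what a single-sample SPSA estimator would be expected to produce in high dimension. The sketch below is an illustration under stated assumptions, not the paper's analysis code: the function names are hypothetical, the loss is a toy quadratic whose true gradient is known exactly, and the two-sided finite difference is an assumed form of the estimator. For this setup the estimate is a scalar multiple of the random direction z, so its cosine with the true gradient is the cosine between a random vector and a fixed one, which shrinks roughly as 1/sqrt(d) with the parameter count d.

```python
import math
import random

def spsa_gradient(f, theta, eps, rng):
    """Single-perturbation (k=1), two-sided SPSA estimate of the gradient of f at theta."""
    z = [rng.gauss(0.0, 1.0) for _ in theta]
    plus = [t + eps * zi for t, zi in zip(theta, z)]
    minus = [t - eps * zi for t, zi in zip(theta, z)]
    proj = (f(plus) - f(minus)) / (2.0 * eps)  # directional derivative along z
    return [proj * zi for zi in z]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def quadratic(theta):
    """Toy loss 0.5 * ||theta||^2, whose exact gradient is theta itself."""
    return 0.5 * sum(t * t for t in theta)

def mean_cosine(dim, trials=20, eps=1e-3, seed=0):
    """Average cosine between SPSA estimates and the true gradient in `dim` dimensions."""
    rng = random.Random(seed)
    theta = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    true_grad = list(theta)  # gradient of the toy quadratic
    total = 0.0
    for _ in range(trials):
        total += cosine(spsa_gradient(quadratic, theta, eps, rng), true_grad)
    return total / trials
```

Comparing `mean_cosine(4)` with `mean_cosine(2000)` shows the alignment collapsing as the dimension grows. At the parameter counts of a language model the same scaling would give values that round to 0.000, even though each update still has a non-negative projection onto the true gradient.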

Cite this study

Kim (Mon,) studied this question.

www.synapsesocial.com/papers/69df2b85e4eeef8a2a6b07bc — DOI: https://doi.org/10.5281/zenodo.19559267
