Training large language models is memory-bound: backpropagation stores all intermediate activations and maintains full-precision optimizer states, forcing reliance on expensive HBM. We propose Hybrid Pretraining: apply standard backpropagation with AdamW only to the front 25% of layers, and use single-perturbation zeroth-order (ZO) estimation with plain SGD on the remaining 75%. This removes optimizer state and parameter gradients for the majority of the model while remaining simple to implement in standard training loops. Across 6 distinct conditions spanning 4 architectures (Llama, GPT-2, Mamba, Qwen3-Next) and 3 datasets (WikiText-103, C4, FineWeb-Edu), Hybrid Pretraining matches or improves validation perplexity relative to full backpropagation in the early-training regime we study (~20M tokens, 20K steps), and saves 28-43% VRAM while running at ~0.9x wall-clock speed. A cosine-similarity analysis at initialization shows that the ZO gradient estimate is essentially orthogonal to the true backpropagation gradient (mean cos=0.000, max 0.001), consistent with a high-variance single-sample SPSA estimator; training nonetheless converges. A perturbation count of k=1 is optimal, plain SGD matches Adam for the ZO layers, and results are insensitive to the perturbation magnitude ε over two orders of magnitude. This is a Zenodo preprint (v1) intended as a priority record. Code: https://github.com/2264K/hybrid-zo-pretrain (Apache-2.0). Paper: CC BY 4.0.
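The update rule described in the abstract can be sketched on a toy problem. The following is a minimal illustration, not the authors' implementation: it assumes a separable quadratic loss in place of a language model, a two-sided SPSA estimate with Gaussian perturbations (the abstract does not say whether the estimator is one-sided or two-sided), and a plain gradient step standing in for backpropagation with AdamW on the front block. All names (`loss`, `true_grad`, `spsa_grad`, `cosine`, `hybrid_step`) and all hyperparameter values are hypothetical.

```python
import math
import random


def loss(w):
    # Toy quadratic loss, a stand-in for the model loss (assumption).
    return 0.5 * sum(x * x for x in w)


def true_grad(w):
    # Exact gradient of the toy loss; plays the role of backpropagation.
    return list(w)


def spsa_grad(loss_fn, w, eps, rng):
    # Single-perturbation (k=1) SPSA estimate from two loss evaluations.
    # Two-sided differences and Gaussian directions are assumptions.
    z = [rng.gauss(0.0, 1.0) for _ in w]
    w_plus = [wi + eps * zi for wi, zi in zip(w, z)]
    w_minus = [wi - eps * zi for wi, zi in zip(w, z)]
    scale = (loss_fn(w_plus) - loss_fn(w_minus)) / (2.0 * eps)
    return [scale * zi for zi in z]


def cosine(a, b):
    # Cosine similarity between two vectors of equal length.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def hybrid_step(front, back, lr_front, lr_back, eps, rng):
    # Front block: exact gradient with a plain gradient step
    # (the paper uses AdamW here; this sketch does not).
    g_front = true_grad(front)
    # Back block: ZO estimate with the front block held fixed, then plain SGD.
    # No gradient buffers or optimizer state are kept for this block.
    g_back = spsa_grad(lambda b: loss(front + b), back, eps, rng)
    new_front = [w - lr_front * g for w, g in zip(front, g_front)]
    new_back = [w - lr_back * g for w, g in zip(back, g_back)]
    return new_front, new_back


if __name__ == "__main__":
    rng = random.Random(0)
    front = [1.0] * 25  # 25% of the toy parameters
    back = [1.0] * 75   # remaining 75%
    print("initial loss:", loss(front + back))
    for _ in range(300):
        front, back = hybrid_step(front, back, 0.1, 0.005, 1e-3, rng)
    print("final loss:", loss(front + back))

    # Alignment of one SPSA estimate with the true gradient in 1000 dimensions.
    w = [1.0] * 1000
    print("cosine:", cosine(spsa_grad(loss, w, 1e-3, rng), true_grad(w)))
```

In this toy setting a single SPSA estimate is only weakly aligned with the true gradient, and the alignment shrinks as the dimension grows, yet repeated small SGD steps on the ZO block still reduce the loss. This mirrors the qualitative picture in the abstract but says nothing about the reported perplexity or memory numbers.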
Kim (Mon,) studied this question.
www.synapsesocial.com/papers/69df2b85e4eeef8a2a6b07bc — DOI: https://doi.org/10.5281/zenodo.19559267