What question did this study set out to answer?

The aim is to understand catastrophic forgetting in language models by exploring its relationship with scale and batch size.

April 10, 2026 · Open Access

The Geometry of Catastrophic Forgetting: Scale and Batch Size Reveal a Two-Dimensional Behavioral Shift in Language Models

Key Points

Aims to understand catastrophic forgetting in language models by examining how its magnitude depends on model scale and batch size.
Runs controlled experiments across model scales (0.5B to 14B parameters) and batch sizes (1 to 16).
Measures the magnitude of catastrophic forgetting during fine-tuning.
Analyzes forgetting magnitude as a two-dimensional surface over model scale and batch size.
Forgetting magnitude decreases monotonically as model scale increases.
At larger scales, forgetting follows a V-shaped curve with respect to batch size.
Between 3B and 7B parameters, the relationship between batch size and forgetting changes character. This range coincides with scales where prior work has reported emergent abilities, which suggests a possible connection rather than establishing one.
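
As a rough illustration of what "forgetting magnitude" could mean in practice, the sketch below computes the relative increase in loss on a held-out set from the original domain, comparing the model before and after fine-tuning. The metric the study actually uses is not stated on this page; the function name `forgetting_magnitude` and the relative-loss definition are assumptions made for illustration only.

```python
def forgetting_magnitude(loss_before: float, loss_after: float) -> float:
    """Relative increase in original-domain loss after fine-tuning.

    Assumed definition (not taken from the study):
        (loss_after - loss_before) / loss_before
    Positive values mean the model got worse on the original domain.
    """
    if loss_before <= 0:
        raise ValueError("loss_before must be positive")
    return (loss_after - loss_before) / loss_before
```

A relative measure is used here so that values stay comparable across models whose baseline losses differ, which matters when comparing across scales. An absolute difference would be an equally reasonable choice.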

Abstract

Catastrophic forgetting — the degradation of previously learned knowledge during fine-tuning on new domains — is universally treated as a problem to minimize. We propose treating its magnitude as a structured signal instead. Through controlled experiments across model scales (0.5B to 14B parameters) and batch sizes (1 to 16), we discover that forgetting magnitude forms a two-dimensional surface with non-trivial geometry: monotonically decreasing with scale, and exhibiting a V-shaped curve with respect to batch size at larger scales. A qualitative behavioral shift occurs between 3B and 7B parameters, where the relationship between batch size and forgetting changes character. This shift coincides with the scale range where emergent abilities have been empirically observed in prior work, suggesting a possible connection worth further investigation. Our findings imply that optimal batch size is a predictable function of model scale, potentially replacing expensive grid search with a principled selection rule.
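
To make the idea of a forgetting surface more concrete, here is a minimal sketch of how measurements over scale and batch size might be inspected. It assumes a hypothetical nested-dictionary layout, `surface[scale][batch_size] = forgetting`, and the helper names `is_v_shaped` and `best_batch_size` are illustrative, not taken from the paper. It picks the batch size with the lowest measured forgetting at each scale and checks whether a curve has an interior minimum.

```python
from typing import Dict

# Hypothetical layout: surface[scale_in_billions][batch_size] = forgetting magnitude.
Surface = Dict[float, Dict[int, float]]


def is_v_shaped(curve: Dict[int, float]) -> bool:
    """True if forgetting strictly falls to an interior minimum, then strictly rises."""
    batches = sorted(curve)
    values = [curve[b] for b in batches]
    if len(values) < 3:
        return False  # need at least three points to see a V
    i = values.index(min(values))
    if i == 0 or i == len(values) - 1:
        return False  # minimum at an edge: monotone, not V-shaped
    falling = all(values[k] > values[k + 1] for k in range(i))
    rising = all(values[k] < values[k + 1] for k in range(i, len(values) - 1))
    return falling and rising


def best_batch_size(surface: Surface) -> Dict[float, int]:
    """For each scale, return the batch size with the lowest measured forgetting."""
    return {scale: min(curve, key=curve.get) for scale, curve in surface.items()}
```

The selection rule the abstract alludes to would replace this lookup with a function of scale alone. The lookup only shows what such a rule would need to reproduce.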

Authors

Yi Zhang
