Catastrophic forgetting — the degradation of previously learned knowledge during fine-tuning on new domains — is universally treated as a problem to minimize. We propose treating its magnitude as a structured signal instead. Through controlled experiments across model scales (0.5B to 14B parameters) and batch sizes (1 to 16), we discover that forgetting magnitude forms a two-dimensional surface with non-trivial geometry: monotonically decreasing with scale, and exhibiting a V-shaped curve with respect to batch size at larger scales. A qualitative behavioral shift occurs between 3B and 7B parameters, where the relationship between batch size and forgetting changes character. This shift coincides with the scale range where emergent abilities have been empirically observed in prior work, suggesting a possible connection worth further investigation. Our findings imply that optimal batch size is a predictable function of model scale, potentially replacing expensive grid search with a principled selection rule.
Yi Zhang
Yi Zhang studied this question.
www.synapsesocial.com/papers/69d8940c6c1944d70ce050b5 — DOI: https://doi.org/10.5281/zenodo.19446076