Post-deployment unlearning in large language models should remove targeted knowledge while preserving nearby capabilities, maintaining generation quality, and remaining stable under low-bit post-training quantization. Standard gradient-based unlearning exhibits three failure modes: neighbor collapse, a token-generation tradeoff that couples forgetting to degenerate generation, and partial regression of forgetting under INT4 quantization. We present Quantization-Robust Orthogonal Unlearning (Q-ROU), which addresses these failures jointly with (i) Active Retention (AR), a KL constraint anchoring neighbor outputs to a frozen reference; (ii) a bounded KL-to-uniform forget objective with layer-localized updates (SLUG); and (iii) QuantNoise and structural regularizers that steer optimization toward quantization-stable solutions. On the 28-probe 3B multi-entity stress test, the non-AR baselines GA, GradDiff, and RepBend collapse neighbor retention (0/8), whereas Q-ROU achieves 27/28 in FP16 and 28/28 in INT4. Depth probing and adversarial extraction audits, validated across 3B and 8B models on TOFU and publicly available personal-fact settings, indicate that the removal is consistent with representational-level erasure rather than surface-level keyword suppression.
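The two KL terms named above can be illustrated with a minimal sketch. It is not the Q-ROU implementation: the KL directions, the function names, and the `retain_weight` parameter are assumptions made for illustration, and the SLUG layer localization and QuantNoise regularizers are omitted. The forget term is written as KL(model || uniform), which equals log V minus the entropy of the model distribution and is therefore bounded by log V for a vocabulary of size V.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def retention_loss(ref_logits, model_logits):
    """Assumed form of the retention term: KL(reference || model)
    on a neighbor prompt, with the reference model kept frozen."""
    return kl_divergence(softmax(ref_logits), softmax(model_logits))

def forget_loss(model_logits):
    """Assumed form of the forget term: KL(model || uniform).
    It is 0 for a uniform prediction and at most log(V)."""
    p = softmax(model_logits)
    uniform = [1.0 / len(p)] * len(p)
    return kl_divergence(p, uniform)

def combined_loss(forget_logits, neighbor_ref_logits,
                  neighbor_model_logits, retain_weight=1.0):
    """Hypothetical weighted sum of the two terms (weight is illustrative)."""
    return (forget_loss(forget_logits)
            + retain_weight * retention_loss(neighbor_ref_logits,
                                             neighbor_model_logits))
```

Under these assumptions, minimizing `forget_loss` pushes the prediction on a forget prompt toward the uniform distribution, while `retention_loss` is zero only when the updated model reproduces the frozen reference on neighbor prompts. Because the forget term cannot exceed log V, it cannot grow without limit the way an unbounded gradient-ascent objective can.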
Sode et al. (Mon,) studied this question.