Low-bitrate KV-cache compression under fixed block codecs requires the quantization blocks to capture attention-relevant structure. We show that the expected attention-weighted distortion under a block codec depends only on the block-diagonal of the transformed query covariance, and derive three consequences: (1) dense mixing transforms (random rotation, Hadamard) push the block-compatibility index (BCI) toward its worst-case value 1 − m/d, destroying block-local structure; (2) a fixed offline permutation derived from query statistics (ProfilePerm) directly minimizes BCI by packing coupled dimensions into shared blocks; (3) alignment improves distortion when the BCI reduction exceeds the codec’s rate-dependent noise floor, which decreases exponentially with bitrate. Across 51 layer-level slices on four model families, dense mixing degrades 1 bit/dim logit MSE by up to 60% and is harmful on average at every tested bitrate, while ProfilePerm helps most reliably at 4 bit/dim and selectively in lower-rate, higher-anisotropy regimes. For practical layer ranking we use a block-outlier compatibility index, BOCI, which combines relative BCI reduction with the gain in activation outlier concentration measured by positive excess kurtosis under the same permutation. The same signal extends to recurrent-state compression with gains up to 62.9%. On the systems side, permutation is O(d) versus O(d²) for dense rotation, and the Triton block-score kernel achieves 3.1 to 6.7× speedup
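To make the block-compatibility idea concrete, here is a minimal NumPy sketch. It rests on stated assumptions: the abstract does not define BCI, so the function `bci` below uses a stand-in definition (the fraction of squared covariance energy that falls outside the m × m diagonal blocks), chosen only because it evaluates to 1 − m/d when energy is spread uniformly, which matches the worst case quoted above. The toy covariance and the hand-picked permutation are hypothetical and are not the ProfilePerm procedure.

```python
import numpy as np

def bci(cov: np.ndarray, m: int) -> float:
    """Assumed stand-in for BCI: share of squared covariance energy
    lying outside the m x m diagonal blocks (0 = fully block-local)."""
    d = cov.shape[0]
    assert d % m == 0, "dimension must be a multiple of the block size"
    sq = cov ** 2
    block_id = np.arange(d) // m
    in_block = block_id[:, None] == block_id[None, :]  # True inside diagonal blocks
    return float(1.0 - sq[in_block].sum() / sq.sum())

def permute(cov: np.ndarray, perm: np.ndarray) -> np.ndarray:
    """Covariance of the permuted vector x[perm]."""
    return cov[np.ix_(perm, perm)]

d, m = 8, 2

# Toy query covariance: dimension i is coupled with dimension i + d/2,
# so every coupling straddles two different blocks.
cov = np.eye(d)
for i in range(d // 2):
    cov[i, i + d // 2] = cov[i + d // 2, i] = 0.9

# Hand-picked permutation that places each coupled pair in one block.
perm = np.array([j for pair in zip(range(d // 2), range(d // 2, d)) for j in pair])

before = bci(cov, m)
after = bci(permute(cov, perm), m)
worst = bci(np.ones((d, d)), m)  # uniformly spread energy

print(f"before={before:.4f} after={after:.4f} worst={worst:.4f} (1 - m/d = {1 - m / d})")
```

In this toy case the permutation moves all off-diagonal energy into the diagonal blocks, while the uniformly spread matrix lands at 1 − m/d. Applying a permutation to a vector is a single gather (`x[perm]`), which is the O(d) cost the abstract contrasts with the O(d²) of a dense rotation.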
GAURAV SAINI
GAURAV SAINI (Wed,) studied this question.
www.synapsesocial.com/papers/69d896566c1944d70ce07a65 — DOI: https://doi.org/10.5281/zenodo.19470100