This technical report presents a theoretical diagnostic framework for Delta-PQ, a compression strategy for Large Language Model (LLM) Key-Value (KV) caches that combines temporal delta encoding with product quantization (PQ). Delta encoding can significantly reduce quantization distortion by exploiting temporal coherence in activations, but it introduces three risks: covariance-domination failure, source-shape mismatch, and closed-loop instability. The report formalizes these risks as observable diagnostic quantities, giving a rigorous mathematical foundation for monitoring the health of compressed KV caches during inference.

Key Contributions:

Closed-Loop Stability Analysis: Derives a conditional covariance-contraction inequality that serves as a stability certificate for delta-encoded feedback loops.

Dimension-Aware Shape Audit: Proposes a multi-tiered diagnostic flow for assessing distributional shape deviation (η_m) using nonparametric estimators and Generalized Gaussian Distribution (GGD) proxies.

Operational Risk Framework: Defines a "three-zone" (Green/Yellow/Red) monitoring policy based on a real-time risk level (μ_{m,t}), with event-triggered key-frame reset semantics to prevent error explosion in long-context decoding.

Scope Definition: Provides specific treatment for Value caches and identifies the structural challenges that Rotary Positional Embeddings (RoPE) pose for Key caches.

Current Status (v7): This version is a complete theoretical framework. It includes the full mathematical derivations for the stability certificates and the proposed operational checklist for deployment. Note: Empirical validation against downstream task metrics (e.g., perplexity on production-scale models) is the primary next step and is ongoing.

Target Audience: Researchers and engineers working on LLM inference optimization, quantization, and efficient systems for long-context language modeling.
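To make the closed-loop idea concrete, the following is a minimal Python sketch of delta encoding with a zone-based key-frame reset. It is an illustration under stated assumptions, not the report's method: the uniform scalar quantizer stands in for a real PQ codebook, the risk proxy (delta energy divided by signal energy) is a hypothetical substitute for μ_{m,t}, and the thresholds 0.5 and 1.0 are arbitrary placeholders. All function names are invented for this example.

```python
def quantize(vec, step):
    """Uniform scalar quantizer; a simple stand-in for a PQ codebook lookup."""
    return [step * round(v / step) for v in vec]


def energy(vec):
    """Squared Euclidean norm of a vector."""
    return sum(v * v for v in vec)


def zone(risk, yellow=0.5, red=1.0):
    """Map a risk level to a zone. Thresholds are hypothetical placeholders."""
    if risk >= red:
        return "red"
    if risk >= yellow:
        return "yellow"
    return "green"


def encode_stream(vectors, step=0.05):
    """Closed-loop delta encoding with an event-triggered key-frame reset.

    The delta is taken against the previous *reconstruction*, not the previous
    input, so the encoder tracks exactly the state a decoder would hold.
    Returns a list of (frame_kind, zone, reconstruction) tuples.
    """
    recon = None
    out = []
    for x in vectors:
        if recon is None:
            # The first vector is stored as a key frame (full precision here).
            recon = list(x)
            out.append(("key", "green", list(recon)))
            continue
        delta = [a - b for a, b in zip(x, recon)]
        # Hypothetical risk proxy: relative energy of the delta.
        risk = energy(delta) / max(energy(x), 1e-12)
        z = zone(risk)
        if z == "red":
            # Red zone: discard the delta and reset with a fresh key frame.
            recon = list(x)
            out.append(("key", z, list(recon)))
        else:
            # Green or yellow zone: quantize the delta and update the loop state.
            q = quantize(delta, step)
            recon = [r + d for r, d in zip(recon, q)]
            out.append(("delta", z, list(recon)))
    return out


frames = encode_stream([[1.0, 0.0], [1.02, 0.01], [-1.0, 0.5]])
for kind, z, recon in frames:
    print(kind, z, recon)
```

In this sketch the second vector is close to the first, so it is coded as a delta, while the third departs sharply and triggers a key-frame reset. The yellow zone only labels the frame here; a real policy would presumably attach some intermediate action to it.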
Han Bo Jun (Wed,) studied this question.
www.synapsesocial.com/papers/69d896406c1944d70ce07924 — DOI: https://doi.org/10.5281/zenodo.19468551
Han Bo Jun