What question did this study set out to answer?

The aim is to develop a diagnostic framework for assessing risks associated with Delta-PQ compression in LLM KV caches.

April 10, 2026 · Open Access

Delta-PQ for LLM KV-Cache Quantization: Conditional Risk Diagnostics under Empirical Covariance Envelopes

Key Points

Introduced a theoretical framework for Delta-PQ including stability analysis and risk diagnostics.
Developed a covariance-contraction inequality for assessing closed-loop stability.
Proposed a three-zone monitoring policy to manage operational risks in KV caches.
Established mathematical foundations for monitoring risks in compressed KV caches during inference.
Identified specific challenges posed by Rotary Positional Embeddings in KV caches.
Outlined an ongoing empirical validation process using perplexity metrics.

Abstract

This technical report presents a theoretical diagnostic framework for Delta-PQ, a compression strategy for Large Language Model (LLM) Key-Value (KV) caches that combines temporal delta encoding with product quantization (PQ). While delta encoding can significantly reduce quantization distortion by exploiting temporal coherence in activations, it introduces risks such as covariance-domination failure, source-shape mismatch, and closed-loop instability. The report formalizes these risks into observable diagnostic quantities, providing a rigorous mathematical foundation for monitoring the health of compressed KV caches during inference.

Key Contributions:

Closed-Loop Stability Analysis: Derives a conditional covariance-contraction inequality that provides a stability certificate for delta-encoded feedback loops.
Dimension-Aware Shape Audit: Proposes a multi-tiered diagnostic flow for assessing distributional shape deviation ($\eta_m$) using nonparametric estimators and Generalized Gaussian Distribution (GGD) proxies.
Operational Risk Framework: Defines a "three-zone" (Green/Yellow/Red) monitoring policy based on a real-time risk level ($\mu_{m,t}$), featuring event-triggered key-frame reset semantics to prevent error explosion in long-context decoding.
Scope Definition: Provides specific treatment for Value Caches and identifies the structural challenges posed by Rotary Positional Embeddings (RoPE) in Key Caches.

Current Status (v7): This version constitutes a complete theoretical framework. It includes the full mathematical derivations for stability certificates and the proposed operational checklist for deployment. Note: Empirical validation against downstream task metrics (e.g., perplexity on production-scale models) is identified as the primary next step and is currently ongoing.

Target Audience: Researchers and engineers working on LLM inference optimization, quantization, and efficient systems for long-context language modeling.
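To make the moving parts named in the abstract concrete, the following is a minimal sketch of closed-loop delta encoding with a toy product quantizer and a three-zone monitor. It is not the report's method: the codebook (a fixed grid rather than a trained one), the risk proxy (delta energy divided by signal energy, standing in for $\mu_{m,t}$, whose definition is not given here), the thresholds, and all function names are assumptions chosen for illustration.

```python
import itertools

SUB_DIM = 2   # hypothetical sub-vector width
STEP = 0.25   # hypothetical codeword spacing
# Toy sub-codebook shared by all subspaces: a 3x3 grid, not trained by k-means.
CODEBOOK = [tuple(STEP * k for k in ks)
            for ks in itertools.product((-1, 0, 1), repeat=SUB_DIM)]


def pq_encode(vec):
    """Product quantization: nearest-codeword index for each sub-vector."""
    codes = []
    for start in range(0, len(vec), SUB_DIM):
        sub = vec[start:start + SUB_DIM]
        dists = [sum((a - b) ** 2 for a, b in zip(sub, cw)) for cw in CODEBOOK]
        codes.append(dists.index(min(dists)))
    return codes


def pq_decode(codes):
    """Concatenate the codewords selected by each index."""
    out = []
    for c in codes:
        out.extend(CODEBOOK[c])
    return out


def zone(risk, yellow=0.5, red=1.0):
    """Map a risk value to a zone; both thresholds are assumptions."""
    if risk >= red:
        return "red"
    if risk >= yellow:
        return "yellow"
    return "green"


def delta_pq_stream(vectors):
    """Closed-loop delta coding: deltas are taken against the decoder's own
    reconstruction, so quantization error does not accumulate across steps."""
    recon = None
    frames = []
    for x in vectors:
        if recon is None:
            recon = list(x)                      # first token is a key frame
            frames.append(("key", "green", recon))
            continue
        delta = [a - b for a, b in zip(x, recon)]
        energy_x = sum(a * a for a in x) or 1e-12
        risk = sum(d * d for d in delta) / energy_x  # assumed risk proxy
        z = zone(risk)
        if z == "red":
            recon = list(x)                      # event-triggered key-frame reset
            frames.append(("key", z, recon))
        else:
            q = pq_decode(pq_encode(delta))
            recon = [r + d for r, d in zip(recon, q)]
            frames.append(("delta", z, recon))
    return frames
```

In this sketch a yellow frame is still delta-encoded and only flagged, while a red frame stores the vector directly and restarts the loop. A real policy would likely act on a smoothed statistic rather than a single-step ratio.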

Cite this study

Han Bo Jun studied this question.

www.synapsesocial.com/papers/69d896406c1944d70ce07924 — DOI: https://doi.org/10.5281/zenodo.19468551
