April 22, 2026 | Open Access

Entropy-Guided KV Cache Summarization via Low-Rank Attention Reconstruction

Key Points

Aims to manage KV cache tokens so that memory use falls without the reconstruction errors that eviction causes.
Proposes the SRC (Selection-Reconstruction-Compression) pipeline, which summarizes tokens rather than discarding them.
Reconstructs low-salience tokens with ordinary least squares (OLS) against the current query matrix.
Compresses the reconstructed tokens into compact centroid tokens using singular value decomposition (SVD).
HAE achieves up to 3× lower reconstruction error than Top-K eviction.
That result is reported at a 30% token keep ratio, with lower total memory usage.
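
To make the Top-K baseline and the notion of reconstruction error concrete, here is a minimal NumPy sketch. It is not the paper's code. It assumes that salience is the mean attention a cached token receives from the current queries, and that reconstruction error is the relative Frobenius error of the attention output; the function names (topk_evict, relative_error) and the random test data are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_output(Q, K, V):
    """Scaled dot-product attention of queries Q over cached keys K and values V."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores, axis=-1) @ V

def topk_evict(Q, K, V, keep_ratio=0.3):
    """Keep the keep_ratio fraction of tokens with the highest salience.

    Assumption: salience = mean attention weight a token receives
    from the current queries. Everything else is discarded.
    """
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    salience = A.mean(axis=0)
    n_keep = max(1, int(round(keep_ratio * K.shape[0])))
    keep = np.sort(np.argsort(salience)[-n_keep:])  # preserve original token order
    return K[keep], V[keep], keep

def relative_error(full, approx):
    """Assumed metric: relative Frobenius error of the attention output."""
    return np.linalg.norm(full - approx) / np.linalg.norm(full)

# Hypothetical random data, only to exercise the functions.
rng = np.random.default_rng(0)
n_tokens, n_queries, d = 100, 8, 16
Q = rng.standard_normal((n_queries, d))
K = rng.standard_normal((n_tokens, d))
V = rng.standard_normal((n_tokens, d))

K_kept, V_kept, keep = topk_evict(Q, K, V, keep_ratio=0.3)
err = relative_error(attention_output(Q, K, V),
                     attention_output(Q, K_kept, V_kept))
print(f"kept {len(keep)} of {n_tokens} tokens, relative error {err:.3f}")
```

Whatever the evicted tokens contributed to the output is lost entirely in this baseline, which is the gap that summarization is meant to close.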

Abstract

As LLMs scale toward million-token contexts, KV cache memory becomes the dominant bottleneck. Existing pruning methods such as Top-K eviction discard tokens based on current attention scores, implicitly assuming that those scores indicate which tokens will matter later; that assumption leads to unpredictable reconstruction failures at structurally important positions. This paper proposes the SRC (Selection-Reconstruction-Compression) pipeline, which summarizes tokens rather than discarding them. Low-salience, high-entropy tokens are routed to a Recycle Bin, reconstructed via ordinary least squares (OLS) against the current query matrix, and compressed into compact centroid tokens using singular value decomposition (SVD). Experiments show that HAE achieves up to 3× lower reconstruction error than Top-K at a 30% keep ratio while using less total memory.
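
The three stages can be sketched in NumPy as follows. This is one possible reading of the abstract, not the author's implementation, and every name in it (src_summarize, summarized_output, rank) is hypothetical. It makes three assumptions. Selection uses salience only and omits the entropy criterion, so every token that is not kept lands in the Recycle Bin. Reconstruction fits a linear map W by OLS so that Q @ W approximates the part of the attention output that is lost when the bin is dropped. Compression truncates the SVD of W to a few rank-one terms, each read as a centroid key and value pair that is applied linearly, without a softmax.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(Q, K, V):
    """Scaled dot-product attention of queries Q over keys K and values V."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1) @ V

def src_summarize(Q, K, V, keep_ratio=0.3, rank=4):
    # Selection (assumed, simplified): keep the most-attended tokens;
    # the remaining tokens form the Recycle Bin.
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    salience = A.mean(axis=0)
    n_keep = max(1, int(round(keep_ratio * K.shape[0])))
    keep = np.sort(np.argsort(salience)[-n_keep:])

    # Reconstruction (assumed): OLS against the current query matrix.
    # The target is the output lost by dropping the bin tokens.
    lost = attend(Q, K, V) - attend(Q, K[keep], V[keep])
    W, *_ = np.linalg.lstsq(Q, lost, rcond=None)   # W has shape (d, d_v)

    # Compression (assumed): a truncated SVD of W gives `rank` centroid pairs.
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    centroid_keys = (U[:, :rank] * s[:rank]).T     # shape (rank, d)
    centroid_values = Vt[:rank]                    # shape (rank, d_v)
    return keep, centroid_keys, centroid_values

def summarized_output(Q, K_kept, V_kept, centroid_keys, centroid_values):
    # Softmax attention over the kept tokens plus a linear centroid correction.
    return attend(Q, K_kept, V_kept) + (Q @ centroid_keys.T) @ centroid_values

# Hypothetical random data, only to exercise the functions.
rng = np.random.default_rng(0)
n_tokens, n_queries, d = 200, 64, 16
Q = rng.standard_normal((n_queries, d))
K = rng.standard_normal((n_tokens, d))
V = rng.standard_normal((n_tokens, d))

keep, ck, cv = src_summarize(Q, K, V, keep_ratio=0.3, rank=4)
full = attend(Q, K, V)
topk_err = np.linalg.norm(full - attend(Q, K[keep], V[keep])) / np.linalg.norm(full)
src_err = np.linalg.norm(full - summarized_output(Q, K[keep], V[keep], ck, cv)) / np.linalg.norm(full)
print(f"Top-K error {topk_err:.3f}, summarized error {src_err:.3f}")
```

In this sketch the summary costs 2 × rank extra vectors on top of the kept tokens, regardless of how many tokens were binned. The error here is measured on the same queries used for the fit, so it is an optimistic estimate and says nothing about the 3× figure reported in the paper.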

Authors

Jayanth Chandra
