As LLMs scale toward million-token contexts, KV cache memory becomes the dominant bottleneck. Existing pruning methods like Top-K eviction discard tokens based on current attention scores, on the assumption that those scores reflect a token's lasting importance; this leads to unpredictable reconstruction failures at structurally important positions. This paper proposes the SRC (Selection-Reconstruction-Compression) pipeline, which summarizes tokens rather than discarding them. Low-salience, high-entropy tokens are routed to a Recycle Bin, reconstructed via ordinary least squares (OLS) against the current query matrix, and compressed into compact centroid tokens using SVD. Experiments show HAE achieves up to 3× lower reconstruction error than Top-K at a 30% keep ratio while using less total memory.
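The abstract names the three stages but not their details, so the following is only a minimal NumPy sketch of one plausible reading, not the paper's algorithm. Everything specific in it is an assumption made for illustration: the function names `src_compress` and `output_error`, the use of mean attention received as the salience score (the entropy criterion is omitted), the choice of scaled right singular vectors of the binned keys as centroid keys, and the OLS target (the attention output of the full cache under the current queries).

```python
import numpy as np


def softmax(x):
    """Row-wise softmax with the usual max-subtraction for stability."""
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)


def src_compress(Q, K, V, keep_ratio=0.3, rank=2):
    """Hypothetical Selection-Reconstruction-Compression sketch.

    Q: (m, d) current queries, K: (n, d) cached keys, V: (n, d_v) cached values.
    Assumes keep_ratio < 1 so that the recycle bin is non-empty.
    Returns the compressed keys, compressed values, and kept token indices.
    """
    n, d = K.shape
    A = softmax(Q @ K.T / np.sqrt(d))      # (m, n) attention over the full cache
    target = A @ V                         # full-cache output we try to preserve

    # Selection (assumption): salience = mean attention a token receives.
    salience = A.mean(axis=0)
    n_keep = max(1, int(round(keep_ratio * n)))
    order = np.argsort(-salience)
    keep = np.sort(order[:n_keep])         # retained verbatim
    bin_idx = np.sort(order[n_keep:])      # routed to the "recycle bin"

    # Compression (assumption): top right singular vectors of the binned keys,
    # scaled by singular value / sqrt(bin size), serve as centroid keys.
    _, S, Vt = np.linalg.svd(K[bin_idx], full_matrices=False)
    r = min(rank, len(S))
    K_c = Vt[:r] * (S[:r, None] / np.sqrt(len(bin_idx)))

    # Reconstruction (assumption): solve an OLS problem for centroid values so
    # that the compressed cache reproduces the full output for these queries.
    K_new = np.vstack([K[keep], K_c])
    A_new = softmax(Q @ K_new.T / np.sqrt(d))
    residual = target - A_new[:, :n_keep] @ V[keep]
    V_c, *_ = np.linalg.lstsq(A_new[:, n_keep:], residual, rcond=None)

    return K_new, np.vstack([V[keep], V_c]), keep


def output_error(Q, K_new, V_new, target):
    """Relative Frobenius error of the compressed cache's attention output."""
    d = Q.shape[1]
    out = softmax(Q @ K_new.T / np.sqrt(d)) @ V_new
    return np.linalg.norm(out - target) / np.linalg.norm(target)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q = rng.standard_normal((32, 16))
    K = rng.standard_normal((64, 16))
    V = rng.standard_normal((64, 16))
    target = softmax(Q @ K.T / np.sqrt(16)) @ V

    K_new, V_new, keep = src_compress(Q, K, V, keep_ratio=0.3, rank=2)
    print("cache rows:", K.shape[0], "->", K_new.shape[0])
    print("relative output error:", output_error(Q, K_new, V_new, target))
```

In this sketch the only guarantee is the least-squares one: for the given queries, the fitted centroid values can do no worse than leaving the centroid values at zero. It makes no claim about how the result compares with Top-K eviction; that comparison is what the paper's experiments report.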
Jayanth Chandra
Jayanth Chandra studied this question.
www.synapsesocial.com/papers/69e866896e0dea528ddeaeed — DOI: https://doi.org/10.5281/zenodo.19657329