March 3, 2026 | Open Access

Multi-tier dynamic storage of KV cache for LLM inference under resource-constrained conditions

Key Points

Multi-Tier Dynamic Storage (MTDS) improves LLM inference efficiency, reducing first-token latency by more than 25%.
MTDS offloads key-value (KV) caches from limited GPU VRAM to a hierarchical storage system, reducing memory and computation overhead on the GPU.
A selective KV cache reuse mechanism raises cache hit rates by up to 20%.
A dynamic storage access control scheme and an adaptive hierarchical eviction strategy address the bandwidth contention and capacity overhead that multi-tier storage introduces.

Abstract

The scale of large language models (LLMs) continues to grow in response to increasing demands for intelligent applications. When these large models and their intermediate results, such as key-value (KV) caches, are deployed in resource-constrained environments like edge inference scenarios, they impose substantial pressure on computational and storage resources, resulting in significant performance degradation and storage inefficiency. To address the problem, this paper proposes a novel Multi-Tier Dynamic Storage (MTDS) framework that offloads KV caches from limited GPU VRAM to a hierarchical storage system, effectively reducing both memory and computation overhead on the GPU. By introducing a selective KV cache reuse mechanism, MTDS achieves notable improvements in inference performance. We further develop a dynamic storage access control scheme and an adaptive hierarchical eviction strategy to address the challenges of bandwidth contention and capacity overhead introduced by multi-tier storage under limited resources. These techniques significantly alleviate performance bottlenecks and reduce resource waste in edge inference servers. Experimental results demonstrate that MTDS improves LLM inference efficiency, reduces first-token latency by more than 25%, and increases multi-tier active storage cache hit rate by up to 20%.
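
The tiered offloading idea described in the abstract can be illustrated with a toy cache. This is a minimal sketch under stated assumptions, not the MTDS implementation: the class name `TieredKVCache`, the entry-count capacities, and the plain LRU demotion rule are all hypothetical stand-ins, and the paper's adaptive hierarchical eviction, selective reuse, and bandwidth-aware access control are not modeled here.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy multi-tier cache: tier 0 is the fastest (e.g. GPU VRAM), the last is the slowest.

    Hypothetical illustration only; plain LRU stands in for the paper's
    adaptive hierarchical eviction strategy.
    """

    def __init__(self, capacities):
        # One LRU-ordered dict per tier; capacities are entry counts (an assumption).
        self.capacities = list(capacities)
        self.tiers = [OrderedDict() for _ in self.capacities]
        self.hits = [0] * len(self.capacities)
        self.misses = 0

    def _insert(self, level, key, value):
        # Place an entry in `level`; on overflow, demote that tier's LRU entry downward.
        tier = self.tiers[level]
        tier[key] = value
        tier.move_to_end(key)
        if len(tier) > self.capacities[level]:
            old_key, old_value = tier.popitem(last=False)
            if level + 1 < len(self.tiers):
                self._insert(level + 1, old_key, old_value)
            # Entries pushed out of the last tier are dropped.

    def put(self, key, value):
        # Remove any stale copy so each key lives in exactly one tier.
        for tier in self.tiers:
            tier.pop(key, None)
        self._insert(0, key, value)

    def get(self, key):
        # Search fastest tier first; a hit in a slower tier is promoted to tier 0.
        for level, tier in enumerate(self.tiers):
            if key in tier:
                value = tier.pop(key)
                self.hits[level] += 1
                self._insert(0, key, value)
                return value
        self.misses += 1
        return None

    def locate(self, key):
        # Return the index of the tier currently holding `key`, or None.
        for level, tier in enumerate(self.tiers):
            if key in tier:
                return level
        return None
```

With capacities such as `TieredKVCache([2, 2, 2])`, inserting more than two entries pushes the least recently used ones into the slower tiers rather than discarding them, and a later `get` on a demoted key counts as a lower-tier hit and moves the entry back to tier 0. The per-tier `hits` counters are the kind of measurement a cache hit rate would be computed from.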

Cite This Study

Wang et al. studied this question.

synapsesocial.com/papers/69a75a79c6e9836116a20555 https://doi.org/10.1007/s40747-025-02200-4
