The scale of large language models (LLMs) continues to grow in response to increasing demands for intelligent applications. When these large models and their intermediate results, such as key-value (KV) caches, are deployed in resource-constrained environments like edge inference scenarios, they impose substantial pressure on computational and storage resources, resulting in significant performance degradation and storage inefficiency. To address this problem, this paper proposes a novel Multi-Tier Dynamic Storage (MTDS) framework that offloads KV caches from limited GPU VRAM to a hierarchical storage system, reducing both memory and computation overhead on the GPU. By introducing a selective KV cache reuse mechanism, MTDS achieves notable improvements in inference performance. We further develop a dynamic storage access control scheme and an adaptive hierarchical eviction strategy to address the bandwidth contention and capacity overhead that multi-tier storage introduces under limited resources. These techniques significantly alleviate performance bottlenecks and reduce resource waste in edge inference servers. Experimental results demonstrate that MTDS improves LLM inference efficiency, reduces first-token latency by more than 25%, and increases the multi-tier active storage cache hit rate by up to 20%.
Wang et al. studied this question.
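The abstract does not spell out how entries move between tiers, so the following is only a minimal sketch of the general idea of offloading cache entries down a storage hierarchy. It assumes a plain least-recently-used (LRU) policy with demotion on overflow and promotion on hit, which is a stand-in and not the adaptive hierarchical eviction strategy the paper proposes. The class `TieredKVCache`, its methods, and the per-tier entry-count capacities are all hypothetical names chosen for illustration.

```python
from collections import OrderedDict


class TieredKVCache:
    """Toy multi-tier cache (hypothetical). Tier 0 stands in for the fastest
    storage (e.g. GPU VRAM); higher indices stand in for slower tiers."""

    def __init__(self, capacities):
        # capacities[i] is the maximum number of entries tier i may hold.
        # Counting entries instead of bytes is a simplifying assumption.
        self.capacities = list(capacities)
        self.tiers = [OrderedDict() for _ in self.capacities]

    def put(self, key, value):
        # Drop any stale copy first so a key lives in at most one tier.
        for store in self.tiers:
            store.pop(key, None)
        self._insert(key, value, 0)

    def _insert(self, key, value, tier):
        if tier >= len(self.tiers):
            return  # fell off the slowest tier: the entry is discarded
        store = self.tiers[tier]
        store[key] = value  # newest entry goes to the end of the OrderedDict
        while len(store) > self.capacities[tier]:
            # Evict the least recently used entry and demote it one tier down.
            old_key, old_value = store.popitem(last=False)
            self._insert(old_key, old_value, tier + 1)

    def get(self, key):
        for store in self.tiers:
            if key in store:
                value = store.pop(key)
                self._insert(key, value, 0)  # promote to the fastest tier on a hit
                return value
        return None  # miss in every tier

    def tier_of(self, key):
        # Index of the tier currently holding key, or None if it is absent.
        for index, store in enumerate(self.tiers):
            if key in store:
                return index
        return None
```

With `TieredKVCache([1, 2])`, inserting four keys in a row leaves the newest in tier 0, the two before it in tier 1, and discards the oldest; a later `get` on a tier-1 key moves it back to tier 0 and demotes the entry it displaces. A real system would size tiers in bytes and would weigh transfer bandwidth when deciding what to move, which this sketch ignores.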