Auto-regressive inference of transformers benefits greatly from Key-Value (KV) caching, but the cache can become a major memory bottleneck as model size, batch size, and sequence length grow. We introduce Multi-Layer Key-Value (MLKV) sharing, a novel approach that extends KV sharing across transformer layers to reduce memory usage beyond what is possible with Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). Evaluations on various NLP benchmarks and inference metrics using uptrained Pythia-160M variants demonstrate that MLKV significantly reduces memory usage with minimal performance loss, shrinking the KV cache by up to a factor of 6 compared to MQA. These results highlight MLKV's potential for efficient deployment of transformer models at scale. We provide code at https://github.com/zaydzuhri/pythia-mlkv
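The memory argument in the abstract can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the authors' repository. It assumes a Pythia-160M-like shape (12 layers, 12 attention heads, head dimension 64), fp16 storage, and a hypothetical helper `kv_cache_bytes`. The batch size, sequence length, the GQA group count, and the MLKV setting of 2 KV heads shared across all layers are illustrative choices only.

```python
# Rough KV cache size estimate for different KV-sharing schemes.
# All shapes below are assumptions for illustration, not values from the paper.

def kv_cache_bytes(batch, seq_len, total_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every position.

    total_kv_heads counts distinct KV heads across the whole model,
    so sharing within a layer (GQA/MQA) or across layers (MLKV) lowers it.
    """
    # The leading 2 accounts for storing both keys and values.
    return 2 * batch * seq_len * total_kv_heads * head_dim * bytes_per_elem


LAYERS, HEADS, HEAD_DIM = 12, 12, 64  # assumed Pythia-160M-like shape
BATCH, SEQ_LEN = 8, 2048              # arbitrary example workload

total_kv_heads = {
    "MHA": LAYERS * HEADS,  # one KV head per query head in every layer
    "GQA": LAYERS * 4,      # assumed 4 KV groups per layer
    "MQA": LAYERS * 1,      # one KV head per layer
    "MLKV": 2,              # assumed 2 KV heads shared across all layers
}

sizes = {
    name: kv_cache_bytes(BATCH, SEQ_LEN, n, HEAD_DIM)
    for name, n in total_kv_heads.items()
}

for name, size in sizes.items():
    print(f"{name:5s} {size / 2**20:8.1f} MiB  ({sizes['MQA'] / size:.2f}x vs MQA)")
```

Under these assumptions MQA keeps 12 KV heads in total (one per layer), while the MLKV setting keeps 2, which corresponds to the 6x reduction relative to MQA quoted above. MQA cannot go below one KV head per layer, so any further saving has to come from sharing across layers.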
Zuhri et al. studied this question.
www.synapsesocial.com/papers/68e64e8bb6db6435875df297 — DOI: https://doi.org/10.48550/arxiv.2406.09297
Zayd Muhammad Kawakibi Zuhri
Muhammad Farid Adilazuarda
Ayu Purwarianti