June 13, 2024 · Open Access

MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding

Abstract

Auto-regressive inference of transformers benefits greatly from Key-Value (KV) caching, but the cache can become a major memory bottleneck as model size, batch size, and sequence length grow. We introduce Multi-Layer Key-Value (MLKV) sharing, a novel approach that extends KV sharing across transformer layers to reduce memory usage beyond what is possible with Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). Evaluations on various NLP benchmarks and inference metrics using uptrained Pythia-160M variants demonstrate that MLKV significantly reduces memory usage with minimal performance loss, reducing the KV cache size by a factor of up to 6x compared to MQA. These results highlight MLKV's potential for efficient deployment of transformer models at scale. We provide code at https://github.com/zaydzuhri/pythia-mlkv
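
To make the memory claim concrete, the following is a rough back-of-the-envelope sketch of how KV cache size scales with the total number of KV heads kept across all layers. The configuration values (12 layers, 12 query heads, head dimension 64, 16-bit cache entries, sequence length 2048) and the particular GQA and MLKV variants are assumptions chosen for illustration and are not stated in the abstract; the function and variable names are hypothetical and do not come from the authors' repository.

```python
# Illustrative KV cache arithmetic. All configuration values are assumptions.
NUM_LAYERS = 12        # assumed number of transformer layers
NUM_QUERY_HEADS = 12   # assumed number of query heads per layer
HEAD_DIM = 64          # assumed dimension of each head


def kv_cache_bytes(total_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """Bytes needed to cache keys and values, summed over all layers."""
    # Factor 2: one cached tensor for keys and one for values.
    return 2 * total_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Total KV heads stored across the whole model under each sharing scheme.
total_kv_heads = {
    "MHA": NUM_LAYERS * NUM_QUERY_HEADS,  # every query head has its own KV head
    "GQA-4": NUM_LAYERS * 4,              # assumed 4 KV head groups per layer
    "MQA": NUM_LAYERS * 1,                # one KV head per layer
    "MLKV-2": 2,                          # assumed 2 KV heads shared across all layers
}

sizes = {
    name: kv_cache_bytes(heads, HEAD_DIM, seq_len=2048, batch_size=1)
    for name, heads in total_kv_heads.items()
}

for name, size in sizes.items():
    print(f"{name:7s} {size / 2**20:8.2f} MiB  ({sizes['MQA'] / size:.2f}x vs MQA)")
```

Under these assumed numbers, MQA keeps 12 KV heads in total (one per layer) while the hypothetical MLKV variant keeps 2, which is where a 6x ratio relative to MQA would come from. The per-head cost is identical in every scheme, so the ratio depends only on the head counts.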

Cite this study

Zuhri et al. (2024), "MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding."

www.synapsesocial.com/papers/68e64e8bb6db6435875df297 — DOI: https://doi.org/10.48550/arxiv.2406.09297

Also consider

Synapse lists 5 closely related papers on similar research questions. Consider them for comparative context:

Reducing Transformer Key-Value Cache Size with Cross-Layer Attention · 2024 · 2 citations
Layer-Condensed KV Cache for Efficient Inference of Large Language Models · 2024
PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference · 2024 · 1 citation
ALISA: Accelerating Large Language Model Inference via Sparsity-Aware KV Caching · 2024 · 1 citation
Effectively Compress KV Heads for LLM · 2024

Authors

Zayd Muhammad Kawakibi Zuhri

Muhammad Farid Adilazuarda

Ayu Purwarianti
