August 10, 2024 · Open Access

Eigen Attention: Attention in Low-Rank Space for KV Cache Compression

Abstract

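Large language models (LLMs) represent a groundbreaking advance in natural language processing, owing to their impressive reasoning abilities. Recently, there has been considerable interest in increasing the context lengths of these models to broaden their applicability to complex tasks. However, at long context lengths and large batch sizes, the key-value (KV) cache, which stores the attention keys and values, emerges as the new bottleneck in memory usage during inference. To address this, we propose Eigen Attention, which performs the attention operation in a low-rank space and thereby reduces the KV cache memory overhead. The proposed approach is orthogonal to existing KV cache compression techniques and can be used synergistically with them. Through extensive experiments on the OPT, MPT, and Llama model families, we demonstrate that Eigen Attention reduces KV cache size by up to 40% and attention operation latency by up to 60%, with minimal drop in performance.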
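
The idea of attending in a low-rank space can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`top_basis`, `attention`, `demo`), the use of an SVD over synthetic calibration data to obtain the basis, the separate bases for keys and values, and the synthetic exactly low-rank inputs are all assumptions made for illustration. The sketch projects keys and values onto a rank-r orthonormal basis, caches only the projected matrices, and compares the result with full-dimensional attention.

```python
import numpy as np


def top_basis(samples, rank):
    """Return a (d, rank) orthonormal basis for the top directions of samples (n, d)."""
    # Rows of vt are eigenvectors of samples.T @ samples, ordered by
    # decreasing singular value.
    _, _, vt = np.linalg.svd(samples, full_matrices=False)
    return vt[:rank].T


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def attention(q, k, v, scale):
    """Single-head scaled dot-product attention without masking."""
    return softmax(q @ k.T * scale) @ v


def demo(d=64, true_rank=16, rank=16, n_calib=256, n_ctx=128, seed=0):
    """Compare full attention with attention computed in a rank-`rank` space.

    All data here is synthetic (an assumption): keys and values are drawn so
    that they lie exactly in a `true_rank`-dimensional subspace of R^d.
    """
    rng = np.random.default_rng(seed)
    mix_k = rng.standard_normal((true_rank, d))
    mix_v = rng.standard_normal((true_rank, d))

    # Hypothetical calibration pass: estimate the bases from sample keys/values.
    u_k = top_basis(rng.standard_normal((n_calib, true_rank)) @ mix_k, rank)
    u_v = top_basis(rng.standard_normal((n_calib, true_rank)) @ mix_v, rank)

    q = rng.standard_normal((4, d))
    k = rng.standard_normal((n_ctx, true_rank)) @ mix_k
    v = rng.standard_normal((n_ctx, true_rank)) @ mix_v
    scale = 1.0 / np.sqrt(d)  # keep the original head-dimension scaling

    full = attention(q, k, v, scale)

    # Only these (n_ctx, rank) matrices would need to be cached.
    k_low, v_low = k @ u_k, v @ u_v
    # Queries are projected with the key basis; the output is mapped back
    # to d dimensions with the value basis.
    low = attention(q @ u_k, k_low, v_low, scale) @ u_v.T

    max_err = float(np.abs(full - low).max())
    cache_ratio = (k_low.size + v_low.size) / (k.size + v.size)
    return max_err, cache_ratio


if __name__ == "__main__":
    err, ratio = demo()
    print(f"max abs error: {err:.2e}, cache size ratio: {ratio:.3f}")
```

With the default arguments the cached matrices occupy rank/d = 16/64 of the original storage, and because the synthetic keys and values lie exactly in a 16-dimensional subspace, the low-rank output matches full attention up to floating-point error. Choosing `rank` below `true_rank` shrinks the cache further but introduces approximation error. These toy figures are unrelated to the 40% and 60% results reported in the abstract.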

Authors

Utkarsh Saxena

Gobinda Saha

S. Choudhary
