April 6, 2024 · Open Access

SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget

Abstract

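Optimizing the key-value (KV) cache of large language models (LLMs) has long been considered critical to reducing inference cost. Most existing KV-cache compression algorithms sparsify the tokens within a sequence by exploiting the fact that tokens differ in importance. In this work, we find that identifying the importance of attention layers makes it possible to optimize the KV-cache jointly along two dimensions. Based on our observations of layer-wise importance during inference, we propose SqueezeAttention, which precisely allocates the KV-cache budget among layers on the fly and then incorporates three representative token sparsification algorithms to compress the KV-cache of each layer within that layer's own budget. By optimizing the KV-cache along both the sequence and layer dimensions, SqueezeAttention achieves memory reductions of roughly 30% to 70% and throughput improvements of up to 2.2x across a wide range of LLMs and benchmarks. The code is available at https://github.com/hetailang/SqueezeAttention.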
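The two-dimensional idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository above for that). It assumes that a per-layer importance score and a per-token score are already available, and that budgets are split proportionally to importance. The names `allocate_layer_budgets`, `keep_top_tokens`, and `min_per_layer` are hypothetical, and the proportional rule and top-k selection are stand-ins for whatever allocation and sparsification policies are actually used.

```python
from typing import List


def allocate_layer_budgets(
    importance: List[float], total_budget: int, min_per_layer: int = 1
) -> List[int]:
    """Layer dimension: split a total KV-cache budget (in tokens) across layers.

    Assumption: budgets are proportional to non-negative importance scores,
    with a guaranteed minimum per layer. This rule is illustrative only.
    """
    n = len(importance)
    if n == 0:
        return []
    if any(score < 0 for score in importance):
        raise ValueError("importance scores must be non-negative")
    if total_budget < n * min_per_layer:
        raise ValueError("total_budget is too small for the per-layer minimum")

    spare = total_budget - n * min_per_layer
    scores = importance if sum(importance) > 0 else [1.0] * n
    total = sum(scores)

    budgets = []
    running = 0.0   # cumulative importance seen so far
    assigned = 0    # spare tokens already handed out
    for i, score in enumerate(scores):
        running += score
        # Round the cumulative share so the budgets always sum exactly.
        target = spare if i == n - 1 else round(spare * running / total)
        target = min(max(target, assigned), spare)
        budgets.append(min_per_layer + target - assigned)
        assigned = target
    return budgets


def keep_top_tokens(token_scores: List[float], budget: int) -> List[int]:
    """Sequence dimension: positions of the `budget` highest-scoring tokens.

    Assumption: a higher score means a more important token. Positions are
    returned in ascending order so the cached sequence order is preserved.
    """
    ranked = sorted(
        range(len(token_scores)), key=lambda i: token_scores[i], reverse=True
    )
    return sorted(ranked[: max(budget, 0)])


if __name__ == "__main__":
    budgets = allocate_layer_budgets([3.0, 1.0], total_budget=8, min_per_layer=2)
    print(budgets)
    print(keep_top_tokens([0.1, 0.9, 0.3, 0.7], budgets[1] - 1))
```

In this sketch the more important layer receives a larger share of the cache, and each layer then evicts its lowest-scoring tokens until it fits its own budget.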

Authors

Zihao Wang

Shaoduo Gan