June 21, 2024 · Open Access

Optimised Grouped-Query Attention Mechanism for Transformers

Abstract

Grouped-query attention (GQA) has been widely adopted in LLMs to mitigate the complexity of multi-head attention (MHA). To transform an MHA to a GQA, neighbour queries in MHA are evenly split into groups where each group shares the value and key layers. In this work, we propose AsymGQA, an activation-informed approach to asymmetrically grouping an MHA to a GQA for better model performance. Our AsymGQA outperforms the GQA within the same model size budget. For example, AsymGQA LLaMA-2-7B has an accuracy increase of 7.5% on MMLU compared to neighbour grouping. Our approach addresses the GQA's trade-off problem between model performance and hardware efficiency.
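The neighbour grouping that the abstract uses as its baseline can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`neighbour_groups`, `mean_pool`, `mha_to_gqa`) are hypothetical, and merging the heads in a group by element-wise mean pooling is an assumption, since the abstract only says that each group shares the key and value layers. The sketch does not show how AsymGQA uses activations to choose its groups.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def neighbour_groups(num_heads: int, num_groups: int) -> List[List[int]]:
    """Evenly split head indices 0..num_heads-1 into contiguous groups."""
    if num_groups <= 0 or num_heads % num_groups != 0:
        raise ValueError("num_heads must be a positive multiple of num_groups")
    size = num_heads // num_groups
    return [list(range(g * size, (g + 1) * size)) for g in range(num_groups)]


def mean_pool(heads: Sequence[Matrix], group: Sequence[int]) -> Matrix:
    """Element-wise mean of the per-head matrices listed in `group`.

    Assumption: mean pooling is used only for illustration; the source
    does not state how the shared projection is obtained.
    """
    rows, cols = len(heads[0]), len(heads[0][0])
    return [
        [sum(heads[h][r][c] for h in group) / len(group) for c in range(cols)]
        for r in range(rows)
    ]


def mha_to_gqa(
    key_heads: Sequence[Matrix],
    value_heads: Sequence[Matrix],
    groups: Sequence[Sequence[int]],
) -> Tuple[List[Matrix], List[Matrix]]:
    """Return one shared key and one shared value projection per group."""
    shared_k = [mean_pool(key_heads, g) for g in groups]
    shared_v = [mean_pool(value_heads, g) for g in groups]
    return shared_k, shared_v


# Toy example: 4 heads, each with a 1x1 projection matrix.
key_heads = [[[float(h)]] for h in range(4)]
value_heads = [[[float(10 * h)]] for h in range(4)]

groups = neighbour_groups(4, 2)  # contiguous, equally sized groups
shared_k, shared_v = mha_to_gqa(key_heads, value_heads, groups)
```

Because `mha_to_gqa` accepts any list of index groups, an uneven grouping such as `[[0], [1, 2, 3]]` can be passed in place of the neighbour split. That shows only what an asymmetric grouping looks like structurally; choosing the groups from activations is the paper's contribution and is not reproduced here.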

Cite this study

Chen et al. (2024) studied this question.

www.synapsesocial.com/papers/68e63e20b6db6435875cfb23 — DOI: https://doi.org/10.48550/arxiv.2406.14963

Authors

Yuang Chen

Cheng Zhang

Xitong Gao
