May 14, 2024 | Open Access

Improving Transformers with Dynamically Composable Multi-Head Attention

Abstract

Multi-Head Attention (MHA) is a key component of the Transformer. In MHA, attention heads work independently, causing problems such as the low-rank bottleneck of attention score matrices and head redundancy. We propose Dynamically Composable Multi-Head Attention (DCMHA), a parameter- and computation-efficient attention architecture that tackles the shortcomings of MHA and increases the expressive power of the model by dynamically composing attention heads. At the core of DCMHA is a Compose function that transforms the attention score and weight matrices in an input-dependent way. DCMHA can be used as a drop-in replacement for MHA in any Transformer architecture to obtain the corresponding DCFormer. DCFormer significantly outperforms Transformer on different architectures and model scales in language modeling, matching the performance of models with ~1.7x-2.0x compute. For example, DCPythia-6.9B outperforms open source Pythia-12B on both pretraining perplexity and downstream task evaluation. The code and models are available at https://github.com/Caiyun-AI/DCFormer.
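To make the idea of composing heads concrete, here is a minimal NumPy sketch of attention in which the per-head score matrices (before softmax) and weight matrices (after softmax) are mixed across the head dimension by a matrix that depends on the query-side input. It is an illustration under stated assumptions, not the authors' implementation: the function names (`compose`, `dcmha_sketch`), the full-rank parameterization (`w_static`, `w_dyn`), the residual form, and the causal mask are all hypothetical choices made here. The abstract does not specify the Compose function, so the linked repository remains the reference.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; exp(-inf) evaluates to 0 for masked entries.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def compose(a, x, w_static, w_dyn):
    """Mix per-head attention matrices across heads (hypothetical form).

    a:        (H, T, S) one attention matrix per head
    x:        (T, D)    query-side inputs that drive the dynamic part
    w_static: (H, H)    input-independent head mixing
    w_dyn:    (D, H, H) maps each query-side input to a head-mixing matrix
    """
    # One (H, H) mixing matrix per query position: static plus input-dependent.
    mix = w_static + np.einsum("td,dgh->tgh", x, w_dyn)
    # Residual form: each output head is itself plus a mixture of all heads
    # evaluated at the same (query, key) position.
    return a + np.einsum("gts,tgh->hts", a, mix)


def dcmha_sketch(x, wq, wk, wv, pre, post):
    """Causal self-attention with head composition before and after softmax.

    x: (T, D); wq, wk, wv: (H, D, dk); pre, post: (w_static, w_dyn) tuples.
    """
    n_heads, _, dk = wq.shape
    seq_len = x.shape[0]
    q = np.einsum("td,hdk->htk", x, wq)
    k = np.einsum("td,hdk->htk", x, wk)
    v = np.einsum("td,hdk->htk", x, wv)

    scores = np.einsum("htk,hsk->hts", q, k) / np.sqrt(dk)
    scores = compose(scores, x, *pre)            # compose attention scores

    causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    weights = softmax(scores, axis=-1)
    weights = compose(weights, x, *post)         # compose attention weights

    out = np.einsum("hts,hsk->htk", weights, v)
    return out.transpose(1, 0, 2).reshape(seq_len, n_heads * dk)
```

With both mixing parameters set to zero, `compose` is the identity and the function reduces to ordinary causal multi-head attention (without an output projection), which is one way to read the "drop-in replacement" claim. Because the mixing acts only across heads at a fixed (query, key) position, entries masked to zero after softmax stay zero, so causality is preserved in this sketch.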

Cite this study

Da et al. (2024), "Improving Transformers with Dynamically Composable Multi-Head Attention."

www.synapsesocial.com/papers/68e6a4e2b6db643587627ad4 — DOI: https://doi.org/10.48550/arxiv.2405.08553

Authors

Xiao Da

Qingye Meng

Shengping Li
