We present Lightning Attention, the first linear attention implementation that maintains a constant training speed across sequence lengths under fixed memory consumption. Because they rely on cumulative summation (cumsum), previous linear attention implementations cannot achieve their theoretical advantage in a causal setting. This issue can be solved by using different attention calculation strategies for different parts of the attention computation. Specifically, we split the attention calculation into intra-block and inter-block parts, using conventional attention computation for the intra-block part and the linear attention kernel trick for the inter-block part. This eliminates the need for cumsum in the linear attention calculation. Furthermore, a tiling technique is adopted in both the forward and backward passes to take full advantage of the GPU hardware. To enhance accuracy while preserving efficacy, we introduce TransNormerLLM (TNL), a new architecture tailored to our lightning attention. We conduct rigorous testing on standard and self-collected datasets with varying model sizes and sequence lengths. TNL is notably more efficient than other language models. In addition, benchmark results indicate that TNL performs on par with state-of-the-art LLMs that use conventional transformer structures. The source code is released at github.com/OpenNLPLab/TransnormerLLM.
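The intra-block/inter-block split described above can be illustrated with a minimal NumPy sketch. This is not the released implementation: it assumes unnormalized causal linear attention of the form O = (QK^T ⊙ M)V with a lower-triangular mask M, and it leaves out any decay term, normalization, GPU tiling, and the backward pass. The function names and the `block_size` parameter are hypothetical and chosen only for illustration.

```python
import numpy as np

def causal_linear_attention_reference(q, k, v):
    """Quadratic reference: O = (Q K^T * M) V with a lower-triangular mask M."""
    n = q.shape[0]
    mask = np.tril(np.ones((n, n)))
    return ((q @ k.T) * mask) @ v

def blockwise_linear_attention(q, k, v, block_size):
    """Block-wise sketch (assumed form, no decay or normalization).

    Intra-block: conventional masked attention inside each block.
    Inter-block: kernel trick, Q_t @ KV, where KV accumulates K^T V
    over all earlier blocks, so no per-token cumsum is needed.
    """
    n, d = q.shape
    out = np.zeros((n, v.shape[1]))
    kv = np.zeros((d, v.shape[1]))  # running sum of K_j^T V_j over earlier blocks
    for start in range(0, n, block_size):
        end = min(start + block_size, n)
        qb, kb, vb = q[start:end], k[start:end], v[start:end]
        b = end - start
        mask = np.tril(np.ones((b, b)))
        intra = ((qb @ kb.T) * mask) @ vb  # left product within the block
        inter = qb @ kv                    # right product against earlier blocks
        out[start:end] = intra + inter
        kv += kb.T @ vb                    # update state after using it
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((10, 4)) for _ in range(3))
ref = causal_linear_attention_reference(q, k, v)
blk = blockwise_linear_attention(q, k, v, block_size=3)
```

In this sketch the two functions should agree up to floating-point error for any block size, including one that does not divide the sequence length evenly. The per-block cost depends on the block size and feature dimension rather than on the full sequence length.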
Zhen Qin
Weigao Sun
Dong Li
Qin et al. (Mon,) studied this question.
www.synapsesocial.com/papers/68e68593b6db64358760dfa2 — DOI: https://doi.org/10.48550/arxiv.2405.17381