July 11, 2024 | Open Access

FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

Abstract

Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. FlashAttention elaborated an approach to speed up attention on GPUs through minimizing memory reads/writes. However, it has yet to take advantage of new capabilities present in recent hardware, with FlashAttention-2 achieving only 35% utilization on the H100 GPU. We develop three main techniques to speed up attention on Hopper GPUs: exploiting asynchrony of the Tensor Cores and TMA to (1) overlap overall computation and data movement via warp-specialization and (2) interleave block-wise matmul and softmax operations, and (3) block quantization and incoherent processing that leverages hardware support for FP8 low-precision. We demonstrate that our method, FlashAttention-3, achieves speedup on H100 GPUs by 1.5-2.0× with FP16 reaching up to 740 TFLOPs/s (75% utilization), and with FP8 reaching close to 1.2 PFLOPs/s. We validate that FP8 FlashAttention-3 achieves 2.6× lower numerical error than a baseline FP8 attention.
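The third technique in the abstract, block quantization combined with incoherent processing, can be illustrated with a small sketch. This is not the paper's kernel. It is a pure-Python toy under stated assumptions: the helper names (`hadamard_transform`, `incoherent`, `quantize_blockwise`) are hypothetical, a uniform rounding grid stands in for FP8, and a random-sign Hadamard transform is used as the orthogonal map. It shows two things: applying the same orthogonal map to a query and a key leaves their dot product unchanged while spreading an outlier across all features, and giving each block its own scale keeps one outlier from degrading the precision of the other blocks.

```python
import math
import random

def hadamard_transform(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = list(x)
    n = len(y)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]

def incoherent(x, signs):
    """Random sign flips followed by Hadamard: an orthogonal map (hypothetical helper)."""
    return hadamard_transform([s * v for s, v in zip(signs, x)])

def quantize_blockwise(x, block_size, levels=127):
    """Round each block to a uniform grid with its own scale (a stand-in for FP8)."""
    out = []
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        amax = max(abs(v) for v in block) or 1.0  # avoid a zero scale
        scale = amax / levels
        out.extend(round(v / scale) * scale for v in block)
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def sq_err(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

random.seed(0)
d = 64
signs = [random.choice((-1.0, 1.0)) for _ in range(d)]
q = [random.gauss(0.0, 1.0) for _ in range(d)]
k = [random.gauss(0.0, 1.0) for _ in range(d)]
q[3] = 50.0  # one large outlier feature

# Incoherent processing: the same orthogonal map on q and k preserves q . k.
q_rot, k_rot = incoherent(q, signs), incoherent(k, signs)
exact, rotated = dot(q, k), dot(q_rot, k_rot)
peak_before = max(abs(v) for v in q)
peak_after = max(abs(v) for v in q_rot)

# Block quantization: one scale per 16 entries versus one scale for all 64.
err_tensor = sq_err(q, quantize_blockwise(q, block_size=d))
err_block = sq_err(q, quantize_blockwise(q, block_size=16))

print(f"dot product drift after rotation: {abs(exact - rotated):.2e}")
print(f"largest |q| entry: {peak_before:.2f} -> {peak_after:.2f}")
print(f"squared error, per-tensor vs per-block: {err_tensor:.4f} vs {err_block:.4f}")
```

In this toy, the rotation shrinks the largest entry of the query because the outlier's energy is shared across all 64 coordinates, and the per-block scales leave the three outlier-free blocks with a much finer grid. Real FP8 formats have non-uniform spacing, so the numbers here are only qualitative.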

Cite this study

Shah et al. (2024). FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision.

www.synapsesocial.com/papers/68e609bdb6db64358759caf5, DOI: https://doi.org/10.48550/arxiv.2407.08608

Authors

Jay P. Shah

Ganesh Bikshandi

Ying Zhang
