April 16, 2024

Raptor-T: A Fused and Memory-Efficient Sparse Transformer for Long and Variable-Length Sequences

Key Points

Key points are not available for this paper at this time.

Abstract

Transformer-based models have made significant advancements across various domains, largely due to the self-attention mechanism's ability to capture contextual relationships in input sequences. However, processing long sequences remains computationally expensive for Transformer models, primarily due to the O ( n 2 ) complexity associated with self-attention. To address this, sparse attention has been proposed to reduce the quadratic dependency to linear. Nevertheless, deploying the sparse transformer efficiently encounters two major obstacles: 1) Existing system optimizations are less effective for the sparse transformer due to the algorithm's approximation properties leading to fragmented attention, and 2) the variability of input sequences results in computation and memory access inefficiencies. We present Raptor-T, a cutting-edge transformer framework designed for handling long and variable-length sequences. Raptor-T harnesses the power of the sparse transformer to reduce resource requirements for processing long sequences while also implementing system-level optimizations to accelerate inference performance. To address the fragmented attention issue, Raptor-T employs fused and memory-efficient Multi-Head Attention. Additionally, we introduce an asynchronous data processing method to mitigate GPU-blocking operations caused by sparse attention. Furthermore, Raptor-T minimizes padding for variable-length inputs, effectively reducing the overhead associated with padding and achieving balanced computation on GPUs. In evaluation, we compare Raptor-T's performance against state-of-the-art frameworks on an NVIDIA A100 GPU. The experimental results demonstrate that Raptor-T outperforms FlashAttention-2 and FasterTransformer, achieving an impressive average end-to-end performance improvement of 3.41X and 3.71X, respectively.

Connected Papers

Building similarity graph...

Analyzing shared references across papers

Discussion

Authors

H. Wang

Donglin Yang

Yaqi Xia

Journals

IEEE Transactions on Computers

Actions

Institutions

Wuhan University

University of Macau

Nvidia (United States)

References and Citations

Connected Papers

Building similarity graph...

Analyzing shared references across papers

Raptor-T: A Fused and Memory-Efficient Sparse Transformer for Long and Variable-Length Sequences

Key Points

Abstract

Citation Network

Connected Papers

Discussion

Authors

Journals

Actions

Institutions

References and Citations

Citation Network

Connected Papers

Discussion

Cite this study

Also consider