Key points are not available for this paper at this time.
Transformer-based models have made significant advancements across various domains, largely due to the self-attention mechanism's ability to capture contextual relationships in input sequences. However, processing long sequences remains computationally expensive for Transformer models, primarily due to the O ( n 2 ) complexity associated with self-attention. To address this, sparse attention has been proposed to reduce the quadratic dependency to linear. Nevertheless, deploying the sparse transformer efficiently encounters two major obstacles: 1) Existing system optimizations are less effective for the sparse transformer due to the algorithm's approximation properties leading to fragmented attention, and 2) the variability of input sequences results in computation and memory access inefficiencies. We present Raptor-T, a cutting-edge transformer framework designed for handling long and variable-length sequences. Raptor-T harnesses the power of the sparse transformer to reduce resource requirements for processing long sequences while also implementing system-level optimizations to accelerate inference performance. To address the fragmented attention issue, Raptor-T employs fused and memory-efficient Multi-Head Attention. Additionally, we introduce an asynchronous data processing method to mitigate GPU-blocking operations caused by sparse attention. Furthermore, Raptor-T minimizes padding for variable-length inputs, effectively reducing the overhead associated with padding and achieving balanced computation on GPUs. In evaluation, we compare Raptor-T's performance against state-of-the-art frameworks on an NVIDIA A100 GPU. The experimental results demonstrate that Raptor-T outperforms FlashAttention-2 and FasterTransformer, achieving an impressive average end-to-end performance improvement of 3.41X and 3.71X, respectively.
Building similarity graph...
Analyzing shared references across papers
Loading...
H. Wang
Donglin Yang
Yaqi Xia
IEEE Transactions on Computers
Wuhan University
University of Macau
Nvidia (United States)
Building similarity graph...
Analyzing shared references across papers
Loading...
Wang et al. (Tue,) studied this question.
www.synapsesocial.com/papers/68e6ee25b6db643587669696 — DOI: https://doi.org/10.1109/tc.2024.3389507
Synapse has enriched 2 closely related papers on similar clinical questions. Consider them for comparative context: