Transformers dominate natural language processing and represent a milestone innovation in deep learning. For high-performance model inference, optimizing the time-consuming attention module is crucial. Owing to irregularly shaped matrix workloads and intricate data access patterns, the attention operator is bound by memory bandwidth. Existing works use kernel fusion to reduce memory access overhead and report promising performance gains. However, these efforts focus primarily on GPU or x86 architectures, leaving ARM multi-cores, which are common in emerging HPC systems, insufficiently explored. We present MEATTEN, a memory-efficient attention fusion scheme and batched approach that exploits ARM multi-core CPUs effectively. It builds on fused micro-kernels and a new data layout suited to SIMD vectorization. An analytic model guides loop permutation, tiling, and batched parallelization according to the on-chip hierarchical memory architecture and the workload characteristics. We evaluate MEATTEN on three representative ARM multi-cores against state-of-the-art libraries and compilers. Experimental results demonstrate that our approach consistently outperforms prior approaches across various evaluation scenarios and platforms.
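To make the idea of attention fusion concrete, the following is a minimal sketch in plain Python. It is not MEATTEN's micro-kernel or data layout, which the abstract does not spell out; it only illustrates the general technique of fusing the score computation, softmax, and value-weighted sum so that the full score matrix is never written to memory. The function names, the `tile` parameter, and the use of a running-maximum ("online") softmax are assumptions chosen for illustration.

```python
import math


def attention_unfused(Q, K, V):
    """Reference version: materializes the full score matrix before softmax."""
    d = len(Q[0])
    dv = len(V[0])
    scale = 1.0 / math.sqrt(d)
    # Step 1: full score matrix S = Q K^T / sqrt(d), stored in memory.
    S = [[scale * sum(q[t] * k[t] for t in range(d)) for k in K] for q in Q]
    # Step 2: row-wise softmax, stored as a second full matrix P.
    P = []
    for row in S:
        m = max(row)
        e = [math.exp(s - m) for s in row]
        z = sum(e)
        P.append([x / z for x in e])
    # Step 3: output O = P V.
    return [[sum(p[j] * V[j][c] for j in range(len(V))) for c in range(dv)]
            for p in P]


def attention_fused(Q, K, V, tile=2):
    """Fused version: walks K/V in tiles and keeps only per-row running state.

    `tile` is a hypothetical tiling factor; a real kernel would pick it from
    cache and register sizes.
    """
    d = len(Q[0])
    dv = len(V[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in Q:
        m = float("-inf")   # running maximum of the scores seen so far
        z = 0.0             # running softmax denominator
        acc = [0.0] * dv    # running un-normalized output row
        for start in range(0, len(K), tile):
            k_tile = K[start:start + tile]
            v_tile = V[start:start + tile]
            scores = [scale * sum(q[t] * k[t] for t in range(d)) for k in k_tile]
            m_new = max(m, max(scores))
            # Rescale earlier partial results to the new maximum.
            corr = math.exp(m - m_new)
            z *= corr
            acc = [a * corr for a in acc]
            for s, v in zip(scores, v_tile):
                w = math.exp(s - m_new)
                z += w
                for c in range(dv):
                    acc[c] += w * v[c]
            m = m_new
        out.append([a / z for a in acc])
    return out
```

Both functions compute the same result up to floating-point rounding. The fused variant holds only one tile of scores plus a small amount of per-row state at a time, which is the property that reduces memory traffic for a bandwidth-bound operator.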
Xiao Fu
Weiling Yang
Dezun Dong
National University of Defense Technology
Fu et al. studied this question.
www.synapsesocial.com/papers/68e67aa1b6db643587604fe6 — DOI: https://doi.org/10.1145/3650200.3656620