Transformer-based large language models (LLMs) are increasingly deployed in high-performance computing environments, where the attention mechanism often becomes a key bottleneck during inference. Although state-of-the-art attention algorithms (e.g., FlashAttention) achieve high efficiency on GPUs, they are ill-suited to emerging heterogeneous many-core processors. In this work, we focus on MT-3000, a representative architecture deployed in the new-generation Tianhe supercomputer, and identify three principal challenges in realizing high-performance attention: a complex multi-tier memory hierarchy that requires manual data movement, excessive reduction overhead caused by sub-tile softmax operations, and static execution pipelines that fail to adapt to inference phases and sequence lengths. To overcome these challenges, we propose DeferAttention, a high-performance attention implementation designed for the MT-3000 many-core processor. DeferAttention introduces a novel deferred-reduction attention strategy that decouples reduction from the fused compute pipeline, enabling more efficient aggregation over large tiles. Moreover, DeferAttention adopts a memory-centric operator design, comprising data tiling, multi-level software pipelining, and modular micro-kernels, to maximize data reuse and execution throughput. Finally, to support runtime-adaptive execution, DeferAttention integrates a lightweight kernel selection strategy guided by an analytical cost model. Experimental results show that DeferAttention achieves up to 98% of the theoretical peak performance at the micro-kernel level and 85% at the operator level, outperforming baseline implementations and significantly accelerating end-to-end inference.
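To make the contrast between sub-tile softmax reductions and a deferred reduction concrete, the following is a minimal, single-query sketch in plain Python. It is an illustration of the general idea only, not the DeferAttention kernel: the function names `attention_subtile` and `attention_deferred`, the `tile` parameter, and the reduction counter are all hypothetical, and nothing here models the MT-3000 memory hierarchy, software pipelining, or the cost model. The first function performs a max/sum reduction and a rescale after every sub-tile (in the style of online softmax); the second, under the assumption that the scores of a large tile can be buffered, computes all scores first and performs a single reduction at the end.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attention_subtile(q, keys, values, tile):
    """Single-query attention with a softmax reduction after every sub-tile."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf            # running maximum of the scores
    denom = 0.0                        # running softmax denominator
    acc = [0.0] * len(values[0])       # running weighted sum of values
    reductions = 0
    for start in range(0, len(keys), tile):
        scores = [scale * dot(q, k) for k in keys[start:start + tile]]
        new_max = max(running_max, max(scores))
        corr = math.exp(running_max - new_max)   # rescale earlier partial results
        denom *= corr
        acc = [a * corr for a in acc]
        for s, v in zip(scores, values[start:start + tile]):
            p = math.exp(s - new_max)
            denom += p
            acc = [a + p * x for a, x in zip(acc, v)]
        running_max = new_max
        reductions += 1                # one max/sum reduction per sub-tile
    return [a / denom for a in acc], reductions


def attention_deferred(q, keys, values, tile):
    """Single-query attention that buffers scores and reduces once at the end."""
    scale = 1.0 / math.sqrt(len(q))
    scores = []
    for start in range(0, len(keys), tile):
        # Compute-only phase: no max, sum, or rescale inside the tile loop.
        scores.extend(scale * dot(q, k) for k in keys[start:start + tile])
    # Deferred phase: a single reduction over the whole buffered tile.
    top = max(scores)
    probs = [math.exp(s - top) for s in scores]
    denom = sum(probs)
    dim = len(values[0])
    out = [sum(p * v[j] for p, v in zip(probs, values)) / denom for j in range(dim)]
    return out, 1
```

Both functions return the same attention output up to floating-point rounding; they differ only in how often the reduction and rescale steps run. The sketch trades extra buffer space for fewer reductions, which is the kind of trade-off a deferred strategy would have to weigh against on-chip memory capacity.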
Qi et al. (Mon,) studied this question.