What question did this study set out to answer?

This work aims to optimize the attention mechanism for large language models on many-core processors, specifically the MT-3000 architecture.

April 15, 2026 · Open Access

Optimizing Attention for Large Language Model Inference on the MT-3000 Many-Core Processor

Key Points

Identify the challenges of implementing attention on many-core processors
Develop DeferAttention, built on a deferred-reduction strategy
Implement a memory-centric operator design, including data tiling and software pipelining
Integrate a kernel selection strategy guided by an analytical cost model
DeferAttention achieves up to 98% of theoretical peak at the micro-kernel level
It achieves 85% of theoretical peak at the operator level
It outperforms baseline attention implementations
It accelerates end-to-end inference for large language models

Abstract

Transformer-based large language models (LLMs) are increasingly deployed in high-performance computing environments, where the attention mechanism often becomes a key bottleneck during inference. Although state-of-the-art attention algorithms (e.g., FlashAttention) achieve high efficiency on GPUs, they are ill-suited to emerging heterogeneous many-core processors. In this work, we focus on MT-3000, a representative architecture deployed in the new-generation Tianhe supercomputer, and identify three principal challenges in realizing high-performance attention: complex multi-tier memory requiring manual data movement, excessive reduction overhead caused by sub-tile softmax operations, and static execution pipelines that fail to adapt to inference phases and sequence lengths. To overcome these challenges, we propose DeferAttention, a high-performance attention implementation designed for the MT-3000 many-core processor. DeferAttention introduces a novel deferred-reduction attention strategy to decouple reduction from the fused compute pipeline, enabling more efficient aggregation over large tiles. Moreover, DeferAttention adopts a memory-centric operator design, including data tiling, multi-level software pipelining, and modular micro-kernels, to maximize data reuse and execution throughput. Finally, to support runtime-adaptive execution, DeferAttention integrates a lightweight kernel selection strategy guided by an analytical cost model. Experimental results show that DeferAttention achieves up to 98% of the theoretical peak at the micro-kernel level and 85% at the operator level, outperforming baseline implementations and significantly accelerating end-to-end inference.
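
As a rough illustration of why reduction frequency matters in tiled attention, the sketch below computes single-query attention with an online softmax and counts how often the row maximum and sum are reduced. It is not the DeferAttention algorithm, which the abstract does not specify. The function names, the `score_tile` and `reduce_tile` parameters, and the idea of buffering several score sub-tiles before one reduction are all assumptions made for illustration.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reference_attention(q, K, V):
    """Plain softmax(q . K^T / sqrt(d)) . V for a single query row."""
    scale = 1.0 / math.sqrt(len(q))
    s = [dot(q, k) * scale for k in K]
    m = max(s)
    p = [math.exp(x - m) for x in s]
    total = sum(p)
    return [sum(p[j] * V[j][c] for j in range(len(V))) / total
            for c in range(len(V[0]))]


def tiled_attention(q, K, V, score_tile, reduce_tile):
    """Online-softmax attention with a configurable reduction granularity.

    Hypothetical illustration: scores are produced `score_tile` keys at a
    time, but the max/sum reduction and rescaling happen only once per
    `reduce_tile` keys. reduce_tile == score_tile mimics a per-sub-tile
    reduction; a larger reduce_tile defers it across several sub-tiles.
    Returns (output, number_of_reduction_steps).
    """
    assert reduce_tile % score_tile == 0
    scale = 1.0 / math.sqrt(len(q))
    d_v = len(V[0])
    m, l = float("-inf"), 0.0      # running row max and softmax denominator
    acc = [0.0] * d_v              # running unnormalized output
    reductions = 0
    for start in range(0, len(K), reduce_tile):
        end = min(start + reduce_tile, len(K))
        # Score phase: fill the buffer sub-tile by sub-tile, no reductions.
        buf = []
        for s0 in range(start, end, score_tile):
            s1 = min(s0 + score_tile, end)
            buf.extend(dot(q, K[j]) * scale for j in range(s0, s1))
        # Reduction phase: one max/sum/rescale for the whole buffer.
        m_new = max(m, max(buf))
        correction = math.exp(m - m_new)   # 0.0 on the first tile (m = -inf)
        p = [math.exp(x - m_new) for x in buf]
        l = l * correction + sum(p)
        acc = [acc[c] * correction
               + sum(p[j] * V[start + j][c] for j in range(len(p)))
               for c in range(d_v)]
        m = m_new
        reductions += 1
    return [a / l for a in acc], reductions


# Small deterministic example: 64 keys, head dimension 8.
rng = random.Random(0)
n, d = 64, 8
q = [rng.uniform(-1, 1) for _ in range(d)]
K = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
V = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]

ref = reference_attention(q, K, V)
out_sub, n_sub = tiled_attention(q, K, V, score_tile=8, reduce_tile=8)
out_def, n_def = tiled_attention(q, K, V, score_tile=8, reduce_tile=32)
```

With 64 keys, reducing after every 8-key sub-tile takes 8 reduction steps, while deferring to 32-key tiles takes 2. Both variants match the reference result up to floating-point rounding. The trade-off in this sketch is a larger score buffer in exchange for fewer reductions.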

Cite this study

Qi et al. studied this question.

www.synapsesocial.com/papers/69df2abce4eeef8a2a6afbfa — DOI: https://doi.org/10.1145/3807449

Also consider

Synapse has enriched 5 closely related papers on similar questions. Consider them for comparative context:

Parallel GEMM-based convolution for deep learning on multicore RISC-V processors · 2024 · 4 citations
NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing · 2024 · 85 citations
Performance Evaluation of MindSpore and PyTorch Based on Ascend NPU · 2023 · 4 citations
DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale · 2022 · 217 citations
Efficient Memory Management for Large Language Model Serving with PagedAttention

Authors

Xinxin Qi

Jianbin Fang

Peng Zhang

Journals

ACM Transactions on Architecture and Code Optimization

Institutions

National University of Defense Technology
