What question did this study set out to answer?

The central aim is to improve the efficiency of Transformer softmax in low-power neural processing units (NPUs) for on-device inference of large language models (LLMs).

March 22, 2026 · Open Access

Attention Distribution-Aware Softmax for NPU-Accelerated On-Device Inference of LLMs: An Edge-Oriented Approximation Design

Key Points

The central aim is to improve the efficiency of Transformer softmax in low-power neural processing units (NPUs) for on-device inference of large language models (LLMs).
Proposes an attention distribution-aware softmax that uses particle swarm optimization (PSO) to define non-uniform segments and variable polynomial degrees.
Snaps segment boundaries to a 128-bin lookup table (LUT) so that segment parameters are retrieved in O(1) without branching.
Prioritizes finer granularity and lower arithmetic complexity in attention-dense regions.
Reduces exp-kernel cycles per call (CPC) by 18.5% compared to an equidistant uniform Degree-4 baseline.
Reduces CPC by 13.1% compared to a uniform Degree-3 baseline.
Preserves attention ranking fidelity alongside these cycle reductions.

Abstract

Low-power NPUs enable on-device LLM inference through efficient integer and fixed-point arithmetic, yet their lack of native exponential support makes Transformer softmax a critical performance bottleneck. Existing NPU kernels approximate the exponential with uniform piecewise polynomials to enable O(1) SIMD indexing, but this wastes computation by applying high-degree arithmetic indiscriminately in every segment. Conversely, fully adaptive approaches maximize statistical fidelity but introduce pipeline stalls due to comparator-based boundary search. To bridge this gap, we propose an attention distribution-aware softmax that uses Particle Swarm Optimization (PSO) to define non-uniform segments and variable polynomial degrees, prioritizing finer granularity and lower arithmetic complexity in attention-dense regions. To ensure efficiency, we snap boundaries to a 128-bin LUT, enabling O(1) retrieval of segment parameters without branching. Inference measurements show that this design favors low-degree execution, minimizing exp-kernel overhead. Using TinyLlama-1.1B-Chat as a testbed, the proposed weighted design reduces exp-kernel cycles per call (CPC) by 18.5% versus an equidistant uniform Degree-4 baseline and by 13.1% versus uniform Degree-3. These results show that grid-snapped, variable-degree approximation can improve softmax efficiency while largely preserving attention ranking fidelity, enabling accurate edge LLM inference.
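The grid-snapped lookup described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The input range of [-16, 0], the segment edges, and the per-segment degrees are hypothetical values chosen for illustration. Taylor coefficients around each segment midpoint stand in for the PSO-fitted coefficients, and floating point stands in for the NPU's fixed-point arithmetic. Only the 128-bin LUT size comes from the text.

```python
import math

# Illustrative assumptions throughout; only NUM_BINS comes from the abstract.
X_MIN, X_MAX = -16.0, 0.0            # max-subtracted softmax inputs are <= 0
NUM_BINS = 128                       # LUT size stated in the abstract
BIN_W = (X_MAX - X_MIN) / NUM_BINS   # 0.125

# Hypothetical segments as (left edge, polynomial degree). Every edge is a
# multiple of BIN_W, i.e. already snapped to the 128-bin grid. Segments near
# zero are narrower and use a lower degree.
SEGMENTS = [
    (-16.0, 4), (-8.0, 2), (-4.0, 2), (-2.0, 2),
    (-1.0, 1), (-0.75, 1), (-0.5, 1), (-0.25, 1),
]

def build_tables():
    """Return per-segment (center, coeffs) and a bin -> segment index LUT."""
    edges = [left for left, _ in SEGMENTS] + [X_MAX]
    params = []
    for i, (left, degree) in enumerate(SEGMENTS):
        center = 0.5 * (left + edges[i + 1])
        # Taylor coefficients of exp around the midpoint (a stand-in for
        # coefficients that an optimizer such as PSO would fit).
        coeffs = [math.exp(center) / math.factorial(k) for k in range(degree + 1)]
        params.append((center, coeffs))
    lut = []
    for b in range(NUM_BINS):
        bin_left = X_MIN + b * BIN_W
        # Last segment whose left edge is at or before this bin.
        lut.append(max(i for i, (left, _) in enumerate(SEGMENTS) if left <= bin_left))
    return params, lut

PARAMS, LUT = build_tables()

def exp_approx(x):
    """Approximate exp(x) for x in [X_MIN, X_MAX] without a boundary search."""
    x = min(max(x, X_MIN), X_MAX)
    b = min(int((x - X_MIN) / BIN_W), NUM_BINS - 1)   # O(1) bin index
    center, coeffs = PARAMS[LUT[b]]                   # O(1) parameter fetch
    t = x - center
    acc = 0.0
    for a in reversed(coeffs):                        # Horner evaluation
        acc = acc * t + a
    return acc

def softmax_approx(scores):
    """Softmax using the piecewise-polynomial exp above."""
    m = max(scores)
    exps = [exp_approx(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Because every segment edge lies on the bin grid, a single table read replaces the comparator chain that a fully adaptive segmentation would need. The polynomial degree, and therefore the number of multiply-adds in the Horner loop, varies per segment.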

Cite this study

Sadheerthan et al. studied this question.

www.synapsesocial.com/papers/69bf390ac7b3c90b18b433e9 — DOI: https://doi.org/10.3390/electronics15061312

Authors

Sanoop Sadheerthan

Min-Jie Hsu

Chih-Hsiang Huang

Journals

Electronics

Institutions

National Taiwan Normal University

Tamkang University
