Low-power NPUs enable on-device LLM inference through efficient integer and fixed-point arithmetic, yet their lack of native exponential support makes the Transformer softmax a critical performance bottleneck. Existing NPU kernels approximate the exponential with uniform piecewise polynomials to enable O(1) SIMD indexing, but this wastes computation by applying the same high-degree arithmetic to every segment. Conversely, fully adaptive approaches maximize statistical fidelity but introduce pipeline stalls due to comparator-based boundary search. To bridge this gap, we propose an attention-distribution-aware softmax that uses Particle Swarm Optimization (PSO) to define non-uniform segments and variable polynomial degrees, prioritizing finer granularity and lower arithmetic complexity in attention-dense regions. To keep lookup efficient, we snap the segment boundaries to a 128-bin LUT, enabling O(1) retrieval of segment parameters without branching. Inference measurements show that this design favors low-degree execution, minimizing exp-kernel overhead. Using TinyLlama-1.1B-Chat as a testbed, the proposed weighted design reduces exp-kernel cycles per call (CPC) by 18.5% versus an equidistant uniform Degree-4 baseline and by 13.1% versus a uniform Degree-3 baseline. These results show that grid-snapped, variable-degree approximation can improve softmax efficiency while largely preserving attention ranking fidelity, enabling accurate edge LLM inference.
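The grid-snapped lookup described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the input range [-16, 0], the hand-picked boundaries in `RAW_BOUNDS`, the per-segment `DEGREES`, the use of Taylor coefficients around each segment midpoint in place of PSO-optimized fits, and floating-point arithmetic in place of the fixed-point arithmetic an NPU kernel would use. Only the 128-bin LUT and the idea of non-uniform, variable-degree segments come from the text.

```python
import math

# Assumed input range for (score - max_score); the paper does not state it here.
X_MIN, X_MAX, N_BINS = -16.0, 0.0, 128
BIN_WIDTH = (X_MAX - X_MIN) / N_BINS  # 0.125

# Hypothetical non-uniform boundaries and per-segment degrees (not PSO output).
RAW_BOUNDS = [-16.0, -9.03, -5.1, -2.47, -1.02, -0.49, 0.0]
DEGREES = [4, 4, 4, 3, 2, 2]  # narrow segments near 0 use lower degrees


def build_tables(raw_bounds, degrees):
    """Snap boundaries to the bin grid and build the bin -> segment LUT."""
    edges = [round((b - X_MIN) / BIN_WIDTH) for b in raw_bounds]
    lut = [0] * N_BINS
    coeffs = []
    for seg, (lo, hi) in enumerate(zip(edges, edges[1:])):
        for b in range(lo, hi):
            lut[b] = seg
        # Stand-in coefficients: Taylor expansion of exp around the midpoint.
        center = X_MIN + 0.5 * (lo + hi) * BIN_WIDTH
        scale = math.exp(center)
        poly = [scale / math.factorial(k) for k in range(degrees[seg] + 1)]
        coeffs.append((center, poly))
    return lut, coeffs


def approx_exp(x, lut, coeffs):
    """Approximate exp(x) for x <= 0; the segment is found by one LUT read."""
    x = min(max(x, X_MIN), X_MAX)
    b = min(int((x - X_MIN) / BIN_WIDTH), N_BINS - 1)  # O(1) bin index
    center, poly = coeffs[lut[b]]                      # no boundary search
    t = x - center
    acc = 0.0
    for c in reversed(poly):                           # Horner evaluation
        acc = acc * t + c
    return acc


def softmax(scores, lut, coeffs):
    """Softmax using the approximate exponential on max-shifted scores."""
    m = max(scores)
    e = [approx_exp(s - m, lut, coeffs) for s in scores]
    total = sum(e)
    return [v / total for v in e]


LUT, COEFFS = build_tables(RAW_BOUNDS, DEGREES)
probs = softmax([2.0, 1.0, 0.5, -3.0], LUT, COEFFS)
```

Because every boundary lies on the bin grid, the segment for any input is a single table read on the bin index rather than a chain of comparisons, which is the branch-free property the abstract attributes to the 128-bin LUT. The loop length in the Horner step depends on the segment's degree, so inputs that fall in low-degree segments do less arithmetic.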
Sadheerthan et al. (Fri,) studied this question.
www.synapsesocial.com/papers/69bf390ac7b3c90b18b433e9 — DOI: https://doi.org/10.3390/electronics15061312
Sanoop Sadheerthan
Min-Jie Hsu
Chih-Hsiang Huang
Electronics
National Taiwan Normal University
Tamkang University