June 3, 2024

Open Access

WiP: Efficient LLM Prefilling with Mobile NPU

Abstract

Large language models (LLMs) play a crucial role in many Natural Language Processing (NLP) tasks, which has prompted their deployment on mobile devices for inference. A significant challenge, however, is the high waiting latency, especially for long prompts. This paper introduces mllm-NPU, the first system to enable efficient on-device LLM prefilling acceleration using on-chip Neural Processing Units (NPUs). Despite the impressive compute capabilities of NPUs, applying them directly to LLM prefilling often falls short. To address this, mllm-NPU incorporates two key techniques: (1) chunk-wise CPU-NPU co-scheduling, which handles the problems of static compute graphs and INT8-only acceleration; and (2) dynamic outlier inference, which addresses the accuracy loss caused by static activation quantization.
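
The two ideas named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the padding scheme, and the rule that treats any activation outside the static INT8 range as an outlier are all assumptions made for illustration. The first function splits a prompt into fixed-length chunks so that a graph with a static input shape could be reused; the other two quantize activations with a fixed scale and keep out-of-range values in floating point.

```python
# Illustrative sketch only; names and policies are hypothetical,
# not taken from mllm-NPU.

def split_into_chunks(tokens, chunk_size, pad_id=0):
    """Split a prompt into fixed-length chunks, padding the last one.

    Returns (chunk, valid_length) pairs so the padding can be ignored later.
    """
    chunks = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        valid = len(chunk)
        chunks.append((chunk + [pad_id] * (chunk_size - valid), valid))
    return chunks


def quantize_with_outliers(activations, scale):
    """Quantize with a fixed (static) scale; keep out-of-range values in float.

    Assumption: a value is an outlier if it does not fit in [-127, 127] * scale.
    """
    limit = 127 * scale
    int8_part, outliers = [], {}
    for i, x in enumerate(activations):
        if abs(x) > limit:
            outliers[i] = x        # would be handled in float, off the INT8 path
            int8_part.append(0)
        else:
            int8_part.append(round(x / scale))
    return int8_part, outliers


def dequantize(int8_part, outliers, scale):
    """Recombine the INT8 path with the float outliers."""
    out = [q * scale for q in int8_part]
    for i, x in outliers.items():
        out[i] = x
    return out
```

For example, `split_into_chunks([1, 2, 3, 4, 5], 2)` yields three chunks of length 2, the last one padded, and `quantize_with_outliers([1.0, -2.0, 100.0], 0.5)` keeps `100.0` as a float outlier while the other two values go through the INT8 path.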

Cite this study

Xu et al. (2024) studied this question.

www.synapsesocial.com/papers/68e66608b6db6435875f29bf — DOI: https://doi.org/10.1145/3662006.3662066

Authors

Daliang Xu

Hao Zhang

Liming Yang

Institutions

Peking University

Beijing University of Posts and Telecommunications

Beijing Jiaotong University
