Large language models (LLMs) play a crucial role in various natural language processing (NLP) tasks, which has prompted their deployment on mobile devices for inference. However, on-device inference suffers from high waiting latency, especially for long prompts. This paper introduces mllm-NPU, the first system enabling efficient on-device LLM prefilling acceleration using on-chip Neural Processing Units (NPUs). Despite the impressive compute capabilities of NPUs, applying them directly to LLM prefilling often falls short. To this end, mllm-NPU incorporates two key techniques: (1) chunk-wise CPU-NPU co-scheduling, which addresses the constraints of static compute graphs and INT8-only acceleration; and (2) dynamic outlier inference, which addresses the accuracy loss caused by static activation quantization.
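The abstract names the two techniques but does not describe how they work. The sketch below only illustrates the general ideas behind them and is not the mllm-NPU implementation. It assumes that a static compute graph requires fixed-length inputs, so a variable-length prompt is cut into padded chunks, and that activations quantized with a fixed scale can be split into an INT8 part and a small set of out-of-range outliers kept in floating point. The names `split_into_chunks`, `quantize_with_outliers`, `dequantize`, and `PAD_ID` are hypothetical.

```python
from typing import Dict, List, Tuple

PAD_ID = 0  # hypothetical padding token id (assumption, not from the paper)


def split_into_chunks(token_ids: List[int], chunk_len: int) -> List[Tuple[List[int], int]]:
    """Cut a prompt into fixed-length chunks so each fits a fixed-shape graph.

    Returns (padded_chunk, number_of_valid_tokens) pairs.
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    chunks = []
    for start in range(0, len(token_ids), chunk_len):
        piece = token_ids[start:start + chunk_len]
        valid = len(piece)
        # Pad the last chunk so every chunk has the same shape.
        chunks.append((piece + [PAD_ID] * (chunk_len - valid), valid))
    return chunks


def quantize_with_outliers(values: List[float], scale: float) -> Tuple[List[int], Dict[int, float]]:
    """Quantize with a fixed (static) scale; keep out-of-range values as float outliers."""
    limit = 127 * scale  # largest magnitude representable in INT8 at this scale
    quantized: List[int] = []
    outliers: Dict[int, float] = {}
    for i, v in enumerate(values):
        if abs(v) > limit:
            outliers[i] = v      # handled separately in floating point
            quantized.append(0)  # placeholder in the INT8 tensor
        else:
            quantized.append(max(-127, min(127, round(v / scale))))
    return quantized, outliers


def dequantize(quantized: List[int], outliers: Dict[int, float], scale: float) -> List[float]:
    """Recombine the INT8 part with the float outliers."""
    out = [q * scale for q in quantized]
    for i, v in outliers.items():
        out[i] = v
    return out
```

In this sketch, a five-token prompt with `chunk_len=2` yields three chunks, the last one padded, and an activation far outside the static range is stored exactly instead of being clipped to the INT8 limit.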
Xu et al. studied this question.
www.synapsesocial.com/papers/68e66608b6db6435875f29bf — DOI: https://doi.org/10.1145/3662006.3662066
Daliang Xu
Hao Zhang
Liming Yang
Peking University
Beijing University of Posts and Telecommunications
Beijing Jiaotong University