This research proposes a hybrid approach that combines linear attention, chunking, and recurrent mechanisms to address the efficiency issues of Large Language Models (LLMs) within the traditional transformer framework. Our approach integrates three key innovations: (1) linear attention, which uses a kernel function mapping to reduce time and space complexity from O(n²) to O(n); (2) dynamic chunk-based processing, which compresses the KV cache by a factor of 5 using mean pooling; and (3) three filtering mechanisms (hard thresholding, adaptive gating, and hierarchical chunking) that filter tokens and reduce computational load. The results show that the approach improves LLM efficiency and performs well on several evaluation benchmarks. Experiments demonstrate that our 3.2B-parameter model achieves excellent performance in multiple benchmark tests, outperforming dense models of similar scale and even matching the performance of larger models on certain tasks. This provides a theoretically grounded and empirically validated framework for efficient LLM optimization.
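The first two ideas in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract does not name the kernel, so the `elu(x) + 1` feature map is an assumption borrowed from common linear-attention practice. The chunk size is fixed at 5 rather than dynamic, the attention is non-causal, and all function names are hypothetical.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1, a strictly positive feature map (assumed kernel, not from the paper).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v):
    # q, k: (n, d); v: (n, d_v). Non-causal kernelized attention.
    # Summing over keys first avoids building the (n, n) score matrix,
    # so the cost grows linearly in the sequence length n.
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v              # (d, d_v): sum over positions of phi(k_j) v_j^T
    z = phi_k.sum(axis=0)         # (d,): normalizer, sum over positions of phi(k_j)
    return (phi_q @ kv) / (phi_q @ z)[:, None]

def quadratic_reference(q, k, v):
    # Same kernelized attention computed the O(n^2) way, for comparison only.
    phi_q, phi_k = feature_map(q), feature_map(k)
    scores = phi_q @ phi_k.T      # (n, n)
    weights = scores / scores.sum(axis=1, keepdims=True)
    return weights @ v

def mean_pool_kv(k, v, chunk=5):
    # Compress cached keys and values by averaging each run of `chunk` positions.
    # Simplification: a ragged tail shorter than `chunk` is dropped.
    usable = (k.shape[0] // chunk) * chunk
    k_c = k[:usable].reshape(-1, chunk, k.shape[1]).mean(axis=1)
    v_c = v[:usable].reshape(-1, chunk, v.shape[1]).mean(axis=1)
    return k_c, v_c

rng = np.random.default_rng(0)
q = rng.normal(size=(20, 8))
k = rng.normal(size=(20, 8))
v = rng.normal(size=(20, 4))

out = linear_attention(q, k, v)
k_c, v_c = mean_pool_kv(k, v, chunk=5)
```

Here `linear_attention` and `quadratic_reference` agree up to floating-point error, which is the reordering that yields the O(n) cost, and `mean_pool_kv` shrinks a 20-position cache to 4 entries, matching the 5x compression ratio stated above.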
Zhang et al. studied this question.
www.synapsesocial.com/papers/69db365c4fe01fead37c484d — DOI: https://doi.org/10.1007/s40747-026-02290-8
Cheng Zhang
Linlin Shen
Yudong Li
Complex & Intelligent Systems
Tsinghua University
Shenzhen University
Zhejiang Lab