The Transformer, with its global self-attention mechanism, has become a foundational architecture for natural language processing and general sequence modeling. However, the quadratic time and space complexity of standard self-attention poses significant computational and memory bottlenecks for long-sequence scenarios. At the same time, the parameter explosion caused by deep stacking limits deployability under resource-constrained conditions. Existing research typically alleviates these issues from two separate directions: one line of work reduces attention complexity through sparsification, low-rank approximation, or kernel methods; another line reduces parameter redundancy via cross-layer parameter sharing or recurrent updates. The problem is that these two technical routes are mostly independent, lacking a unified framework that simultaneously addresses computational efficiency, parameter efficiency, and deep representational power. The proposal of the Transformer and its subsequent efficient variants, including Reformer, Longformer, BigBird, Performer, Linformer, as well as parameter-sharing approaches like Universal Transformer and ALBERT, collectively form the direct background of this work.
Yizhou Huang
Yizhou Huang (Sun,) studied this question.
www.synapsesocial.com/papers/69af95b470916d39fea4d80c — DOI: https://doi.org/10.5281/zenodo.18908525