What question did this study set out to answer?

This research investigates how dense self-attention leads to noise accumulation and rank collapse, affecting token representations.

April 22, 2026 · Open Access

Noise Accumulation and Rank Collapse in Dense Self-Attention: DSALT

Key Points

Formalized the noise accumulation effects in dense self-attention.
Developed DSALT to allow dynamic attention to a local window with landmark tokens.
Utilized a hybrid scoring mechanism for selecting landmark tokens.
Showed that noise accumulation accelerates rank collapse in dense attention.
DSALT reduces time complexity from O(n²d) to O(n(w+k)d), which is near-linear in sequence length n.
Demonstrated improved trade-off between computational efficiency and contextual representation over static sparsity.
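
To give a feel for the complexity claim in the key points, the sketch below counts the multiply-accumulate operations needed to score attention under each bound. It assumes that w is the local window size and k the number of landmark tokens, which this page does not state explicitly. The function names and the values of n, d, w, and k are illustrative, not taken from the paper.

```python
def dense_attention_cost(n: int, d: int) -> int:
    """Dense attention: each of n tokens scores all n tokens, O(n^2 d)."""
    return n * n * d


def sparse_attention_cost(n: int, w: int, k: int, d: int) -> int:
    """Window plus landmarks: each token scores at most w + k tokens, O(n (w + k) d)."""
    return n * (w + k) * d


# Illustrative sizes only (assumptions, not values reported in the paper).
n, d = 8192, 64   # sequence length and head dimension
w, k = 256, 32    # assumed window size and landmark count

dense = dense_attention_cost(n, d)
sparse = sparse_attention_cost(n, w, k, d)
ratio = dense / sparse  # simplifies to n / (w + k)
print(f"dense={dense:,} sparse={sparse:,} ratio={ratio:.1f}x")
```

With these numbers the ratio is n / (w + k) = 8192 / 288, roughly 28 times fewer score computations. The gap grows linearly with n as long as w + k stays fixed.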

Abstract

Dense self-attention is structurally unable to ignore irrelevant tokens. The softmax operation guarantees that every token in the context window receives a strictly positive attention weight, regardless of semantic relevance. Across H heads and L layers, these unavoidable contributions accumulate and systematically corrupt token representations. Building on the rank collapse analysis of Dong et al., we formalize this accumulation and show that it not only accompanies rank collapse but actively accelerates it. Dense attention, by construction, drives its own representational homogenization. We conjecture that this progressive loss of semantic distinctiveness contributes to hallucinations in large language models, offering a mechanistic explanation for long-context performance degradation observed by Liu et al. To address the issue at its source, we propose DSALT (Dynamic Sparse Attention with Landmark Tokens). Instead of attending to all previous tokens, each token attends to an adaptive local window along with a small set of globally informative landmark tokens. These landmarks are selected dynamically through a hybrid energy-based scoring mechanism that balances latent feature magnitude with the informational impact of value vectors. DSALT removes irrelevant token contributions before they are computed, preserves essential long-range dependencies without redundancy, and reduces time complexity from quadratic to near-linear, specifically from O(n²d) to O(n(w+k)d). This yields a more effective trade-off between computational efficiency and contextual representation compared to static sparsity patterns.
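
The two attention patterns described in the abstract can be contrasted in a minimal pure-Python sketch. This is not the author's implementation. The landmark score `hybrid_energy` is a hypothetical stand-in (a weighted mix of key and value norms), the window is fixed rather than adaptive, and all function names and the `alpha` parameter are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def attend(q_i, keys, values, idx):
    """Scaled dot-product attention of one query over the tokens listed in idx."""
    scale = math.sqrt(len(q_i))
    weights = softmax([dot(q_i, keys[j]) / scale for j in idx])
    dim = len(values[0])
    out = [sum(w * values[j][c] for w, j in zip(weights, idx)) for c in range(dim)]
    return out, weights


def dense_support(i):
    """Dense causal attention: token i attends to every token 0..i."""
    return list(range(i + 1))


def hybrid_energy(keys, values, alpha=0.5):
    """Hypothetical landmark score: a mix of key norm and value norm (an assumption)."""
    return [alpha * norm(k) + (1 - alpha) * norm(v) for k, v in zip(keys, values)]


def sparse_support(i, energy, window, num_landmarks):
    """Local window ending at i, plus the highest-energy tokens before that window."""
    lo = max(0, i - window + 1)
    earlier = sorted(range(lo), key=lambda j: energy[j], reverse=True)
    landmarks = sorted(earlier[:num_landmarks])
    return landmarks + list(range(lo, i + 1))
```

Under dense attention, every weight returned by `attend(q_i, keys, values, dense_support(i))` is strictly positive, because the exponential inside softmax never reaches zero for finite scores. That is the property the abstract identifies as the source of accumulated noise. With `sparse_support`, tokens outside the window and the landmark set are never scored at all, so each query sees at most `window + num_landmarks` tokens. The sort inside `sparse_support` is there for readability only. This sketch shows the attention pattern and does not itself achieve the O(n(w+k)d) bound.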

Authors

Leonardo Cofone
