What question did this study set out to answer?

May 3, 2026 · Open Access

Noise Accumulation and Rank Collapse in Dense Self-Attention: DSALT

Key Points

The study analyzes noise accumulation and rank collapse in dense self-attention and proposes DSALT (Dynamic Sparse Attention with Landmark Tokens) as a remedy.
It formalizes the connection between noise accumulation and representational homogenization.
DSALT uses structured attention: a local window plus a small set of dynamically selected global landmark tokens.
The computational cost of DSALT is analyzed against standard dense attention.
By restricting attention to structured subsets of tokens, DSALT is designed to preserve representational distinctiveness and limit noise accumulation.
Attention complexity drops from O(n²d) to O(n(w + k)d), which improves scalability to long sequences.
The link to long-context degradation is offered as a mechanistic interpretation and remains conjectural; DSALT is intended to preserve long-range dependencies.
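
To make the complexity reduction concrete, the sketch below counts the leading-order multiply-accumulate terms for both regimes. It assumes that w is the local window size and k the number of landmark tokens, which the abstract implies but does not define; the function names and the example values of n, d, w, and k are illustrative choices, not figures from the paper.

```python
def dense_cost(n: int, d: int) -> int:
    """Leading-order cost of dense attention: every token attends to all n tokens."""
    return n * n * d


def sparse_cost(n: int, d: int, w: int, k: int) -> int:
    """Leading-order cost when each token attends to w local and k landmark tokens."""
    return n * (w + k) * d


# Illustrative values only (assumed, not taken from the paper).
n, d, w, k = 8192, 64, 256, 64
ratio = dense_cost(n, d) / sparse_cost(n, d, w, k)
print(f"{ratio:.1f}")  # 25.6, i.e. n / (w + k)
```

The ratio simplifies to n / (w + k), so the saving grows linearly with sequence length as long as w and k stay fixed.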

Abstract

Dense self-attention assigns strictly positive weights to all tokens within the context window via the softmax operation, regardless of their semantic relevance. As a result, representations aggregate information from both relevant and irrelevant tokens, and this effect compounds across heads and layers in deep Transformer architectures. Building on the rank collapse analysis of Dong et al., we formalize how such accumulation contributes to progressive representational homogenization in dense attention models. We further hypothesize that this loss of representational distinctiveness may be related to degradation phenomena observed in long-context language modeling, including hallucination-like behavior and performance drops reported in prior work. While this connection remains conjectural, we provide a mechanistic interpretation grounded in information propagation through attention layers. To address these limitations, we propose DSALT (Dynamic Sparse Attention with Landmark Tokens), a sparse attention mechanism that combines local windowed attention with a small set of dynamically selected global landmark tokens. Landmark selection is performed using a hybrid energy-based scoring function that balances representational magnitude and output relevance. By restricting attention to structured subsets of tokens, DSALT reduces redundant interactions while preserving long-range dependencies. From a computational perspective, DSALT reduces the attention complexity from O(n²d) to O(n(w + k)d), enabling more efficient scaling to long sequences while maintaining expressive contextual modeling.
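
The attention pattern described in the abstract can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the author's implementation: the abstract does not give the hybrid scoring formula, so `select_landmarks` uses a hypothetical stand-in (a weighted sum of the standardized value-vector norm and the key's dot product with the mean query), and the window is taken to be symmetric and non-causal. All function and parameter names here are invented for the example.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def select_landmarks(q, k, v, num_landmarks, alpha=0.5):
    """Pick global landmark indices with a HYPOTHETICAL hybrid score.

    Assumption: 'magnitude' is the L2 norm of each value vector and
    'relevance' is each key's dot product with the mean query. The
    paper's actual scoring function is not given in the abstract.
    """
    def standardize(a):
        return (a - a.mean()) / (a.std() + 1e-8)

    magnitude = np.linalg.norm(v, axis=1)
    relevance = k @ q.mean(axis=0)
    score = alpha * standardize(magnitude) + (1 - alpha) * standardize(relevance)
    return np.sort(np.argsort(score)[-num_landmarks:])


def sparse_attention(q, k, v, window, num_landmarks, alpha=0.5):
    """Each query attends to tokens within window // 2 positions plus the landmarks."""
    n, d = q.shape
    landmarks = select_landmarks(q, k, v, num_landmarks, alpha)
    half = window // 2
    out = np.zeros_like(v)
    max_attended = 0
    for i in range(n):
        local = np.arange(max(0, i - half), min(n, i + half + 1))
        idx = np.union1d(local, landmarks)  # sorted, de-duplicated indices
        weights = softmax(q[i] @ k[idx].T / np.sqrt(d))
        out[i] = weights @ v[idx]
        max_attended = max(max_attended, len(idx))
    return out, landmarks, max_attended


rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 32, 8))
out, landmarks, max_attended = sparse_attention(q, k, v, window=4, num_landmarks=3)
```

Because each query touches at most `window + 1 + num_landmarks` keys, the per-query work is independent of sequence length, which is where the O(n(w + k)d) scaling comes from. Setting the window wide enough to cover the whole sequence recovers ordinary dense attention, which makes a convenient sanity check.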

Authors

Leonardo Cofone
