Large language models based on the Transformer decoder architecture perform multi-head self-attention over all previous tokens in the context window. We argue that this dense attention mechanism introduces a systematic form of noise: because of the softmax operation, every token contributes a strictly positive weight to every other token's representation, regardless of semantic relevance. This noise accumulates across attention heads and layers, progressively corrupting token representations. We show that this accumulation accelerates the rank collapse phenomenon established by Dong et al., in which self-attention networks converge doubly exponentially with depth to a rank-1 matrix. We conjecture that this mechanism is a structural cause of hallucinations in large language models, a conjecture consistent with empirical evidence on long-context degradation. To address this, we propose Dynamic Sparse Attention with Landmark Tokens (DSALT), a mechanism that replaces dense attention with an adaptive local window augmented by a small set of globally informative tokens. DSALT reduces noise at its source while preserving essential long-range dependencies, with implications for both model reliability and computational efficiency at scale.
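The contrast between dense and sparse attention described in the abstract can be illustrated with a toy example. The following is a minimal sketch, not the DSALT implementation: the function names, the fixed `window` size, and the fixed `landmarks` set are all assumptions made for illustration, whereas the abstract describes an adaptive window. It shows that dense softmax gives every previous token a strictly positive weight, while a local-window-plus-landmark mask gives exactly zero weight to tokens outside the allowed set.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparse_causal_weights(scores, query_pos, window, landmarks):
    """Attention weights for one query under a local-window + landmark mask.

    Hypothetical simplification: `window` is a fixed size and `landmarks`
    is a fixed set of positions; the abstract describes an adaptive window.
    """
    allowed = [
        j for j in range(query_pos + 1)  # causal: only positions j <= query_pos
        if query_pos - j < window or j in landmarks
    ]
    kept = softmax([scores[j] for j in allowed])
    weights = [0.0] * (query_pos + 1)  # masked positions stay exactly zero
    for j, w in zip(allowed, kept):
        weights[j] = w
    return weights

# Toy attention logits of the last query (position 7) against positions 0..7.
scores = [2.0, -3.0, -3.0, -3.0, -3.0, 1.0, 1.5, 2.5]

dense = softmax(scores)
sparse = sparse_causal_weights(scores, query_pos=7, window=3, landmarks={0})

print("dense :", [round(w, 4) for w in dense])
print("sparse:", [round(w, 4) for w in sparse])
```

In the dense case, positions 1 to 4 receive small but nonzero weights despite their low scores; in the sparse case they receive none, and the remaining weights are renormalized over the local window (positions 5 to 7) and the single landmark (position 0).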
Leonardo Cofone studied this question.
www.synapsesocial.com/papers/69cb6589e6a8c024954b98c4 — DOI: https://doi.org/10.5281/zenodo.19312826
Leonardo Cofone