Dense self-attention assigns strictly positive weights to all tokens within the context window via the softmax operation, regardless of their semantic relevance. As a result, representations aggregate information from both relevant and irrelevant tokens, and this effect compounds across heads and layers in deep Transformer architectures. Building on the rank collapse analysis of Dong et al., we formalize how such accumulation contributes to progressive representational homogenization in dense attention models. We further hypothesize that this loss of representational distinctiveness may be related to degradation phenomena observed in long-context language modeling, including hallucination-like behavior and performance drops reported in prior work. While this connection remains conjectural, we provide a mechanistic interpretation grounded in information propagation through attention layers. To address these limitations, we propose DSALT (Dynamic Sparse Attention with Landmark Tokens), a sparse attention mechanism that combines local windowed attention with a small set of dynamically selected global landmark tokens. Landmark selection is performed using a hybrid energy-based scoring function that balances representational magnitude and output relevance. By restricting attention to structured subsets of tokens, DSALT reduces redundant interactions while preserving long-range dependencies. From a computational perspective, DSALT reduces the attention complexity from O(n²d) to O(n(w + k)d), enabling more efficient scaling to long sequences while maintaining expressive contextual modeling.
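The attention pattern described above can be illustrated with a minimal sketch. It is not the DSALT implementation: the abstract does not give the scoring formula, so the hybrid score used here (a weighted sum of squared key norm as the "magnitude" term and squared value norm as a stand-in for "output relevance"), the mixing weight `alpha`, the symmetric non-causal window, and the function names `select_landmarks` and `sparse_attention` are all assumptions made for illustration. The symbols are read as n = sequence length, w = local window size, k = number of landmark tokens, d = representation dimension.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def select_landmarks(keys, values, k, alpha=0.5):
    # ASSUMPTION: hypothetical hybrid score, not the paper's formula.
    # alpha weights key magnitude; (1 - alpha) weights value magnitude,
    # used here as a crude proxy for output relevance.
    scores = [alpha * dot(kv, kv) + (1 - alpha) * dot(v, v)
              for kv, v in zip(keys, values)]
    order = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])

def sparse_attention(queries, keys, values, w, k, alpha=0.5):
    n, d = len(queries), len(queries[0])
    landmarks = select_landmarks(keys, values, k, alpha)
    half = w // 2  # ASSUMPTION: symmetric window, no causal mask
    outputs, supports = [], []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        # Each query sees its local window plus the global landmarks.
        allowed = sorted(set(range(lo, hi)) | set(landmarks))
        logits = [dot(queries[i], keys[j]) / math.sqrt(d) for j in allowed]
        weights = softmax(logits)
        out = [sum(wt * values[j][c] for wt, j in zip(weights, allowed))
               for c in range(len(values[0]))]
        outputs.append(out)
        supports.append(allowed)
    return outputs, supports

if __name__ == "__main__":
    # Tiny toy input: 6 tokens, dimension 2.
    x = [[0.1 * i, 1.0 - 0.1 * i] for i in range(6)]
    out, sup = sparse_attention(x, x, x, w=2, k=1)
    print(sup[0], out[0])
```

In this sketch each query attends to at most w + 1 + k positions rather than all n, so the number of query-key dot products grows as n(w + k) instead of n², which is where a reduction of the form O(n²d) to O(n(w + k)d) comes from. Tokens outside the window and the landmark set receive exactly zero weight, in contrast to the strictly positive weights of dense softmax attention.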
Leonardo Cofone
www.synapsesocial.com/papers/69f6e6e68071d4f1bdfc78a2 — DOI: https://doi.org/10.5281/zenodo.19954051