Dense self-attention is structurally unable to ignore irrelevant tokens. The softmax operation guarantees that every token in the context window receives a strictly positive attention weight, regardless of semantic relevance. Across H heads and L layers, these unavoidable contributions accumulate and systematically corrupt token representations. Building on the rank collapse analysis of Dong et al., we formalize this accumulation and show that it not only accompanies rank collapse but actively accelerates it. Dense attention, by construction, drives its own representational homogenization. We conjecture that this progressive loss of semantic distinctiveness contributes to hallucinations in large language models, offering a mechanistic explanation for long-context performance degradation observed by Liu et al. To address the issue at its source, we propose DSALT (Dynamic Sparse Attention with Landmark Tokens). Instead of attending to all previous tokens, each token attends to an adaptive local window along with a small set of globally informative landmark tokens. These landmarks are selected dynamically through a hybrid energy-based scoring mechanism that balances latent feature magnitude with the informational impact of value vectors. DSALT removes irrelevant token contributions before they are computed, preserves essential long-range dependencies without redundancy, and reduces time complexity from quadratic to near-linear, specifically from O(n²d) to O(n(w+k)d). This yields a more effective trade-off between computational efficiency and contextual representation compared to static sparsity patterns.
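The two claims above, that softmax gives every visible token a strictly positive weight and that a window-plus-landmarks pattern limits each token to at most w + k attended positions, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names are hypothetical, the window size `w` is fixed rather than adaptive, and landmarks are ranked by the L2 norm of their value vectors as a stand-in for the hybrid energy-based score, which the abstract does not specify. It is not the DSALT implementation.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Stable softmax over a list; masked (-inf) entries get weight exactly 0."""
    m = max(s for s in scores if s != -math.inf)
    exps = [math.exp(s - m) if s != -math.inf else 0.0 for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_landmarks(values, k):
    """ASSUMPTION: rank tokens by value-vector L2 norm (stand-in for the paper's score)."""
    norms = [math.sqrt(dot(v, v)) for v in values]
    ranked = sorted(range(len(values)), key=lambda j: norms[j], reverse=True)
    return set(ranked[:k])


def sparse_causal_weights(queries, keys, landmarks, w):
    """Query i sees its last w positions (including i) plus any landmark j <= i."""
    n, d = len(queries), len(queries[0])
    rows = []
    for i in range(n):
        allowed = set(range(max(0, i - w + 1), i + 1))
        allowed |= {j for j in landmarks if j <= i}
        scores = [
            dot(queries[i], keys[j]) / math.sqrt(d) if j in allowed else -math.inf
            for j in range(n)
        ]
        rows.append(softmax(scores))
    return rows


def dense_causal_weights(queries, keys):
    """Dense causal attention is the special case where the window covers everything."""
    return sparse_causal_weights(queries, keys, landmarks=set(), w=len(queries))


random.seed(0)
n, d, w, k = 16, 4, 3, 2  # toy sizes, chosen only for the demo
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
V = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]

dense = dense_causal_weights(Q, K)
landmarks = select_landmarks(V, k)
sparse = sparse_causal_weights(Q, K, landmarks, w)

# Count positions that receive non-zero weight under each scheme.
dense_nonzero = sum(x > 0 for row in dense for x in row)
sparse_nonzero = sum(x > 0 for row in sparse for x in row)
print("dense non-zero weights: ", dense_nonzero)
print("sparse non-zero weights:", sparse_nonzero, "(bound:", n * (w + k), ")")
```

In the dense case every one of the n(n+1)/2 causal positions gets non-zero weight, however small its score. In the sparse case, excluded positions are never scored, so the count is bounded by n(w + k). This mirrors the O(n²d) versus O(n(w+k)d) comparison in the abstract, under the assumptions stated above.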
Leonardo Cofone
www.synapsesocial.com/papers/69e865fd6e0dea528ddea624 — DOI: https://doi.org/10.5281/zenodo.19664011