Large Vision-Language Models (LVLMs) incur a high computational cost in the attention mechanism because of the large number of visual tokens. Token reduction has emerged as a promising way to reduce this cost by eliminating redundant visual tokens. However, existing token reduction methods struggle to preserve task-relevant tokens while eliminating irrelevant ones. The cause is the attention bias of LVLMs: tokens with high attention scores are not always the critical ones. This bias forces existing methods into a dilemma between severe performance degradation and limited inference acceleration. The issue becomes more severe in fine-grained perception tasks, which rely heavily on the fine-grained information stored in specific visual tokens. To address it, we propose FinePruner, an unbiased fine-grained token reduction method that explores the attention patterns of LVLMs at the attention-head level to mitigate the interference of attention biases. Concretely, we first conduct comparative studies to validate the impact of tokens corresponding to visual objects on final task performance; these studies establish that such tokens should be preserved while the others can be pruned. A series of visualizations further reveals how LVLMs' attention biases change across layers and attention heads. Based on these patterns, the FinePruner pipeline is divided into two stages. The first stage, Instruction-Agnostic Clustering, clusters visual tokens into groups according to their embeddings so that the grouping is not affected by attention biases. The second stage, Attention-Refined Pruning, uses a divergence measure to select the attention heads with less bias, and these heads are then used to identify the tokens to preserve. Experiments on VQA benchmarks and fine-grained benchmarks demonstrate that FinePruner achieves better accuracy-efficiency trade-offs than state-of-the-art methods.
The code is available at https://github.com/PKU-ICST-MIPL/FinePruner_TIP2026.
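To make the two-stage idea concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the clustering algorithm, the divergence, or the selection rule, so everything here is an assumption made for illustration: plain k-means on token embeddings stands in for Instruction-Agnostic Clustering, the KL divergence of each head's attention from a uniform distribution stands in for the head-selection criterion (keeping the heads with the largest divergence), and whole clusters are kept or dropped by their mean attention. All function and parameter names (`kmeans`, `select_heads`, `prune_tokens`, `n_keep_clusters`) are hypothetical.

```python
import numpy as np

def kmeans(x, k, iters=10, seed=0):
    """Plain k-means on token embeddings x of shape (N, D); returns labels (N,)."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every token to every center.
        dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for c in range(k):
            members = x[labels == c]
            if len(members):
                centers[c] = members.mean(0)
    return labels

def select_heads(attn, n_heads):
    """attn: (H, N) attention from instruction to visual tokens, per head.
    Assumption: rank heads by KL(p_head || uniform) and keep the largest."""
    p = attn / attn.sum(axis=1, keepdims=True)
    n = attn.shape[1]
    kl = (p * np.log(p * n + 1e-12)).sum(axis=1)
    return np.argsort(kl)[-n_heads:]

def prune_tokens(embeds, attn, n_clusters, n_heads, n_keep_clusters):
    """Return the sorted indices of the visual tokens that are kept."""
    # Stage 1 (assumed form): cluster on embeddings only, ignoring attention.
    labels = kmeans(embeds, n_clusters)
    # Stage 2 (assumed form): score tokens with the selected heads only.
    heads = select_heads(attn, n_heads)
    token_score = attn[heads].mean(axis=0)
    cluster_score = np.full(n_clusters, -np.inf)
    for c in range(n_clusters):
        mask = labels == c
        if mask.any():
            cluster_score[c] = token_score[mask].mean()
    kept_clusters = np.argsort(cluster_score)[-n_keep_clusters:]
    return np.flatnonzero(np.isin(labels, kept_clusters))

# Toy data: 8 visual tokens in two well-separated embedding groups, 3 heads.
t = np.array([0.0, 0.1, 0.2, 0.3, 10.0, 10.1, 10.2, 10.3])
embeds = np.stack([t, t], axis=1)
attn = np.array([
    [0.13] * 4 + [0.12] * 4,   # nearly uniform head, slightly favors group 1
    [0.01] * 4 + [0.24] * 4,   # focused on group 2
    [0.00] * 4 + [0.25] * 4,   # focused on group 2
])
kept = prune_tokens(embeds, attn, n_clusters=2, n_heads=2, n_keep_clusters=1)
print(kept.tolist())
```

In this toy setting the nearly uniform head is discarded, so the cluster that the two focused heads attend to is the one that survives. A real system would take `embeds` and `attn` from an LVLM's visual token embeddings and attention maps, and the choice of divergence and its direction would need to follow the paper.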
Zishuo Wang
Xiangtian Zheng
Y Peng
IEEE Transactions on Image Processing
Peking University
www.synapsesocial.com/papers/69fd7ddcbfa21ec5bbf060ca — DOI: https://doi.org/10.1109/tip.2026.3687073