What question did this study set out to answer?

This research aims to improve the efficiency of large vision-language models while maintaining their performance by reducing irrelevant visual tokens without sacrificing task-critical information.

May 8, 2026

FinePruner: Unbiased Attention-Head-Level Fine-grained Token Reduction for Efficient Inference of Large Vision-Language Models

Key Points

The authors developed FinePruner, a two-stage method consisting of Instruction-Agnostic Clustering and Attention-Refined Pruning.
Comparative studies showed that tokens corresponding to visual objects determine task performance and should be preserved, while other tokens can be pruned.
Experiments on VQA and fine-grained benchmarks validate the method's effectiveness.
FinePruner achieves better accuracy-efficiency tradeoffs than state-of-the-art token reduction methods.
The method reduces computational cost while preserving task-relevant tokens.
It mitigates attention biases, which vary across layers and attention heads, by working at the attention-head level.

Abstract

Large Vision-Language Models (LVLMs) suffer from the high computational cost of the attention mechanism caused by the large number of visual tokens. Token reduction has emerged as a promising approach to reduce this complexity by eliminating redundant visual tokens. However, existing token reduction methods struggle to preserve task-relevant tokens while eliminating irrelevant ones. This is due to the attention biases of LVLMs, where tokens with high attention scores are not always the critical ones. Such biases force existing methods into a dilemma: they face either high performance degradation or limited inference acceleration. The issue becomes more severe in fine-grained perception tasks, which rely heavily on the fine-grained information stored in specific visual tokens. To address this issue, we propose an unbiased fine-grained token reduction method named FinePruner, which explores the attention patterns of LVLMs at the attention-head level to mitigate the interference of attention biases. Concretely, we first conduct comparative studies on how tokens corresponding to visual objects affect final task performance; these studies show that such tokens should be preserved while the others can be pruned. In addition, a series of visualizations reveals how the attention biases of LVLMs change across layers and attention heads. Based on these patterns, the FinePruner pipeline is divided into two stages. The first stage, named Instruction-Agnostic Clustering, clusters visual tokens into groups according to their embeddings, so that the grouping is not affected by attention biases. The second stage, named Attention-Refined Pruning, uses a divergence measure to select attention heads with less bias, and these heads are then used to identify the tokens to preserve. Experiments on VQA benchmarks and fine-grained benchmarks demonstrate that FinePruner achieves better accuracy-efficiency tradeoffs than state-of-the-art methods.
The code is available at https://github.com/PKU-ICST-MIPL/FinePruner_TIP2026.
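
To make the two-stage idea in the abstract concrete, the following is a minimal sketch in Python, not the authors' implementation. Everything specific in it is an assumption made for illustration: plain k-means stands in for the clustering step, the KL divergence between a head's attention distribution and the mean distribution over heads stands in for the divergence criterion, and the function names (`kmeans`, `kl_to_mean`, `prune_tokens`) and parameters (`n_clusters`, `n_heads_kept`, `n_clusters_kept`) are hypothetical. The abstract does not specify any of these choices.

```python
import numpy as np


def kmeans(x, k, iters=10, seed=0):
    """Plain k-means on float embeddings x of shape (n, d); returns labels (n,).

    Assumption: the abstract only says tokens are clustered by embedding;
    k-means is used here purely as a stand-in.
    """
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]  # copy of k rows
    for _ in range(iters):
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:
                centers[j] = members.mean(0)
    return labels


def kl_to_mean(attn, eps=1e-9):
    """KL(head || mean over heads) for attn of shape (heads, n), rows sum to 1.

    Assumption: a low value is treated as "less biased"; the paper's actual
    divergence criterion may differ.
    """
    mean = attn.mean(0, keepdims=True)
    return (attn * (np.log(attn + eps) - np.log(mean + eps))).sum(1)


def prune_tokens(embeddings, attn, n_clusters, n_heads_kept, n_clusters_kept):
    """Return sorted indices of the visual tokens to keep."""
    # Stage 1 (instruction-agnostic): group tokens using embeddings only.
    labels = kmeans(embeddings, n_clusters)
    # Stage 2: keep the heads with the lowest divergence, then score tokens
    # by the attention those heads assign to them.
    heads = np.argsort(kl_to_mean(attn))[:n_heads_kept]
    token_score = attn[heads].mean(0)
    # Score each cluster by its mean token score; empty clusters rank last.
    cluster_score = np.array([
        token_score[labels == j].mean() if np.any(labels == j) else -np.inf
        for j in range(n_clusters)
    ])
    keep_clusters = np.argsort(cluster_score)[::-1][:n_clusters_kept]
    return np.flatnonzero(np.isin(labels, keep_clusters))


# Usage with random stand-in data: 64 visual tokens, 16-dim embeddings, 8 heads.
rng = np.random.default_rng(1)
emb = rng.normal(size=(64, 16))
logits = rng.normal(size=(8, 64))
attn = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
kept = prune_tokens(emb, attn, n_clusters=8, n_heads_kept=2, n_clusters_kept=3)
print(len(kept), "of", len(emb), "tokens kept")
```

In this sketch tokens are kept or dropped as whole clusters, so tokens with similar embeddings share the same fate regardless of their individual attention scores. This is one simple way to combine the two stages; the released code should be consulted for the actual procedure.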

Authors

Zishuo Wang

Xiangtian Zheng

Y Peng

Journals

IEEE Transactions on Image Processing

Institutions

Peking University