What type of study is this?

This is a quantitative study.

October 20, 2025 | Open Access

Helping CLIP See Both the Forest and the Trees: A Decomposition and Description Approach

Key Points

CLIP's performance improves with stochastic multi-crop augmentation, which helps the model attend to localized image regions.
Systematic experiments reveal limitations in CLIP's capacity to interpret fine-grained visual details, relying too much on global patterns.
The proposed method activates CLIP's potential for localized feature analysis, recalibrating its attention mechanism effectively.
Extensive evaluations demonstrate that the proposed Decomposition and Description (D&D) approach improves performance in zero-shot, few-shot, and test-time adaptation settings.

Abstract

Vision-Language Models (VLMs) like CLIP achieve cross-modal semantic alignment through contrastive learning, exhibiting robust zero-shot generalization. Traditional prompt engineering, however, predominantly relies on coarse-grained category labels, neglecting fine-grained local semantics. Existing approaches assume that VLMs inherently recognize localized visual details and attempt to enhance classification by augmenting text prompts with attribute descriptors generated by large language models. However, our systematic experiments reveal critical limitations: CLIP's strong bias toward global image patterns hinders its ability to process localized visual descriptors. To address this fundamental constraint, we propose a simple, effective, and plug-and-play solution that enables CLIP to "See Both the Forest and the Trees." Specifically, we employ stochastic multi-crop augmentation to activate CLIP's latent capacity for localized feature analysis. By cropping only partial regions, the approach effectively constrains the model's receptive field and recalibrates its attention mechanism, thereby mitigating its inherent bias. We evaluate the proposed method, Decomposition and Description (D&D), under zero-shot, few-shot, and test-time adaptation settings, and extensive experiments demonstrate that D&D achieves promising performance.
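
The stochastic multi-crop idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify crop sizes, the number of crops, or how per-crop predictions are combined, so everything below is an assumption made for illustration: the helper names (`sample_crop_boxes`, `average_crop_scores`, `classify_with_crops`), the scale range, and the simple averaging rule are hypothetical and are not taken from the paper. The scoring function is left as a caller-supplied callable, standing in for a model that compares a cropped region against text descriptions.

```python
import random
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), in pixels


def sample_crop_boxes(width: int, height: int, n_crops: int,
                      min_scale: float = 0.3, max_scale: float = 0.9,
                      seed=None) -> List[Box]:
    """Sample random crop boxes that each cover only part of the image.

    `scale` is the fraction of each side that is kept; the range here is
    an illustrative assumption, not a value from the paper.
    """
    rng = random.Random(seed)
    boxes = []
    for _ in range(n_crops):
        scale = rng.uniform(min_scale, max_scale)
        crop_w = max(1, int(width * scale))
        crop_h = max(1, int(height * scale))
        left = rng.randint(0, width - crop_w)   # inclusive bounds
        top = rng.randint(0, height - crop_h)
        boxes.append((left, top, left + crop_w, top + crop_h))
    return boxes


def average_crop_scores(per_crop_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average class scores over crops (an assumed aggregation rule)."""
    n_crops = len(per_crop_scores)
    n_classes = len(per_crop_scores[0])
    return [sum(scores[c] for scores in per_crop_scores) / n_crops
            for c in range(n_classes)]


def classify_with_crops(width: int, height: int,
                        score_crop: Callable[[Box], Sequence[float]],
                        n_crops: int = 8, seed: int = 0) -> int:
    """Score every sampled crop and return the index of the best class.

    `score_crop` is a hypothetical stand-in for an image-text similarity
    model applied to the cropped region.
    """
    boxes = sample_crop_boxes(width, height, n_crops, seed=seed)
    scores = average_crop_scores([score_crop(box) for box in boxes])
    return max(range(len(scores)), key=scores.__getitem__)
```

Each box uses the (left, top, right, bottom) ordering, so it could be passed to an image library's crop routine before encoding. Because every crop covers only part of the image, the scoring model sees local regions rather than the global scene, which is the intuition the abstract describes.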

Authors

Lei Xue

Zongbo Han

Guangyu Wang
