The contrastive vision–language pre-trained model CLIP, trained on large-scale open-vocabulary image–text pairs, has recently demonstrated remarkable zero-shot generalization across diverse downstream image tasks. This has allowed numerous models built on the “image pre-training followed by fine-tuning” paradigm to achieve promising results on standard video benchmarks. However, as models scale up, fully fine-tuning a model for each specific task becomes costly in terms of training and storage. In this work, we propose a novel method that adapts CLIP to the video domain for efficient recognition without modifying the original pre-trained parameters. Specifically, we introduce temporal prompts that enable pre-trained models lacking temporal cues to reason about the dynamic content of videos. Then, instead of learning the prompt vectors directly, we generate them with a lightweight reparameterization encoder, which allows domain-specific adjustment and yields more generalizable representations. Furthermore, we predefine a Chinese label dictionary to enhance video representations through co-supervision by Chinese and English semantics. Extensive experiments on video action recognition benchmarks show that our method achieves competitive or even better performance than most existing methods with fewer trainable parameters in both general and few-shot recognition scenarios.
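As a rough illustration of the prompt reparameterization idea mentioned in the abstract, the sketch below generates prompt vectors from small seed embeddings through a bottleneck MLP with a residual connection, then prepends them to per-frame token sequences. The abstract does not specify the encoder architecture or how the temporal prompts are built, so every name, dimension, and design choice here (`reparameterize`, `prepend_prompts`, the two-layer MLP, sharing prompts across frames) is an assumption made for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes are illustrative assumptions, not values from the paper.
NUM_PROMPTS, EMBED_DIM, HIDDEN_DIM = 4, 16, 8
NUM_FRAMES, TOKENS_PER_FRAME = 8, 5


def init_encoder(embed_dim, hidden_dim):
    """Create weights for a hypothetical two-layer bottleneck MLP."""
    return {
        "w1": rng.normal(0.0, 0.02, (embed_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(0.0, 0.02, (hidden_dim, embed_dim)),
        "b2": np.zeros(embed_dim),
    }


def reparameterize(seed, enc):
    """Map seed embeddings to prompt vectors: MLP output plus a residual."""
    hidden = np.maximum(seed @ enc["w1"] + enc["b1"], 0.0)  # ReLU
    return seed + hidden @ enc["w2"] + enc["b2"]


def prepend_prompts(frame_tokens, prompts):
    """Prepend the same prompt vectors to every frame's token sequence."""
    num_frames = frame_tokens.shape[0]
    tiled = np.broadcast_to(prompts, (num_frames,) + prompts.shape)
    return np.concatenate([tiled, frame_tokens], axis=1)


# Only the seed embeddings and the encoder would be trained in this sketch.
seed = rng.normal(0.0, 0.02, (NUM_PROMPTS, EMBED_DIM))
encoder = init_encoder(EMBED_DIM, HIDDEN_DIM)

# Random stand-in for token embeddings produced by a frozen backbone.
frame_tokens = rng.normal(size=(NUM_FRAMES, TOKENS_PER_FRAME, EMBED_DIM))

prompts = reparameterize(seed, encoder)
tokens = prepend_prompts(frame_tokens, prompts)
print(prompts.shape, tokens.shape)
```

In this sketch the backbone's token embeddings pass through unchanged and only the seed embeddings and encoder weights would receive gradients, which is one plausible way to keep the trainable parameter count small while leaving the pre-trained parameters intact.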
Lujuan Deng
Jieqing Tan
Fangmei Liu
Electronics
Zhengzhou University of Light Industry
Deng et al. studied this question.
www.synapsesocial.com/papers/68e5b5fab6db64358754efa0 — DOI: https://doi.org/10.3390/electronics13163348