The emergence of Vision-Language Models (VLMs) such as CLIP (Contrastive Language-Image Pretraining) offers appealing solutions to a range of vision problems, including Dynamic Facial Expression Recognition (DFER). However, most existing approaches face major challenges, particularly the inefficiency of fully fine-tuning the encoders and the complexity of the resulting models. Moreover, some methods suffer from suboptimal performance due to (i) poor alignment between textual and visual representations, and (ii) ineffective temporal modeling. To address these challenges, we propose PE-CLIP, a parameter-efficient fine-tuning (PEFT) framework that adapts CLIP for dynamic facial expression recognition, requiring a significantly reduced number of trainable parameters while maintaining high accuracy. At its core, to enhance efficiency and performance, PE-CLIP introduces two specialized adapters: a Temporal Dynamic Adapter (TDA) and a Shared Adapter (ShA). The TDA is a GRU-based module with a dynamic scaling mechanism; it captures sequential dependencies and adaptively modulates the contribution of each temporal feature, emphasizing the most informative ones and suppressing irrelevant variations. The ShA is a lightweight adapter that refines representations within both the textual and visual encoders, ensuring consistent feature processing while maintaining parameter efficiency. Additionally, we leverage Multi-modal Prompt Learning (MaPLe), which introduces learnable prompts to both the visual inputs and the action unit-based textual descriptions, further improving the semantic alignment between modalities and enabling the efficient adaptation of CLIP for dynamic tasks. We evaluate PE-CLIP on three benchmark datasets, namely DFEW, FERV39K, and AFEW, achieving competitive performance compared to state-of-the-art methods while requiring fewer trainable parameters.
By striking an optimal balance between parameter efficiency and performance, PE-CLIP sets a new benchmark in resource-efficient DFER. The source code of the proposed PE-CLIP will be publicly available at https://github.com/Ibtissam-SAADI/PE-CLIP .
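To make the idea of a GRU-based adapter with dynamic scaling more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the class name `TemporalDynamicAdapterSketch`, the sigmoid scaling head, the residual connection, the zero-initialized up-projection, and the sizes (16 frames, 512-dimensional features, 64 hidden units) are all assumptions chosen for illustration. The abstract only states that the TDA is GRU-based and adaptively modulates the contribution of each temporal feature.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class TemporalDynamicAdapterSketch:
    """Hypothetical GRU-based adapter with a per-frame scaling gate.

    Illustrative only; every design detail here is an assumption,
    not the PE-CLIP implementation.
    """

    def __init__(self, dim, hidden, seed=0):
        rng = np.random.default_rng(seed)

        def init(*shape):
            return rng.normal(0.0, 0.02, size=shape)

        # GRU weights for the update (z), reset (r) and candidate (n) paths.
        self.W = {g: init(dim, hidden) for g in "zrn"}
        self.U = {g: init(hidden, hidden) for g in "zrn"}
        self.b = {g: np.zeros(hidden) for g in "zrn"}
        # Scaling head (assumed): one scalar in (0, 1) per time step.
        self.w_scale = init(hidden)
        self.b_scale = 0.0
        # Up-projection back to the feature size. Zero init (assumed) makes
        # the adapter an identity map before any training.
        self.W_up = np.zeros((hidden, dim))

    def gru(self, x):
        """Run a single-layer GRU over x of shape (T, dim)."""
        h = np.zeros(self.b["z"].shape[0])
        states = []
        for x_t in x:
            z = sigmoid(x_t @ self.W["z"] + h @ self.U["z"] + self.b["z"])
            r = sigmoid(x_t @ self.W["r"] + h @ self.U["r"] + self.b["r"])
            n = np.tanh(x_t @ self.W["n"] + (r * h) @ self.U["n"] + self.b["n"])
            h = (1.0 - z) * n + z * h
            states.append(h)
        return np.stack(states)  # (T, hidden)

    def __call__(self, x):
        h = self.gru(x)
        # Dynamic scaling: weight each frame's temporal feature.
        scale = sigmoid(h @ self.w_scale + self.b_scale)  # (T,)
        delta = (scale[:, None] * h) @ self.W_up          # (T, dim)
        return x + delta, scale                           # residual update


# Stand-in for per-frame features from a frozen visual encoder.
frames = np.random.default_rng(1).normal(size=(16, 512))
adapter = TemporalDynamicAdapterSketch(dim=512, hidden=64)
out, scale = adapter(frames)
```

In a parameter-efficient setup of this kind, only the adapter weights would be trained while the encoder that produces `frames` stays frozen; the per-frame `scale` values show which time steps the adapter weights most heavily.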
Ibtissam Saadi
Abdenour Hadid
Douglas W. Cunningham
ACM Transactions on Multimedia Computing Communications and Applications
Centre National de la Recherche Scientifique
Université de Lille
Brandenburg University of Technology Cottbus-Senftenberg
Saadi et al. (Mon,) studied this question.
www.synapsesocial.com/papers/69706c87b6488063ad5c19c8 — DOI: https://doi.org/10.1145/3786789