What question did this study set out to answer?

The research aims to improve the dynamic facial expression recognition process using a parameter-efficient fine-tuning framework.

January 21, 2026

PE-CLIP: A Parameter-Efficient Fine-Tuning of Vision Language Models for Dynamic Facial Expression Recognition

Key Points

Proposes PE-CLIP, a novel parameter-efficient fine-tuning framework for CLIP.
Introduces a Temporal Dynamic Adapter (TDA) based on GRU for capturing sequential dependencies.
Implements a Shared Adapter (ShA) for consistent feature processing in both textual and visual encoders.
Applies Multi-modal Prompt Learning (MaPLe) to enhance semantic alignment between visual and textual inputs.
Evaluates performance on benchmark datasets DFEW, FERV39K, and AFEW.
Achieves competitive performance compared to state-of-the-art methods while using fewer trainable parameters.
Demonstrates improved alignment between textual and visual representations.
Shows effective temporal modeling for dynamic facial expressions.

Abstract

The emergence of Vision-Language Models (VLMs) like CLIP (Contrastive Language-Image Pretraining) provides appealing solutions to various vision problems, including Dynamic Facial Expression Recognition (DFER). However, most of the proposed approaches face major challenges, particularly the inefficient full fine-tuning of the encoders and the complexity of the models. Moreover, some of the proposed methods struggle with suboptimal performance due to (i) poor alignment between textual and visual representations, and (ii) ineffective temporal modeling. To address these challenges, we propose PE-CLIP, a parameter-efficient fine-tuning (PEFT) framework that elegantly adapts CLIP for dynamic facial expression recognition, requiring a significantly reduced number of trainable parameters while maintaining high accuracy. At its core, to enhance efficiency and performance, PE-CLIP introduces two specialized adapters, namely a Temporal Dynamic Adapter (TDA) and a Shared Adapter (ShA). The temporal dynamic adapter is a GRU-based module with a dynamic scaling mechanism. It captures sequential dependencies and adaptively modulates the contribution of each temporal feature, emphasizing the most informative ones and mitigating irrelevant variations. The shared adapter is a lightweight adapter that refines representations within both the textual and visual encoders, ensuring consistent feature processing while maintaining parameter efficiency. Additionally, we leverage Multi-modal Prompt Learning (MaPLe), which introduces learnable prompts to both visual and action unit-based textual description inputs, further improving the semantic alignment between modalities and enabling the efficient adaptation of CLIP for dynamic tasks. We evaluate our proposed PE-CLIP on three benchmark datasets, namely DFEW, FERV39K, and AFEW, achieving competitive performance compared to state-of-the-art methods while requiring fewer trainable parameters. By striking an optimal balance between parameter efficiency and performance, PE-CLIP sets a new benchmark in resource-efficient DFER. The source code of the proposed PE-CLIP will be publicly available at https://github.com/Ibtissam-SAADI/PE-CLIP.
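
To make the idea of a GRU-based adapter with dynamic scaling more concrete, the following is a minimal pure-Python sketch and not the authors' implementation. The abstract does not specify the architecture, so everything here is an assumption made for illustration: the class name `TemporalAdapterSketch`, the single scalar gate per frame, the residual connection, and the random initialization. It runs a standard GRU over per-frame feature vectors, derives a scale in (0, 1) from each hidden state, and adds the scaled hidden state back onto the frame feature.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(m, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class TemporalAdapterSketch:
    """Hypothetical GRU-based adapter with a per-frame scaling gate.

    Illustrative only: the gate design and residual form are assumptions,
    not details taken from the PE-CLIP paper.
    """

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # GRU weights for the update gate (z), reset gate (r), candidate (n).
        self.W = {g: rand_matrix(dim, dim, rng) for g in "zrn"}  # input weights
        self.U = {g: rand_matrix(dim, dim, rng) for g in "zrn"}  # recurrent weights
        # Assumed scaling head: maps a hidden state to one scalar per frame.
        self.scale_w = [rng.uniform(-0.1, 0.1) for _ in range(dim)]

    def gru_step(self, x, h):
        # Standard GRU update (biases omitted for brevity).
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.W["n"], x), matvec(self.U["n"], rh))]
        return [(1 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

    def forward(self, frames):
        h = [0.0] * self.dim
        outputs, scales = [], []
        for x in frames:
            h = self.gru_step(x, h)
            # Dynamic scale in (0, 1): how much temporal context this frame receives.
            s = sigmoid(sum(w * hi for w, hi in zip(self.scale_w, h)))
            scales.append(s)
            # Residual form: original frame feature plus scaled temporal feature.
            outputs.append([xi + s * hi for xi, hi in zip(x, h)])
        return outputs, scales
```

Calling `TemporalAdapterSketch(dim=3).forward(frames)` on a list of 3-dimensional frame features returns one adapted vector and one scale per frame. In a real setup the weights would be trained while the CLIP encoders stay frozen, which is what keeps the trainable parameter count low.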

Authors

Ibtissam Saadi

Abdenour Hadid

Douglas W. Cunningham

Journals

ACM Transactions on Multimedia Computing Communications and Applications

Institutions

Centre National de la Recherche Scientifique

Université de Lille

Brandenburg University of Technology Cottbus-Senftenberg
