What question did this study set out to answer?

This research aims to improve Arabic sign language recognition by integrating visual and textual information.

March 8, 2026 | Open Access

CLIP-ArASL: A Lightweight Multimodal Model for Arabic Sign Language Recognition

Key Points

Introduced the CLIP-ArASL model combining visual and linguistic features.
Utilized an EfficientNet-B0 image encoder and a MiniLM text encoder.
Employed a hybrid objective with contrastive and cross-entropy losses.
Evaluated the model on two datasets: ArASL2018 and ArASL21L.
Achieved 99.25±0.14% accuracy on ArASL2018 under supervised evaluation.
Achieved 91.51±1.29% accuracy on ArASL21L under supervised evaluation.
Under zero-shot conditions, with 20% of gesture classes withheld during training, reached 55.2±12.15% accuracy on ArASL2018.
Zero-shot accuracy on ArASL21L was 37.6±9.07%, indicating that unseen classes can be recognized from their textual descriptions alone.
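
The hybrid objective named in the key points can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes a symmetric CLIP-style contrastive term over a batch in which the i-th text embedding describes the i-th image and each class appears at most once, plus a standard cross-entropy term on classifier logits. The function names, the temperature of 0.07, and the `ce_weight` parameter are hypothetical choices made for illustration.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of one list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def cross_entropy(logits, targets):
    """Mean negative log-likelihood of the target index in each row."""
    total = sum(log_softmax(row)[t] for row, t in zip(logits, targets))
    return -total / len(targets)


def normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def similarity_logits(image_embs, text_embs, temperature):
    """Cosine similarity between every image and every text, divided by temperature."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]


def hybrid_loss(image_embs, text_embs, class_logits, labels,
                temperature=0.07, ce_weight=1.0):
    """Contrastive alignment loss plus weighted classification loss.

    Assumption: image_embs[i] and text_embs[i] form a matching pair, and
    ce_weight is a hypothetical knob balancing the two terms.
    """
    sims = similarity_logits(image_embs, text_embs, temperature)
    diagonal = list(range(len(sims)))
    sims_transposed = [list(col) for col in zip(*sims)]
    # Symmetric term: image-to-text and text-to-image directions.
    contrastive = 0.5 * (cross_entropy(sims, diagonal)
                         + cross_entropy(sims_transposed, diagonal))
    supervised = cross_entropy(class_logits, labels)
    return contrastive + ce_weight * supervised
```

With correctly paired embeddings the contrastive term is small, and it grows when images are paired with the wrong descriptions, which is the pressure that pulls the two encoders toward a shared embedding space.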

Abstract

Arabic sign language (ArASL) is the primary communication medium for Deaf and hard-of-hearing people across Arabic-speaking communities. Most current ArASL recognition systems are based solely on visual features and do not incorporate linguistic or semantic information that could improve generalization and semantic grounding. This paper introduces CLIP-ArASL, a lightweight CLIP-style multimodal approach for static ArASL letter recognition that aligns visual hand gestures with bilingual textual descriptions. The approach integrates an EfficientNet-B0 image encoder with a MiniLM text encoder to learn a shared embedding space using a hybrid objective that combines contrastive and cross-entropy losses. This design supports supervised classification on seen classes and zero-shot prediction on unseen classes using textual class representations. The proposed approach is evaluated on two public datasets, ArASL2018 and ArASL21L. Under supervised evaluation, recognition accuracies of 99.25±0.14% and 91.51±1.29% are achieved, respectively. Zero-shot performance is assessed by withholding 20% of gesture classes during training and predicting them using only their textual descriptions. In this setting, accuracies of 55.2±12.15% on ArASL2018 and 37.6±9.07% on ArASL21L are obtained. These results show that multimodal vision–language alignment supports semantic transfer and enables recognition of unseen classes.
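
The zero-shot protocol described in the abstract can be sketched as follows. This is an illustrative reading, not the authors' code. It assumes that the withheld classes are chosen at random and that an unseen image is assigned to the class whose text embedding has the highest cosine similarity to the image embedding. Whether candidates are restricted to the withheld classes is not stated in the abstract, so the caller decides which classes to pass in. All function names and the seed parameter are hypothetical.

```python
import math
import random


def split_classes(class_names, unseen_fraction=0.2, seed=0):
    """Randomly withhold a fraction of classes; returns (seen, unseen)."""
    rng = random.Random(seed)
    shuffled = list(class_names)
    rng.shuffle(shuffled)
    n_unseen = max(1, round(len(shuffled) * unseen_fraction))
    return shuffled[n_unseen:], shuffled[:n_unseen]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def zero_shot_predict(image_emb, text_embs_by_class):
    """Return the class name whose text embedding is closest to the image embedding."""
    return max(text_embs_by_class,
               key=lambda name: cosine(image_emb, text_embs_by_class[name]))
```

In use, `text_embs_by_class` would hold the text-encoder output for each candidate class description and `image_emb` the image-encoder output for a test gesture. No images of the withheld classes are needed to make a prediction.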

Authors

Naif Alasmari
