Arabic sign language (ArASL) is the primary communication medium for Deaf and hard-of-hearing people across Arabic-speaking communities. Most current ArASL recognition systems are based solely on visual features and do not incorporate linguistic or semantic information that could improve generalization and semantic grounding. This paper introduces CLIP-ArASL, a lightweight CLIP-style multimodal approach for static ArASL letter recognition that aligns visual hand gestures with bilingual textual descriptions. The approach integrates an EfficientNet-B0 image encoder with a MiniLM text encoder to learn a shared embedding space using a hybrid objective that combines contrastive and cross-entropy losses. This design supports supervised classification on seen classes and zero-shot prediction on unseen classes using textual class representations. The proposed approach is evaluated on two public datasets, ArASL2018 and ArASL21L. Under supervised evaluation, recognition accuracies of 99.25±0.14% and 91.51±1.29% are achieved, respectively. Zero-shot performance is assessed by withholding 20% of gesture classes during training and predicting them using only their textual descriptions. In this setting, accuracies of 55.2±12.15% on ArASL2018 and 37.6±9.07% on ArASL21L are obtained. These results show that multimodal vision–language alignment supports semantic transfer and enables recognition of unseen classes.
Building similarity graph...
Analyzing shared references across papers
Loading...
Naif Alasmari
Building similarity graph...
Analyzing shared references across papers
Loading...
Naif Alasmari (Sat,) studied this question.
www.synapsesocial.com/papers/69ada8a1bc08abd80d5bbd65 — DOI: https://doi.org/10.3390/app16052573