What question did this study set out to answer?

The research aims to develop an efficient model for sign language recognition suited for low-power microcontrollers.

April 10, 2026 | Open Access

Efficient Word-Level Sign Language Recognition Using Quantized Spatiotemporal Deep Learning for Low-Power Microcontrollers

Key Points

Introduced S3D-Conv1D architecture for isolated word-level recognition
Utilized separable spatiotemporal processing tailored for microcontroller constraints
Restricted the architecture to operators with native INT8 support in TensorFlow Lite, CMSIS-NN, and NNoM
Evaluated accuracy across multiple datasets, including WLASL100 and SemLex100
Applied quantization-aware training, which improved stability, particularly at larger vocabulary scales
Achieved 98.96% float32 accuracy on WLASL100 with stable cross-dataset generalization (82.5% on SemLex100)
After INT8 quantization, maintained 98.7% accuracy on WLASL100 while reducing model size to 883 KB
Ultralight variant compressed to 24.7 KB with 98.5% accuracy on WLASL100
S3D-Conv1D retains real-time performance within a sub-1 MB footprint, the smallest INT8 size among the evaluated architectures
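To make the size argument behind separable spatiotemporal processing concrete, the sketch below compares the parameter count of one dense 3D convolution with a factorized pair: a spatial convolution followed by a temporal one. The channel widths and the 3x3x3 kernel are illustrative assumptions. This page does not list the actual S3D-Conv1D layer shapes, and the helper names are hypothetical.

```python
def conv3d_params(c_in, c_out, kt, kh, kw):
    """Weights plus biases of a dense 3D convolution."""
    return c_in * c_out * kt * kh * kw + c_out


def separable_params(c_in, c_mid, c_out, kt, kh, kw):
    """Spatial (kh x kw) convolution followed by a temporal (kt) convolution."""
    spatial = c_in * c_mid * kh * kw + c_mid
    temporal = c_mid * c_out * kt + c_out
    return spatial + temporal


# Assumed layer shape: 32 input channels, 64 output channels, 3x3x3 kernel.
full = conv3d_params(32, 64, 3, 3, 3)          # 55360 parameters
sep = separable_params(32, 64, 64, 3, 3, 3)    # 30848 parameters

print(f"dense 3D conv: {full} parameters")
print(f"separable pair: {sep} parameters")
```

For this assumed shape the factorized pair needs roughly 56% of the parameters of the dense layer. The saving grows with kernel size, since the dense layer scales with kt * kh * kw while the factorized pair scales with kh * kw + kt.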

Abstract

Deploying efficient sign language recognition models on edge devices advances inclusive, affordable, and privacy-preserving human–computer interaction. Yet most state-of-the-art architectures target server-class hardware and fail under the strict memory, computation, and energy constraints of microcontrollers. This work introduces S3D-Conv1D, a separable spatiotemporal architecture for isolated word-level sign language recognition, tailored for TinyML deployment. While the idea of separating spatial and temporal processing has been explored in earlier models, the novelty here lies in a deployment pipeline designed from the outset for microcontroller-class constraints: every operator has native INT8 support in TensorFlow Lite, CMSIS-NN, and NNoM; the architecture achieves full integer-only execution with competitive accuracy; and the evaluation scale (100 and 300 classes) substantially exceeds prior TinyML sign language recognition studies. Evaluations show that S3D-Conv1D achieves 98.96% float32 accuracy on WLASL100 with stable cross-dataset generalization (82.5% on SemLex100). After INT8 quantization, accuracy remains high (98.7% on WLASL100) while the model compresses to 883 KB, the smallest footprint among all evaluated architectures. An ultralight variant further reduces size to 24.7 KB while sustaining 98.5% accuracy on WLASL100 and 77.2% on WLASL300. Quantization-aware training improves stability, particularly at larger vocabulary scales. Among baselines, S3D achieves strong performance but compresses negligibly (30.3 MB) because it relies on operators that are not quantization-friendly. The MobileNet variant generalizes better, with 99.4% accuracy on WLASL100 and 97.6% on SemLex100, but remains large at 2.71 MB in INT8 form. CNN + RNN and e-LSTM depend on unsupported recurrent or attention operators. In contrast, S3D-Conv1D meets all operator compatibility requirements and delivers full INT8 execution with a compact sub-1 MB footprint and real-time performance.
These results demonstrate that competitive word-level sign language recognition is achievable under embedded constraints when architectural design prioritizes quantization stability, operator compatibility, and deployment feasibility from the outset.
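As a rough illustration of what INT8 quantization does to a set of weights, the sketch below applies generic symmetric per-tensor quantization in plain Python. It is not the authors' pipeline: the weight values are made up, the function names are hypothetical, and real toolchains such as TensorFlow Lite add details (per-channel scales, activation zero-points) that are omitted here.

```python
def quantize_symmetric(values):
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the INT8 codes."""
    return [q * scale for q in quantized]


# Made-up weights, for illustration only.
weights = [0.42, -1.27, 0.05, 0.9, -0.33]
codes, scale = quantize_symmetric(weights)
restored = dequantize(codes, scale)

# The round-trip error of each weight is bounded by half a quantization step.
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_err)
```

Each code fits in one byte instead of the four bytes of a float32 value, which is where the roughly fourfold storage reduction of INT8 models comes from. The rounding error introduced per weight is what quantization-aware training teaches the network to tolerate.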

Authors

Samuel Longwani Kimpinde

Peter Olukanmi

Journals

Algorithms

Institutions

University of Johannesburg
