March 3, 2026 | Open Access

Vision-tactile guided text generation using a lightweight transformer decoder for enhancing accessibility of the visually impaired

Key Points

Accurate texture recognition improves cognitive awareness and usability for the visually impaired, reducing cognitive burden.
The lightweight transformer decoder model generates short descriptive keywords for faster interpretation and feedback.
Evaluation on the touch-vision-language dataset shows superior performance in contextual text generation and classification.
The model addresses data imbalance issues, enabling a more scalable solution for real-time assistive interaction.

Abstract

Assistive technologies play an essential role in promoting independence and improving quality of life for people with visual impairment. Despite advances in artificial intelligence (AI), current smart assistive systems still have limitations that restrict their usefulness in practical settings. They are ineffective at delivering real-time, context-aware environmental understanding because they integrate visual, tactile, and linguistic cues inadequately. These challenges hinder cognitive awareness, delay feedback generation, and limit deployment on lightweight platforms. Notably, large, complex models demand substantial computation, which can impair their usability, affordability, and processing speed. The primary objective of this study is to convey essential visual-tactile information through a short descriptive keyword, rather than to describe the entire scene, so that visually impaired users can understand material properties quickly. This study designs a lightweight transformer decoder-based text generation (TDTG) model that fuses tactile and visual signals to generate a short descriptive keyword for a texture, enabling accurate texture recognition without the need for large models. This short, precise output is much simpler to interpret through audio feedback, thereby enhancing usability and reducing cognitive burden. To mitigate data imbalance and enhance generalization, a class-specific deep convolutional generative adversarial network augments underrepresented texture categories. The TDTG framework is evaluated on the touch-vision-language (TVL) dataset and compared with existing models on both generation and classification. It attains a superior balance between contextual text quality, lightweight architecture, and multimodal adaptability, providing a practical and scalable solution for real-time assistive interaction and environmental awareness enhancement for the visually impaired.
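
The class-specific augmentation step mentioned in the abstract can be illustrated with a small sketch. The abstract does not say how many synthetic samples are generated per class, so the code below assumes, purely for illustration, that each underrepresented texture category is topped up to the size of the largest one. The function name `synthetic_quota` and the texture labels are hypothetical and are not taken from the paper or the TVL dataset.

```python
from collections import Counter


def synthetic_quota(labels, target=None):
    """Return how many synthetic samples each class would need to reach `target`.

    Assumption for this sketch: when no target is given, every class is
    topped up to the size of the largest class.
    """
    counts = Counter(labels)
    if target is None:
        target = max(counts.values())
    # Classes already at or above the target need no synthetic samples.
    return {cls: max(0, target - n) for cls, n in counts.items()}


# Hypothetical texture labels, not drawn from the TVL dataset.
labels = ["smooth"] * 6 + ["rough"] * 3 + ["bumpy"] * 1
quota = synthetic_quota(labels)
print(quota)
```

In a pipeline of this kind, each per-class quota would tell a generator trained on that class how many images to synthesize before training the text generation model.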

Authors

Raniyah Wazirali

Journals

Complex & Intelligent Systems

SHILAP Revista de lepidopterología

Institutions

Saudi Electronic University
