March 23, 2026 | Open Access

Attention-Enhanced Vision-Language Framework for Educational Visual Question Answering

Key Points

The study aims to improve educational visual question answering (Edu-VQA) by enhancing visual and language comprehension.
Developed the Attention-Guided Vision-Language Transformer (AG-VLT) framework.
Utilized dual encoders for embedding images and textual questions into a shared space.
Implemented cross-modal attention to focus on relevant image regions based on question tokens.
Employed multi-layer transformer modules for refining embeddings.
AG-VLT outperformed existing models in accuracy and interpretability.
Demonstrated improved alignment between visual comprehension and language reasoning in educational contexts.

Abstract

Combining vision and language comprehension is critical for educational applications, especially for interpreting visual content and answering questions about it. Educational visual question answering (Edu-VQA) systems aim to support learners by providing informed answers to questions about images, diagrams, illustrations, and charts, thereby making learning environments more interactive. However, most current models struggle when visual regions and textual questions are poorly aligned, which results in incomplete or incorrect answers. Many models rely on a purely global representation of an image, without the fine-grained cross-modal interactions required to interpret complex educational content. In this paper, we develop an Attention-Guided Vision-Language Transformer (AG-VLT) framework to address these challenges. AG-VLT uses dual encoders to embed images and textual questions into a shared embedding space and employs a cross-modal attention mechanism to selectively attend to relevant regions of the image in response to the tokens of a question. Multi-layer transformer modules further refine these embeddings so that the model can learn the complex relationships between visual comprehension and language reasoning. Our experimental results demonstrate that AG-VLT outperforms current methods in both accuracy and interpretability, bridging the gap between visual comprehension and language reasoning in educational applications.
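
The cross-modal attention step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the question tokens and image regions have already been embedded into a shared space by the dual encoders, it uses plain scaled dot-product attention, and it omits the learned projection matrices and multi-layer transformer refinement a real model would include. The function names and toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_modal_attention(question_tokens, image_regions):
    """Let each question token (query) attend over image regions (keys and values).

    Returns (attended, weights):
      attended[i]   is the attention-weighted mix of region vectors for token i
      weights[i][j] is how strongly token i attends to region j
    Assumption: both inputs already live in the same embedding space.
    """
    dim = len(image_regions[0])
    scale = math.sqrt(dim)
    attended, weights = [], []
    for token in question_tokens:
        # Similarity of this token to every image region, scaled by sqrt(dim).
        scores = [dot(token, region) / scale for region in image_regions]
        w = softmax(scores)
        # Weighted sum of region vectors, one output dimension at a time.
        mixed = [
            sum(w[j] * image_regions[j][k] for j in range(len(image_regions)))
            for k in range(dim)
        ]
        weights.append(w)
        attended.append(mixed)
    return attended, weights


# Hypothetical toy embeddings in a shared 4-dimensional space.
question_tokens = [
    [1.0, 0.0, 0.0, 0.0],  # token aligned with region 0
    [0.0, 1.0, 0.0, 0.0],  # token aligned with region 1
]
image_regions = [
    [4.0, 0.0, 0.0, 0.0],
    [0.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 4.0, 0.0],
]

attended, weights = cross_modal_attention(question_tokens, image_regions)
```

In this toy setup each token places most of its attention on the region it is aligned with. The per-token weight rows are also the kind of signal that can be inspected to see which regions influenced an answer, which is one plausible reading of the interpretability claim above.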

Authors

Roohee Khan

Sapna Bawankar

Journals

Procedia Computer Science

Institutions

Kalinga University
