Combining vision and language comprehension is critical for educational applications, especially for interpreting visual content and answering questions about it. Educational visual question answering (Edu-VQA) systems aim to support learners by providing informed answers to questions about images, diagrams, illustrations, and charts, thereby making learning environments more interactive. However, most current models struggle when visual regions and textual questions are poorly aligned, which leads to incomplete or incorrect answers. Many models rely on a purely global representation of an image and lack the fine-grained cross-modal interactions required to interpret complex educational content. In this paper, we develop an Attention-Guided Vision-Language Transformer (AG-VLT) framework to address these challenges. AG-VLT uses dual encoders to embed images and textual questions into a shared embedding space, and employs a cross-modal attention mechanism to selectively attend to the image regions relevant to each token in a question. Multi-layer transformer modules further refine these embeddings so that the model can learn the complex relationships between visual content and language. Our experimental results demonstrate that AG-VLT outperforms current methods in both accuracy and interpretability, bridging the gap between visual comprehension and language-based reasoning in educational applications.
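The cross-modal attention step described in the abstract can be illustrated with a minimal sketch. This is not the AG-VLT implementation, which the abstract does not specify. It assumes plain scaled dot-product attention with no learned projections, in which question-token embeddings act as queries and image-region embeddings act as keys and values. The function name `cross_modal_attention` and the toy embeddings are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_modal_attention(token_embs, region_embs):
    """Let each question token attend over all image regions.

    Assumption: both modalities are already embedded in a shared space
    of dimension d, and no learned query/key/value projections are used.
    Returns (weights, attended): one weight row and one context vector
    per question token.
    """
    d = len(region_embs[0])
    scale = math.sqrt(d)
    weights, attended = [], []
    for query in token_embs:
        # Similarity of this token to every image region.
        scores = [dot(query, key) / scale for key in region_embs]
        row = softmax(scores)
        # Context vector: attention-weighted sum of region embeddings.
        context = [
            sum(w * region[i] for w, region in zip(row, region_embs))
            for i in range(d)
        ]
        weights.append(row)
        attended.append(context)
    return weights, attended


# Toy data (hypothetical): three image regions and two question tokens.
regions = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tokens = [[4.0, 0.0, 0.0], [0.0, 0.0, 4.0]]
weights, attended = cross_modal_attention(tokens, regions)
```

Each row of `weights` sums to 1 and shows which regions a given token focuses on. Per-token, per-region weights of this kind are what make an attention-based model inspectable, in contrast to a single global image representation.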
Roohee Khan
Sapna Bawankar
Procedia Computer Science
Kalinga University
Khan et al. studied this question.
www.synapsesocial.com/papers/69c0df0bfddb9876e79c152d — DOI: https://doi.org/10.1016/j.procs.2026.01.013