Traditional visual question answering (VQA) tasks focus on surface-level image-text matching, whereas visual text question answering (VTQA) tasks require deeper cross-modal reasoning. Current Transformer-based models are insufficient at selecting effective features. To address these issues, this paper proposes a new cross-media reasoning network, VTFCGNet, that integrates self-attention in the Fourier frequency domain and the spatial domain with graph attention mechanisms. The network adaptively weights feature interactions between modalities, achieves deep fusion of the image, text, and question modalities, and overcomes limitations of existing models on VTQA tasks. First, VTFCGNet uses an entity extraction network (VTFC-Net) to extract key entities in both the Fourier frequency domain and the spatial domain, which reduces interference from redundant features relative to the traditional self-attention mechanism. Second, a cross-media reasoning network (CRG-Net) performs multi-step cross-media reasoning, significantly improving the capture of fine-grained features and the modeling of cross-modal relationships relative to traditional VQA models. Finally, comprehensive experiments on the VTQA and VQA v2 datasets, using both grid-level visual features and region-level visual features from region proposals, validate the strong performance of VTFCGNet. VTFCGNet achieved top accuracies of 71.93% on VQA v2 test-dev and 75.83% on the VTQA test set (English version).
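The abstract does not give the equations behind VTFC-Net, so the following is only a rough sketch of the general idea it names: processing the same token features once with spatial-domain self-attention and once with a Fourier-domain filter, then mixing the two branches with a learned gate. Every function name, the low-pass filter, the `keep_ratio` parameter, and the sigmoid gate are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_self_attention(x, wq, wk, wv):
    # Standard scaled dot-product self-attention over n tokens of width d.
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def fourier_filter(x, keep_ratio=0.5):
    # Hypothetical frequency branch: FFT along the token axis, keep only the
    # lowest frequencies, then transform back. This is an assumed stand-in
    # for "screening" redundant features, not the paper's operator.
    spec = np.fft.rfft(x, axis=0)
    n_keep = max(1, int(np.ceil(keep_ratio * spec.shape[0])))
    spec[n_keep:] = 0
    return np.fft.irfft(spec, n=x.shape[0], axis=0)

def gated_fusion(x, wq, wk, wv, wg, keep_ratio=0.5):
    # Assumed adaptive weighting: a sigmoid gate computed from both branches
    # decides, per feature, how much of each branch to keep.
    a = spatial_self_attention(x, wq, wk, wv)
    f = fourier_filter(x, keep_ratio)
    gate = 1.0 / (1.0 + np.exp(-(np.concatenate([a, f], axis=-1) @ wg)))
    return gate * a + (1.0 - gate) * f

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 8, 4  # 8 tokens (e.g. image regions plus words), 4 features each
    x = rng.normal(size=(n, d))
    wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))
    wg = rng.normal(size=(2 * d, d))
    print(gated_fusion(x, wq, wk, wv, wg).shape)
```

In this sketch the gate produces a per-feature convex combination of the two branches, which is one simple way to realize "adaptive weighting"; the actual model may weight modalities quite differently.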
Yujie Huo
Weng Howe Chan
Song Yu
Neural Computing and Applications
Central South University
University of Electronic Science and Technology of China
University of Technology Malaysia
Huo et al. studied this question.
www.synapsesocial.com/papers/69a7660bbadf0bb9e87db73a — DOI: https://doi.org/10.1007/s00521-025-11721-5