March 3, 2026 | Open Access

VTFCGNet: a novel cross-modal reasoning network integrating Fourier self-attention and graph attention for visual text question answering

Key Points

VTFCGNet improves visual text question answering by combining self-attention in the Fourier frequency domain with self-attention in the spatial domain.
The model achieved a top accuracy of 75.83% on the VTQA test set (English version) and 71.93% on VQA v2 test-dev.
The cross-media reasoning network (CRG-Net) performs multi-step reasoning to capture fine-grained relationships between features.
Experiments with both grid-level and region-level visual features demonstrate the model's robustness and efficiency.

Abstract

Traditional visual question answering (VQA) tasks focus on surface-level image-text matching, whereas visual text question answering (VTQA) tasks require deeper cross-modal reasoning. Current Transformer-based models are limited in their ability to select effective features. To address these issues, this paper proposes a new cross-media reasoning network (VTFCGNet) that integrates self-attention in the Fourier frequency domain and the spatial domain with graph attention mechanisms. The network adaptively weights the feature interactions between modalities, achieves deep fusion of the image, text, and question modalities, and overcomes the limitations of existing models on VTQA tasks. First, VTFCGNet extracts key entities with an entity extraction network (VTFC-Net) that operates in both the Fourier frequency domain and the spatial domain, reducing the interference of redundant features compared with the traditional self-attention mechanism. Second, a cross-media reasoning network (CRG-Net) performs multi-step cross-media reasoning, which significantly improves the model's ability to capture fine-grained features and to model cross-modal relationships compared with traditional VQA models. Finally, comprehensive experiments on the VTQA and VQA v2 datasets, using both grid-level visual features and region-level visual features of region proposals, validate the performance of VTFCGNet. VTFCGNet achieved top accuracies of 71.93% and 75.83% on the VQA v2 test-dev and VTQA test (English version) datasets, respectively.
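The abstract does not give the layer definitions, so the following is only a rough sketch of the two ideas it names: mixing a frequency-domain branch with spatial self-attention, and repeating an attention step over a graph for multi-step reasoning. Every function name, the low-pass filter, the fixed `gate` weight, and the use of plain dot-product attention restricted to graph edges are assumptions made for illustration. They are not the authors' VTFC-Net or CRG-Net.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; -inf entries receive zero weight.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def spatial_self_attention(x):
    # Scaled dot-product self-attention with identity projections
    # (a simplification: real layers use learned Q/K/V matrices).
    d = x.shape[-1]
    return softmax(x @ x.T / np.sqrt(d)) @ x

def fourier_branch(x, keep=0.5):
    # Hypothetical frequency-domain filter: FFT along the token axis,
    # keep the lowest `keep` fraction of frequencies, transform back.
    n = x.shape[0]
    spec = np.fft.rfft(x, axis=0)
    cutoff = max(1, int(np.ceil(keep * spec.shape[0])))
    spec[cutoff:] = 0
    return np.fft.irfft(spec, n=n, axis=0)

def fused_block(x, gate=0.5):
    # Assumed fusion rule: a fixed convex combination of both branches.
    # The paper describes adaptive weighting; a learned gate would go here.
    return gate * fourier_branch(x) + (1.0 - gate) * spatial_self_attention(x)

def graph_attention_step(nodes, adj):
    # Dot-product attention restricted to graph edges. `adj` is a 0/1
    # matrix that must include self-loops so every row has an edge.
    d = nodes.shape[-1]
    scores = nodes @ nodes.T / np.sqrt(d)
    scores = np.where(adj > 0, scores, -np.inf)
    return softmax(scores) @ nodes

def multi_step_reasoning(nodes, adj, steps=3):
    # Repeating the step lets information travel across several edges.
    for _ in range(steps):
        nodes = graph_attention_step(nodes, adj)
    return nodes

rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 8))          # 6 tokens, 8 features each
adj = np.eye(6) + np.eye(6, k=1) + np.eye(6, k=-1)  # chain graph with self-loops
entities = fused_block(tokens)
reasoned = multi_step_reasoning(entities, adj, steps=3)
print(entities.shape, reasoned.shape)
```

In this sketch the frequency branch acts as a simple low-pass filter over the token sequence, which is one plausible reading of "reducing the interference of redundant features"; the actual filtering used in the paper may differ.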

Authors

Yujie Huo

Weng Howe Chan

Song Yu

Journals

Neural Computing and Applications

Institutions

Central South University

University of Electronic Science and Technology of China

University of Technology Malaysia
