What question did this study set out to answer?

The aim is to explore the relationship between human gaze scanpaths and fine-grained task descriptions using GTANet.

May 6, 2026

Learning Alignments of Human Gaze and Fine-grained Task Descriptions

Key Points

The study explores the relationship between human gaze scanpaths and fine-grained task descriptions.
It introduces GTANet, a model that aligns gaze features with task descriptions.
A patch-based gaze encoder captures spatiotemporal gaze features that reflect visual context.
The approach is validated on gaze-to-question and question-to-gaze retrieval using the AiR and MHUG datasets.
GTANet outperforms baseline methods on all Recall@K metrics.
The improvements hold in both retrieval directions, supporting a strong link between gaze and task descriptions.

Abstract

We propose GTANet – a novel approach to learning the alignments between human gaze scanpaths and fine-grained task descriptions in vision-language tasks. While the influence of tasks on gaze is well known, the relationship between gaze scanpaths and fine-grained task descriptions remains largely unexplored. GTANet addresses this gap by aligning encoded spatiotemporal gaze features with text descriptions. We utilize a patch-based gaze encoder to generate gaze features that reflect visual contexts, and a multimodal feature mixer to fuse the gaze features and the task descriptions, capturing cross-modal alignment. To validate our method, we introduce two novel tasks: gaze-to-question and question-to-gaze retrieval. Experiments on the AiR and MHUG datasets demonstrate that GTANet consistently outperforms baseline methods across all Recall@K metrics, achieving substantial improvements in both retrieval directions. These results confirm the strong link between human gaze and fine-grained task descriptions, thus validating the effectiveness of our approach.
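To make the Recall@K evaluation concrete, here is a minimal sketch of how the metric could be computed for both retrieval directions from a gaze-by-question similarity matrix. It is not the authors' evaluation code. The function names, the toy scores, and the one-to-one pairing (gaze sample i matches question i) are assumptions made for illustration; the actual datasets may pair several scanpaths with one question.

```python
from typing import List, Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of queries whose true match ranks within the top k.

    Assumption of this sketch: similarity[i][j] scores query i against
    candidate j, and the correct candidate for query i is at index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the true match: how many candidates score higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


def transpose(matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """Swap query and candidate roles to evaluate the opposite direction."""
    return [list(column) for column in zip(*matrix)]


# Toy similarity scores (made-up numbers): rows are gaze scanpaths,
# columns are questions.
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]

gaze_to_question_r1 = recall_at_k(sim, k=1)
question_to_gaze_r1 = recall_at_k(transpose(sim), k=1)
print(f"gaze-to-question R@1: {gaze_to_question_r1:.3f}")
print(f"question-to-gaze R@1: {question_to_gaze_r1:.3f}")
```

In this toy matrix, two of three queries retrieve their true match first in each direction, so Recall@1 is 2/3 both ways and Recall@2 is 1.0. Transposing the matrix is what turns gaze-to-question retrieval into question-to-gaze retrieval.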

Cite This Study

Nishiyasu et al. studied this question.

synapsesocial.com/papers/69fadaab03f892aec9b1e5f6 https://doi.org/10.1145/3803535
