April 15, 2026 | Open Access

Advances in deep learning-based eye movement modelling: Multimodal alignment in natural language understanding and cognitive screening

Key Points

The research aims to improve understanding of eye movements and their applications in cognitive tasks and NLP using deep learning.
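Developed deep learning sequence models that jointly encode unaggregated eye movements and their stimuli: video for ADHD detection during natural viewing, text for scanpath modeling during reading.
Used transfer learning to cope with limited gaze data in ADHD detection, where datasets typically contain only hundreds of samples.
Investigated whether simulated gaze data can improve NLP task performance.
Neural networks that process unaggregated eye movements outperformed traditional approaches based on aggregated features and reached state-of-the-art performance on these tasks.
Pre-training on a related task with more labeled data and then fine-tuning yielded better ADHD detection than training from scratch.
LMs augmented with synthetic gaze data performed comparably to LMs augmented with real human gaze data.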

Abstract

Humans actively move their eyes to engage with visual stimuli during complex daily activities such as reading and scene viewing. These eye movements serve as a window into the mind and brain, reflecting the cognitive processes that unfold as individuals comprehend visual stimuli. They provide insights into what individuals focus on, how they process and understand information, and even what they might be thinking or feeling. Consequently, eye movements have been studied extensively in cognitive psychology to understand the cognitive processes underlying visual interaction. More recently, they have gained attention in computer science, where they are leveraged to enhance technological applications, for example by making machine-learning-based language models (LMs) exhibit more human-like linguistic behavior.

In this thesis, we address two key challenges that hinder progress in eye movement research and advance research in three applications: attention-deficit/hyperactivity disorder (ADHD) detection based on eye movements, human scanpath modeling during reading, and leveraging eye movements for natural language processing (NLP).

The first challenge lies in cross-modal sequential alignment and encoding. Eye movement sequences are informative but highly complex, which makes them difficult to analyze and interpret, and their intricate interactions with various types of stimuli add to this complexity. Traditional approaches often rely on aggregated eye movement features. Some aggregate eye movement events over longer periods, resulting in measures such as mean fixation duration or saccade peak velocity per second; others derive so-called reading measures for each word of the linguistic stimulus, such as total fixation duration or regression probability. While these aggregations simplify analysis, they risk discarding the sequential information inherent in eye movements and their dynamic interactions with the underlying stimuli. To address this, we explore the advantages of embedding both the sequential dynamics of eye movements and their associated stimuli into deep learning-based sequence models. Our frameworks align and integrate the two modalities, which we demonstrate in two applications: ADHD detection during natural viewing with video stimuli, and human scanpath modeling during reading with textual stimuli. Our findings show that neural networks processing unaggregated eye movements together with sequential stimuli surpass traditional approaches based on aggregated features and achieve state-of-the-art performance across these tasks.

The second challenge stems from the scarcity of eye movement data. Collecting high-quality eye movement data is resource-intensive, so human gaze data is scarce. This scarcity makes it difficult to develop high-capacity machine learning models, and it makes relying on gaze data for the input stimuli at deployment time unrealistic for most use cases. We tackle this challenge with two approaches. (i) In ADHD detection, where available datasets are typically limited to hundreds of samples, we adopt transfer learning to maximize data utility. Specifically, we pre-train the model on a related task with a larger amount of labeled data and then fine-tune it on the target setting. Our findings show that this approach yields better performance than training the model from scratch on the target task. (ii) In gaze-augmented NLP models, text corpora are often abundant but gaze data remains scarce. To bridge this gap, we investigate whether simulated gaze data generated by a gaze model can enhance NLP task performance. Our experiments show that an LM augmented with this synthetic gaze data performs comparably to an LM augmented with real human gaze data, which demonstrates the practical utility of synthetic gaze data for NLP applications. Building on this, we evaluate LMs augmented or supervised with synthetic gaze across a broad range of NLP tasks and datasets, including those involving extensive text corpora. Our findings highlight that even in the era of highly capable large LMs, gaze data remains a valuable resource for enhancing LMs or enriching textual representations, particularly in low-resource settings and potentially for low-resource languages.
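To make concrete what aggregation can discard, the following is a minimal sketch and not the thesis's actual pipeline. It assumes a hypothetical scanpath format, a list of (word_index, duration_ms) fixations in temporal order, and the function names are illustrative. Two different scanpaths produce identical per-word total fixation durations, while only the ordered sequence reveals that one of them contains a regression.

```python
def total_fixation_duration(scanpath, n_words):
    """Sum fixation durations per word (an aggregated reading measure).

    scanpath: list of (word_index, duration_ms) tuples in temporal order.
    This input format is an assumption made for the sketch.
    """
    totals = [0] * n_words
    for word_index, duration_ms in scanpath:
        totals[word_index] += duration_ms
    return totals


def count_regressions(scanpath):
    """Count transitions that move to an earlier word (order-dependent)."""
    words = [word_index for word_index, _ in scanpath]
    return sum(1 for prev, cur in zip(words, words[1:]) if cur < prev)


# Reader A moves forward to word 2, then regresses to word 1.
scanpath_a = [(0, 200), (1, 150), (2, 250), (1, 100)]
# Reader B refixates word 1 immediately and never moves backward.
scanpath_b = [(0, 200), (1, 100), (1, 150), (2, 250)]

totals_a = total_fixation_duration(scanpath_a, n_words=3)
totals_b = total_fixation_duration(scanpath_b, n_words=3)
regressions_a = count_regressions(scanpath_a)
regressions_b = count_regressions(scanpath_b)

print(totals_a == totals_b)          # the aggregate cannot tell the readers apart
print(regressions_a, regressions_b)  # the ordered sequence can
```

A sequence model that consumes the fixations in order has access to this distinction directly, which is the motivation the abstract gives for working with unaggregated eye movements.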

Authors

Shuwen Deng
