Humans actively move their eyes to engage with visual stimuli during complex daily activities such as reading and scene viewing. These eye movements serve as a valuable window into the mind and brain, reflecting the cognitive processes that unfold as individuals comprehend visual stimuli. They provide insights into what individuals focus on, how they process and understand information, and even what they might be thinking or feeling. Consequently, eye movements have been studied extensively in cognitive psychology to understand the cognitive processes underlying visual interaction. More recently, they have gained attention in computer science, where they are leveraged to enhance technological applications, for example by making machine-learning-based language models (LMs) exhibit more human-like linguistic behavior. In this thesis, we address two key challenges that hinder progress in eye movement research and advance research in three critical applications: attention-deficit/hyperactivity disorder (ADHD) detection based on eye movements, human scanpath modeling during reading, and leveraging eye movements for natural language processing (NLP).

The first challenge lies in cross-modal sequential alignment and encoding. Eye movement sequences are informative but highly complex, which makes them difficult to analyze and interpret. This complexity increases further when their intricate interactions with various types of stimuli are taken into account. Traditional approaches often rely on aggregation: eye movement events are either aggregated over longer periods, yielding measures such as mean fixation duration or saccade peak velocity per second, or summarized for each word of the linguistic stimulus as so-called reading measures, such as total fixation duration or regression probability.
While these aggregations simplify analysis, they risk discarding the sequential information inherent in eye movements and their dynamic interactions with the underlying stimuli. To address this, we explore the advantages of embedding both the sequential dynamics of eye movements and their associated stimuli into deep-learning-based sequence models. Our frameworks align and integrate the two modalities, which we demonstrate in applications including ADHD detection during natural viewing of video stimuli and human scanpath modeling during reading of textual stimuli. Our findings show that neural networks processing unaggregated eye movements together with sequential stimuli surpass traditional approaches based on aggregated features and achieve state-of-the-art performance across these tasks.

The second challenge stems from the scarcity of eye movement data. Collecting high-quality eye movement data is resource-intensive, so human gaze data is scarce. This scarcity makes it difficult to develop high-capacity machine learning models and renders it unrealistic, for most use cases, to rely on gaze data for the input stimuli at deployment time. We tackle this challenge with two approaches. (i) In ADHD detection, where available datasets are typically limited to hundreds of samples, we explore transfer learning techniques to maximize data utility. Specifically, we pre-train the model on a related task with a larger amount of labeled data and then fine-tune it on the target setting. Our findings show that this approach yields better performance than training the model from scratch on the target task. (ii) In gaze-augmented NLP models, NLP tasks often have access to abundant text corpora, whereas gaze data remains scarce.
To bridge this gap, we investigate whether simulated gaze data generated by a model of human gaze can enhance NLP task performance. Our experiments show that our model achieves performance comparable to that of an LM augmented with real human gaze data, demonstrating the practical utility of synthetic gaze data for NLP applications. Building on this, we evaluate our LMs augmented or supervised with synthetic gaze across a broad range of NLP tasks and datasets, including those involving extensive text corpora. Our findings highlight that even in the era of highly capable large LMs, gaze data remains a valuable resource for enhancing LMs or enriching textual representations, particularly in low-resource settings and potentially for low-resource languages.
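
To make the aggregation step described under the first challenge concrete, the following is a minimal sketch of how word-level reading measures might be derived from a scanpath. It is an illustration only, not the thesis's implementation: the function name `reading_measures`, the `(word_index, duration_ms)` input format, and the simplified definition of a regression (a saccade that lands on an earlier word) are all assumptions made for this example.

```python
from collections import defaultdict


def reading_measures(scanpath):
    """Aggregate a scanpath into per-word reading measures.

    scanpath: list of (word_index, fixation_duration_ms) tuples in temporal
    order (hypothetical input format chosen for this sketch).
    """
    total_duration = defaultdict(int)  # summed fixation time per word
    outgoing = defaultdict(int)        # saccades leaving each word
    regressions = defaultdict(int)     # of those, saccades to an earlier word

    for i, (word, duration) in enumerate(scanpath):
        total_duration[word] += duration
        if i + 1 < len(scanpath):
            next_word = scanpath[i + 1][0]
            if next_word != word:
                outgoing[word] += 1
                if next_word < word:
                    regressions[word] += 1

    return {
        word: {
            "total_fixation_duration": total_duration[word],
            # Simplified assumption: share of outgoing saccades that go backward.
            "regression_rate": (
                regressions[word] / outgoing[word] if outgoing[word] else 0.0
            ),
        }
        for word in sorted(total_duration)
    }


# Toy scanpath: the reader skips word 2, jumps back to it, then moves on.
scanpath = [(0, 210), (1, 180), (3, 250), (2, 190), (3, 160), (4, 230)]
measures = reading_measures(scanpath)
```

The output holds one record per word, so the order in which the words were fixated can no longer be recovered from it. This is the loss of sequential information that motivates feeding unaggregated scanpaths to sequence models.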
Shuwen Deng
www.synapsesocial.com/papers/69df2c9ee4eeef8a2a6b1ccb — DOI: https://doi.org/10.25932/publishup-70109