What question did this study set out to answer?

This work aims to explore pixel-level understanding models and their relation to biological neural systems for video processing.

April 15, 2026 (Open Access)

Pixel-level understanding of a world in motion within a neural encoding framework

Key Points

Investigated pixel-level models focusing on optical flow, depth, and semantic segmentation.
Analyzed the responses of the human visual cortex to video stimuli using these models.
Compared class-agnostic and class-aware tasks within a neural encoding framework.
Optical flow models predict responses in early-to-mid visual cortex regions better than in high-level regions.
Depth estimation predicts responses well across all cortical regions, suggesting that its representations encode high-level cues (such as the notion of an object or face) despite not necessarily being trained with high-level semantics.
With the same backbone, image-level models outperform pixel-level ones across most regions.
Class-aware models perform better in later cortical regions than class-agnostic ones, except for the latest depth estimation foundation model, whose intermediate representations perform well in these regions.

Abstract

Extensive literature has drawn comparisons between recordings of biological neurons in the brain and deep neural networks. Such comparative analysis can help us understand biological neural systems, and it can provide insights into the representations learned in deep neural networks. Recent studies examined the human brain’s responses when watching a world in motion through short video stimuli. However, these studies mainly focused on image- or video-level understanding using deep learning models trained for object and action recognition. To the best of our knowledge, no study has focused on pixel-level understanding models and their connection to biological neural systems with an emphasis on video understanding. In this work, we investigate pixel-level understanding models that predict optical flow, depth, and semantic segmentation. We focus on how these models perform in the neural encoding of human visual cortex responses to video stimuli. Moreover, we use this to study class-agnostic vs. class-aware tasks within a neural encoding framework. Our results show that optical flow models tend to predict voxel responses in early-to-mid visual cortex regions better than those in high-level cortical regions. On the other hand, depth estimation performs well across all the regions, indicating that its learned representations encode high-level cues such as the notion of an object or face, despite not necessarily being trained with high-level semantics. Additionally, we show that image-level understanding models surpass pixel-level ones with the same backbone across most of the regions. Finally, we show that class-aware models tend to perform better in later cortical regions than class-agnostic ones, except for the latest depth estimation foundation model, whose intermediate representations perform well in these regions.
These findings can inspire future research in the understanding of flow and depth processing in the brain and their mapping to deep neural networks, in addition to improving deep networks’ representational capabilities.
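
For readers unfamiliar with the term, a neural encoding framework typically maps features extracted from a model to measured brain responses and scores the fit on held-out stimuli. The sketch below illustrates that general idea only. It assumes a linear ridge regression and a per-voxel Pearson correlation score, neither of which is confirmed by this page as the study's actual procedure, and the function names (`fit_ridge`, `voxelwise_pearson`, `encoding_scores`) and the `alpha` parameter are hypothetical.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)


def voxelwise_pearson(Y_true, Y_pred):
    """Pearson correlation between true and predicted responses, per voxel (column)."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    # Guard against constant columns, which would otherwise divide by zero.
    return (yt * yp).sum(axis=0) / np.where(denom == 0, 1.0, denom)


def encoding_scores(features, responses, n_train, alpha=1.0):
    """Fit on the first n_train stimuli and score each voxel on the remainder.

    features:  (n_stimuli, n_features) model activations (assumed precomputed)
    responses: (n_stimuli, n_voxels) measured voxel responses
    """
    X_train, X_test = features[:n_train], features[n_train:]
    Y_train, Y_test = responses[:n_train], responses[n_train:]

    # Standardize features using training statistics only.
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0)
    sd = np.where(sd == 0, 1.0, sd)
    X_train = (X_train - mu) / sd
    X_test = (X_test - mu) / sd

    # Center responses so no explicit intercept term is needed.
    y_mu = Y_train.mean(axis=0)
    W = fit_ridge(X_train, Y_train - y_mu, alpha)
    return voxelwise_pearson(Y_test, X_test @ W + y_mu)
```

Under these assumptions, comparing two models (for example, an optical flow network and a depth network) would amount to calling `encoding_scores` once per model on the same responses and comparing the per-voxel scores within each cortical region.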

Authors

Mai Gamal

Mohamed Rashad

Eman Ehab

Journals

Scientific Reports

Institutions

University of British Columbia

Ain Shams University

American University in Cairo
