An extensive literature has drawn comparisons between recordings of biological neurons in the brain and deep neural networks. Such comparative analysis can help us understand biological neural systems, and it can provide insights into the representations learned by deep neural networks. Recent studies have examined the human brain’s responses to a world in motion, presented through short video stimuli. However, these studies mainly focused on image- or video-level understanding, using deep learning models trained for object and action recognition. To the best of our knowledge, no study has focused on pixel-level understanding models and their connection to biological neural systems with an emphasis on video understanding. In this work, we investigate pixel-level understanding models that predict optical flow, depth, and semantic segmentation. We focus on how well these models perform in the neural encoding of human visual cortex responses to video stimuli. Moreover, we use this setup to compare class-agnostic and class-aware tasks within a neural encoding framework. Our results show that optical flow models tend to predict voxel responses in early-to-mid visual cortex regions better than in high-level cortical regions. Depth estimation, on the other hand, performs well across all regions, indicating that its learned representations encode high-level cues such as the notion of an object or face, despite not necessarily being trained with high-level semantics. Additionally, we show that image-level understanding models surpass pixel-level ones with the same backbone across most regions. Finally, we show that class-aware models tend to perform better in later cortical regions than class-agnostic ones, except for the latest depth estimation foundation model, whose intermediate representations perform well in these regions.
These findings can inspire future research on flow and depth processing in the brain and its mapping to deep neural networks, in addition to improving deep networks’ representational capabilities.
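To make the neural encoding framework mentioned in the abstract concrete, here is a minimal sketch of a voxel-wise encoding model. It assumes a ridge regression from model features to voxel responses, scored by per-voxel Pearson correlation on held-out stimuli. These are common choices in the encoding literature, but the abstract does not state the authors' actual pipeline, so the function names (`fit_ridge`, `voxelwise_pearson`, `encoding_scores`), the regularization strength, and the synthetic data are all hypothetical.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)


def voxelwise_pearson(Y_true, Y_pred):
    """Pearson correlation between true and predicted responses, per voxel (column)."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.linalg.norm(yt, axis=0) * np.linalg.norm(yp, axis=0)
    return (yt * yp).sum(axis=0) / denom


def encoding_scores(feat_train, resp_train, feat_test, resp_test, alpha=1.0):
    """Fit on training stimuli, then score predictions on held-out stimuli."""
    # Standardize features using training statistics only.
    mu = feat_train.mean(axis=0)
    sd = feat_train.std(axis=0) + 1e-8
    X_train = (feat_train - mu) / sd
    X_test = (feat_test - mu) / sd
    # Center responses with the training mean, which acts as the intercept.
    resp_mu = resp_train.mean(axis=0)
    W = fit_ridge(X_train, resp_train - resp_mu, alpha)
    pred = X_test @ W + resp_mu
    return voxelwise_pearson(resp_test, pred)


# Synthetic stand-in data (assumption): rows are video stimuli, feature columns
# come from one layer of some model, response columns are voxels in one region.
rng = np.random.default_rng(0)
n_train, n_test, n_feat, n_vox = 200, 50, 20, 5
feats = rng.standard_normal((n_train + n_test, n_feat))
true_W = rng.standard_normal((n_feat, n_vox))
resp = feats @ true_W + 0.1 * rng.standard_normal((n_train + n_test, n_vox))

scores = encoding_scores(
    feats[:n_train], resp[:n_train], feats[n_train:], resp[n_train:]
)
print(scores)  # one correlation per voxel
```

In a comparison like the one the abstract describes, such scores would be computed separately for each model (for example optical flow, depth, or segmentation features) and then aggregated per cortical region.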
Mai Gamal
Mohamed Rashad
Eman Ehab
Scientific Reports
University of British Columbia
Ain Shams University
American University in Cairo
Gamal et al. studied this question.
www.synapsesocial.com/papers/69df2c01e4eeef8a2a6b104c — DOI: https://doi.org/10.1038/s41598-025-34141-w