What question did this study set out to answer?

The aim is to understand the temporal dynamics of mid-level feature processing in visual perception and how these features relate to sensory and semantic processing.

March 28, 2026 · Open Access

Investigating the temporal dynamics and modelling of mid-level feature representations in humans

Key Points

The aim is to understand the temporal dynamics of mid-level feature processing in visual perception and how these features relate to sensory and semantic processing.
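Stimuli were naturalistic 3D-rendered images and videos with ground-truth annotations for five candidate mid-level features, one low-level feature (edges), and one high-level feature (action identity).
EEG responses were collected during stimulus presentation.
Linearized encoding models were trained to predict EEG responses from the annotations.
CNNs were assessed as models of mid-level feature processing in humans.
Mid-level features were best represented between roughly 100 and 250 ms post-stimulus.
This timing falls between that of low- and high-level features, consistent with a bridging role linking sensory and semantic processing.
CNNs showed a comparable processing order for mid-level features, but not for low- or high-level features, and only for videos, despite their shallower hierarchies.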

Abstract

Visual perception unfolds through a hierarchy of transformations, beginning with the extraction of low-level features, such as edges, and culminating in the representation of high-level features, such as object categories. While the processing of low- and high-level features is well studied, the intermediate transformations, i.e., mid-level features, remain poorly understood. Here, we introduce a stimulus set of naturalistic 3D-rendered images and videos with ground-truth annotations for five candidate mid-level features (reflectance, scene depth, world normals, lighting, and skeleton position) alongside one low-level feature (edges) and one high-level feature (action identity). To determine when these features are processed in the brain, we collected electroencephalography (EEG) responses during stimulus presentation and trained linearized encoding models to predict EEG responses from the annotations. We first showed that candidate mid-level features were best represented between ~100 and 250 ms post-stimulus, a window that falls between those of low- and high-level features and is consistent with a bridging role linking sensory and semantic processing. We then assessed convolutional neural networks (CNNs) as models of mid-level feature processing in humans and observed that, although their hierarchies were shallower, they exhibited a comparable processing order for mid-level features, but not for low- or high-level features, and only for videos. Together, our results support the view that mid-level features are tied to surface- and shape-related processing and establish 3D-rendered stimuli with annotations as a valuable tool for investigating mid-level vision in biological and artificial neural networks.
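
As a rough illustration of what a time-resolved linear encoding analysis can look like, the sketch below fits a ridge regression from stimulus features to EEG channels separately at each time point and scores it by correlation on held-out stimuli. The ridge estimator, the correlation metric, the function names (`fit_ridge`, `encoding_accuracy`, `simulate_eeg`) and the simulated data are all assumptions made for illustration; this summary does not specify the study's actual pipeline.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X'X + alpha*I)^-1 X'Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)


def encoding_accuracy(X_train, Y_train, X_test, Y_test, alpha=1.0):
    """Fit one linear model per time point and score it on held-out stimuli.

    X_*: (n_stimuli, n_features) feature descriptors (hypothetical annotations)
    Y_*: (n_stimuli, n_channels, n_times) EEG responses
    Returns an array of shape (n_times,) with the channel-averaged Pearson r.
    """
    # Standardize features using training statistics only.
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0) + 1e-12
    Xtr = (X_train - mu) / sd
    Xte = (X_test - mu) / sd

    n_times = Y_train.shape[2]
    scores = np.zeros(n_times)
    for t in range(n_times):
        # Center the responses so the model needs no explicit intercept.
        y_mean = Y_train[:, :, t].mean(axis=0)
        W = fit_ridge(Xtr, Y_train[:, :, t] - y_mean, alpha)
        pred = Xte @ W + y_mean
        true = Y_test[:, :, t]

        # Pearson correlation per channel across test stimuli.
        pc = pred - pred.mean(axis=0)
        tc = true - true.mean(axis=0)
        denom = np.sqrt((pc ** 2).sum(axis=0) * (tc ** 2).sum(axis=0)) + 1e-12
        scores[t] = ((pc * tc).sum(axis=0) / denom).mean()
    return scores


def simulate_eeg(n_stimuli=300, n_features=8, n_channels=16, n_times=30,
                 window=(10, 20), seed=0):
    """Simulated data: features drive the channels only inside `window`."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_stimuli, n_features))
    W = rng.standard_normal((n_features, n_channels))
    Y = rng.standard_normal((n_stimuli, n_channels, n_times))  # noise
    Y[:, :, window[0]:window[1]] += (X @ W)[:, :, None]        # signal
    return X, Y


if __name__ == "__main__":
    X, Y = simulate_eeg()
    scores = encoding_accuracy(X[:200], Y[:200], X[200:], Y[200:])
    print("time point with the highest encoding accuracy:", int(scores.argmax()))
```

In this toy setup, the time course of `scores` plays the role of the encoding accuracy curve: time points where the features explain the simulated EEG show high held-out correlation, and the rest stay near zero. With real data, one such curve per feature would allow their time courses to be compared.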
