Abstract

Visual perception unfolds through a hierarchy of transformations, beginning with the extraction of low-level features, such as edges, and culminating in the representation of high-level features, such as object categories. While the processing of low- and high-level features is well studied, the intermediate transformations, i.e., mid-level features, remain poorly understood. Here, we introduce a stimulus set of naturalistic 3D-rendered images and videos with ground-truth annotations for five candidate mid-level features (reflectance, scene depth, world normals, lighting, and skeleton position) alongside one low-level feature (edges) and one high-level feature (action identity). To determine when these features are processed in the brain, we collected electroencephalography (EEG) responses during stimulus presentation and trained linearized encoding models to predict EEG responses from the annotations. We first showed that candidate mid-level features were best represented between ~100 and 250 ms post-stimulus, a window that falls between those of low- and high-level features and is consistent with a bridging role linking sensory and semantic processing. We then assessed convolutional neural networks (CNNs) as models of mid-level feature processing in humans and observed that, although their hierarchies were shallower, they exhibited a processing order comparable to that of humans for mid-level features, but not for low- or high-level features, and only for videos. Together, our results support the view that mid-level features are tied to surface- and shape-related processing and establish 3D-rendered stimuli with annotations as a valuable tool for investigating mid-level vision in biological and artificial neural networks.
Karapetian et al. studied this question.
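To make the time-resolved encoding analysis mentioned in the abstract more concrete, the following is a minimal sketch on synthetic data. It is not the authors' pipeline: the choice of ridge regression, the function names (`fit_ridge`, `encoding_accuracy`), the array shapes, and the simulated signal window are all assumptions made for illustration. The idea is to fit one linear model per time point that maps feature annotations to EEG channel responses, then score it by correlating predicted and held-out responses.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)

def encoding_accuracy(X_train, Y_train, X_test, Y_test, alpha=1.0):
    """Per-time-point prediction accuracy of a linear encoding model.

    X_*: (n_stimuli, n_features) feature annotations (hypothetical layout).
    Y_*: (n_stimuli, n_channels, n_times) EEG responses (hypothetical layout).
    Returns an array of length n_times holding the Pearson correlation between
    predicted and observed test responses, averaged over channels.
    """
    # Standardize features using training-set statistics only.
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0) + 1e-12
    X_tr = (X_train - mu) / sd
    X_te = (X_test - mu) / sd

    n_times = Y_train.shape[2]
    scores = np.zeros(n_times)
    for t in range(n_times):
        # Fit on mean-centered training responses at this time point.
        y_mean = Y_train[:, :, t].mean(axis=0)
        W = fit_ridge(X_tr, Y_train[:, :, t] - y_mean, alpha)
        pred = X_te @ W + y_mean
        obs = Y_test[:, :, t]
        # Pearson correlation per channel, then average across channels.
        pc = pred - pred.mean(axis=0)
        oc = obs - obs.mean(axis=0)
        denom = np.sqrt((pc ** 2).sum(axis=0) * (oc ** 2).sum(axis=0)) + 1e-12
        scores[t] = ((pc * oc).sum(axis=0) / denom).mean()
    return scores

# Synthetic demo: features drive the simulated EEG only at time indices 10-19.
rng = np.random.default_rng(0)
n_train, n_test, n_features, n_channels, n_times = 200, 50, 10, 8, 30
X = rng.standard_normal((n_train + n_test, n_features))
W_true = rng.standard_normal((n_features, n_channels))
Y = rng.standard_normal((n_train + n_test, n_channels, n_times))  # noise floor
Y[:, :, 10:20] += (X @ W_true)[:, :, None]  # feature-driven signal window

scores = encoding_accuracy(X[:n_train], Y[:n_train], X[n_train:], Y[n_train:])
peak_time = int(np.argmax(scores))
```

In this sketch, the time point at which `scores` peaks plays the role of the latency at which a feature is best represented. Comparing such peaks across feature types is one way a processing order could be read out, under the assumptions stated above.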