Abstract: The input tensor of video data has temporal, spatial, and channel dimensions, all of which are crucial for extracting the complementary spatial, temporal, and spatio-temporal features used in video action recognition. To extract and integrate these features efficiently, we propose an Efficient Spatio-Temporal Module (ESTM) with three pathways dedicated to spatial, temporal, and spatio-temporal features, respectively. Each pathway uses a Cross Global Average Pooling (CGAP) module to compress one dimension, so that its features focus on the remaining two dimensions. This improves feature extraction and raises recognition rates for complex actions. We also introduce a Motion Excitation Module (MEM), which enriches the input features by transforming correlations between adjacent frames while reducing computational complexity. Finally, ESTM and MEM are integrated into a 2D CNN to form the Efficient Spatio-Temporal Network (ESTN), with minimal impact on network parameters and computational cost. Extensive experiments show that ESTN outperforms state-of-the-art methods on datasets such as Something-Something V1 & V2 and HMDB51, validating its effectiveness.
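The pooling idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the tensor layout (T, C, H, W), the mapping of each pooled axis to a pathway, and the use of a plain adjacent-frame difference as a stand-in for the MEM transform are all assumptions made here for illustration. The abstract does not specify these details.

```python
import numpy as np

def pool_out(x, axis):
    """Average over the given axis (or axes), keeping it as size 1."""
    return x.mean(axis=axis, keepdims=True)

def frame_difference(x):
    """Adjacent-frame difference along axis 0, zero-padded at the last frame.

    Assumption: a simple stand-in for a motion cue, not the paper's MEM.
    """
    diff = x[1:] - x[:-1]
    pad = np.zeros_like(x[:1])
    return np.concatenate([diff, pad], axis=0)

# Hypothetical clip: T=8 frames, C=16 channels, 14x14 spatial grid.
T, C, H, W = 8, 16, 14, 14
rng = np.random.default_rng(0)
clip = rng.standard_normal((T, C, H, W))

# Assumed mapping: compressing one dimension leaves the other two.
spatial_view = pool_out(clip, axis=0)             # time pooled    -> (1, C, H, W)
temporal_view = pool_out(clip, axis=(2, 3))       # space pooled   -> (T, C, 1, 1)
spatiotemporal_view = pool_out(clip, axis=1)      # channel pooled -> (T, 1, H, W)

motion = frame_difference(clip)                   # same shape as clip

print(spatial_view.shape, temporal_view.shape, spatiotemporal_view.shape, motion.shape)
```

Keeping the pooled axis at size 1 lets each view broadcast back against the full clip, which is one plausible way such pathway outputs could be recombined.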
Su et al. studied this question.
www.synapsesocial.com/papers/68e5eb3bb6db6435875803ce — DOI: https://doi.org/10.21203/rs.3.rs-4679346/v1
Yanxiong Su
Qian Zhao
Shanghai University of Electric Power