July 29, 2024 | Open Access

Efficient Spatio-Temporal Network for Action Recognition

Abstract

A video input tensor has temporal, spatial, and channel dimensions, all of which matter for extracting the complementary spatial, temporal, and spatio-temporal features used in video action recognition. To extract and integrate these features efficiently, we propose an Efficient Spatio-Temporal Module (ESTM) with three pathways, dedicated to spatial, temporal, and spatio-temporal features respectively. Each pathway uses a Cross Global Average Pooling (CGAP) module to compress one dimension, so that the features focus on the remaining two. This improves feature extraction and raises recognition rates for complex actions. We also introduce a Motion Excitation Module (MEM), which enriches the input features by transforming the correlations between adjacent frames while reducing computational complexity. Finally, ESTM and MEM are integrated into a 2D CNN to form the Efficient Spatio-Temporal Network (ESTN), with minimal impact on parameter count and computational cost. Extensive experiments show that ESTN outperforms state-of-the-art methods on datasets such as Something-Something V1 & V2 and HMDB51, validating its effectiveness.
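The idea of compressing one dimension so that features focus on the remaining two can be illustrated with plain average pooling. The sketch below is only an illustration under stated assumptions, not the authors' CGAP module: the abstract does not specify the tensor layout or the pooling details, so the `(T, C, H, W)` layout, the treatment of height and width as one spatial dimension, and the helper name `pool_out` are all hypothetical.

```python
import numpy as np

# Assumed layout for one clip: (T, C, H, W) = (frames, channels, height, width).
# Height and width are treated together as the single "spatial" dimension.
AXES = {"temporal": (0,), "channel": (1,), "spatial": (2, 3)}


def pool_out(x, dim):
    """Average away one dimension, keeping it as size 1 so shapes stay aligned.

    Hypothetical helper: a stand-in for "compress the current dimension",
    not the CGAP module described in the paper.
    """
    return x.mean(axis=AXES[dim], keepdims=True)


rng = np.random.default_rng(0)
clip = rng.standard_normal((8, 16, 14, 14))  # 8 frames, 16 channels, 14x14 maps

no_time = pool_out(clip, "temporal")    # shape (1, 16, 14, 14)
no_channel = pool_out(clip, "channel")  # shape (8, 1, 14, 14)
no_space = pool_out(clip, "spatial")    # shape (8, 16, 1, 1)

for name, t in [("temporal", no_time), ("channel", no_channel), ("spatial", no_space)]:
    print(f"compressed {name}: {t.shape}")
```

Each pooled tensor varies only along the two dimensions that were not averaged. How these descriptors are used inside each ESTM pathway is not stated in the abstract and is left out here.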

Cite this study

Su et al. (2024) studied this question.

www.synapsesocial.com/papers/68e5eb3bb6db6435875803ce — DOI: https://doi.org/10.21203/rs.3.rs-4679346/v1

Authors

Yanxiong Su

Qian Zhao

Institutions

Shanghai University of Electric Power
