Human action recognition is one of the most challenging tasks in the machine intelligence community. Extracting discriminative spatial-temporal features is essential for learning action representations. However, the discriminative information in videos is usually sparse and mixed with a large amount of redundant and interfering information, which leads to poor performance and recognition failures. Spatial-temporal attention modules enable a network to learn discriminative feature representations of different human actions. One critical issue that is often overlooked in the design of these modules is the visual tempo of actions. Since a video is formed by a set of spatial changes over time, this paper proposes a visual-tempo-based spatial-temporal attention mechanism that helps the model focus on the most meaningful changes in space and time. The proposed attention module can be flexibly integrated into recurrent networks in a plug-and-play manner. Experimental results on UCF101, HMDB51, and Kinetics-400 demonstrate that the proposed model achieves superior performance among RCNN-based architectures and remains highly competitive with recent state-of-the-art methods, effectively balancing high accuracy with computational efficiency.
Maryam Koohzadi
Nasrollah Moghadam Charkari
Foad Ghaderi
Discover Artificial Intelligence
K.N.Toosi University of Technology
Koohzadi et al. (Tue,) studied this question.
www.synapsesocial.com/papers/69fd7d94bfa21ec5bbf05e7b — DOI: https://doi.org/10.1007/s44163-026-01119-0