TY - JOUR
T1 - Embedding Motion and Structure Features for Action Recognition
AU - Zhen, Xiantong
AU - Shao, Ling
AU - Tao, Dacheng
AU - Li, Xuelong
PY - 2013/7
Y1 - 2013/7
N2 - We propose a novel method to model human actions by explicitly coding motion and structure features extracted separately from video sequences. First, a motion template (one feature map) is applied to encode the motion information, and image planes (five feature maps) are extracted from the volume of frame differences to capture the structure information. Gaussian pyramid and center-surround operations are performed on each of the six feature maps, decomposing each into a set of subband maps. Biologically inspired features are then extracted by successively applying Gabor filtering and max pooling to each subband map. To obtain a compact representation, discriminative locality alignment is employed to embed the high-dimensional features into a low-dimensional manifold space. In contrast to sparse representations based on detected interest points, which lose structure information, the proposed model accounts for motion and structure information simultaneously and integrates them in a unified framework; it therefore provides an informative and compact representation of human actions. The proposed method is evaluated on the KTH, multiview IXMAS, and challenging UCF Sports datasets and outperforms state-of-the-art techniques on action recognition.
AB - We propose a novel method to model human actions by explicitly coding motion and structure features extracted separately from video sequences. First, a motion template (one feature map) is applied to encode the motion information, and image planes (five feature maps) are extracted from the volume of frame differences to capture the structure information. Gaussian pyramid and center-surround operations are performed on each of the six feature maps, decomposing each into a set of subband maps. Biologically inspired features are then extracted by successively applying Gabor filtering and max pooling to each subband map. To obtain a compact representation, discriminative locality alignment is employed to embed the high-dimensional features into a low-dimensional manifold space. In contrast to sparse representations based on detected interest points, which lose structure information, the proposed model accounts for motion and structure information simultaneously and integrates them in a unified framework; it therefore provides an informative and compact representation of human actions. The proposed method is evaluated on the KTH, multiview IXMAS, and challenging UCF Sports datasets and outperforms state-of-the-art techniques on action recognition.
KW - Biologically inspired features
KW - discriminative locality alignment
KW - human action recognition
U2 - 10.1109/TCSVT.2013.2240916
DO - 10.1109/TCSVT.2013.2240916
M3 - Article
VL - 23
SP - 1182
EP - 1190
JO - IEEE Transactions on Circuits and Systems for Video Technology
JF - IEEE Transactions on Circuits and Systems for Video Technology
SN - 1051-8215
IS - 7
ER -