We present a new descriptor for local representation of human actions. In contrast to state-of-the-art descriptors, which use spatio-temporal features to describe cuboids detected from video sequences, we propose to employ a 2D descriptor based on the Laplacian pyramid for efficiently encoding spatio-temporal regions of interest. Image templates including structural planes and motion templates, are firstly extracted from a cuboid to encode the structural and motion features. A 2D Laplacian pyramid is then performed to decompose each of those images into a series of sub-band feature maps, which is followed by a two-stage feature extraction, i.e., Gabor filtering and max pooling. Motion-related edge and orientation information is enhanced after the filtering. To capture more discriminative and invariant features, max pooling is applied to the outputs of Gabor filtering, between scales within filter banks and over spatial neighbors. The obtained local features associated with cuboids are fed to the localized soft-assignment coding with max pooling on the Bag-of-Words (BoWs) model to represent an action. The image templates, i.e., MHI and TOP, explicitly encode the motion and structure information in the video sequences and the proposed Laplacian pyramid coding descriptor provides an informative representation of them due to the multi-scale analysis. The employment of localized soft-assignment coding and max pooling gives a robust representation of actions. Experimental results on the benchmark KTH dataset and the newly released and challenging HMDB51 dataset demonstrate the effectiveness of the proposed method for human action recognition.