Recognizing actions from monocular video has recently become a very active topic in computer vision. In this paper, we propose a new action representation, the co-occurrence matrices descriptor, computed on the intrinsic shape manifold learned by graph embedding. Unlike the bag-of-words (histogram) descriptor, which considers only the spatial information and discards temporal order, the co-occurrence matrices descriptor captures temporal information as well, thus boosting classification accuracy. In addition, we compare the performance of the co-occurrence matrices descriptor on manifolds learned by various graph-embedding methods, which preserve as much of the significant structure of the high-dimensional data as possible in the low-dimensional map. The results show that nonlinear embedding algorithms are more robust than linear ones. Furthermore, we conclude that label information plays a critical role in learning more discriminative manifolds.
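The distinction between the two descriptors can be illustrated with a minimal sketch. Assuming each video frame has been assigned a shape-cluster label (the function names and the temporal offset parameter below are illustrative, not the paper's notation), a histogram counts labels individually, while a co-occurrence matrix counts pairs of labels separated by a temporal offset:

```python
import numpy as np

def histogram_descriptor(labels, n_clusters):
    # Bag-of-words: count how often each shape cluster occurs,
    # discarding the temporal order of the frames entirely.
    h = np.bincount(labels, minlength=n_clusters).astype(float)
    return h / h.sum()

def cooccurrence_descriptor(labels, n_clusters, offset=1):
    # Co-occurrence matrix: count pairs (labels[t], labels[t+offset]),
    # so the descriptor retains how shapes follow one another in time.
    m = np.zeros((n_clusters, n_clusters))
    for a, b in zip(labels[:-offset], labels[offset:]):
        m[a, b] += 1.0
    return m / m.sum()

# Two sequences with identical label counts but different temporal order:
seq_a = np.array([0, 1, 0, 1])
seq_b = np.array([0, 0, 1, 1])
print(histogram_descriptor(seq_a, 2))      # same as for seq_b
print(cooccurrence_descriptor(seq_a, 2))   # differs from seq_b
```

The two sequences are indistinguishable to the histogram but yield different co-occurrence matrices, which is precisely the extra temporal information the proposed descriptor exploits.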