This paper addresses the multi-view action recognition problem with a local segment similarity voting scheme, upon which we build a novel multi-sensor fusion method. The recently proposed random forests classifier is used to map the local segment features to their corresponding class prediction histograms. We compare the results of our approach with those of the baseline Bag-of-Words (BoW) and Naïve–Bayes Nearest Neighbor (NBNN) methods on the multi-view IXMAS dataset. Additionally, we compare our multi-camera fusion strategy with the commonly used early feature-concatenation strategy across different camera views and different segment scales. The results demonstrate that the proposed sensor fusion technique, coupled with the random forests classifier, is effective for multi-view human action recognition.
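A minimal sketch of the kind of segment-level voting described above, using scikit-learn's random forest: each local segment contributes its class prediction histogram, and histograms are accumulated across segments and camera views (late fusion). This is an illustrative assumption, not the authors' implementation; the feature extraction and IXMAS data are replaced with synthetic stand-ins.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_classes = 3

def make_view(n_segments, n_dims=16):
    """Synthetic stand-in for local segment descriptors from one camera view."""
    X = rng.normal(size=(n_segments, n_dims))
    y = rng.integers(0, n_classes, size=n_segments)
    X += y[:, None]  # shift features per class so they are separable
    return X, y

# Train the forest on labeled segments (one pooled training set here).
X_train, y_train = make_view(300)
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)

def action_score(segments_per_view):
    """Sum per-segment prediction histograms across segments and views,
    then normalize to a class probability vector for the whole action."""
    votes = np.zeros(n_classes)
    for X_view in segments_per_view:
        votes += forest.predict_proba(X_view).sum(axis=0)
    return votes / votes.sum()

# Fuse two (here identical, synthetic) camera views of one test action.
X_test, _ = make_view(20)
probs = action_score([X_test, X_test])
label = int(probs.argmax())
```

Accumulating prediction histograms at the segment level, rather than concatenating raw features across cameras, is what distinguishes this late-fusion voting from the early feature-concatenation baseline mentioned above.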