The accuracy and believability of crowd simulations underpin computational studies of human collective behaviour, with implications for urban design, policing, security and many other areas. Accuracy concerns the closeness of the fit between a simulation and observed data, and believability concerns the human perception of plausibility. In this paper, we address both issues via a so-called ‘Turing test’ for crowds, using movies generated from both accurate simulations and observations of real crowds. The fundamental question we ask is ‘Can human observers distinguish between real and simulated crowds?’ In two studies with student volunteers (n = 384 and n = 156), we find that non-specialist individuals are able to reliably distinguish between real and simulated crowds when they are presented side-by-side, but they are unable to accurately classify them. Classification performance improves slightly when crowds are presented individually, but not enough to outperform random guessing. We find that untrained individuals have an idealized view of human crowd behaviour which is inconsistent with observations of real crowds. Our results suggest a possible framework for establishing a minimal set of collective behaviours that should be integrated into the next generation of crowd simulation models.