Comparing simulations of key warm periods in Earth history with contemporaneous geological proxy data is a useful approach for evaluating the ability of climate models to simulate warm, high-CO2 climates that are unprecedented in the more recent past. Here we use a global data set of confidence-assessed, proxy-based temperature estimates and biome reconstructions to assess the ability of eight models to simulate warm terrestrial climates of the Pliocene epoch. The Late Pliocene, 3.6–2.6 million years ago, is an accessible geological interval for understanding the climate processes of a warmer world. We show that model-predicted surface air temperatures reveal a substantial cold bias in the Northern Hemisphere. Particularly strong data–model mismatches in mean annual temperatures (up to 18 °C) exist in northern Russia. Our model sensitivity tests identify insufficient temporal constraints, which hamper the accurate configuration of model boundary conditions, as an important factor contributing to data–model discrepancies. We conclude that, to allow a more robust evaluation of the ability of present climate models to predict warm climates, future Pliocene data–model comparison studies should focus on orbitally defined time slices.