This letter investigates reconfigurable intelligent surface (RIS)-aided massive multiple-input multiple-output (MIMO) systems under a two-timescale design. First, a zero-forcing (ZF) detector is applied at the base station (BS) based on the instantaneous aggregated channel state information (CSI), where the aggregated channel is the superposition of the direct channel and the cascaded user-RIS-BS channel. Then, by leveraging the statistical properties of the channel, we derive a closed-form expression for the ergodic achievable rate. Using a gradient ascent method, we design the RIS passive beamforming relying only on long-term statistical CSI. We prove that the ergodic rate scales on the order of O(log2(MN)), where M and N denote the number of BS antennas and RIS elements, respectively. We also prove the striking superiority of the considered RIS-aided system with ZF detectors over RIS-free systems and over RIS-aided systems with maximum-ratio combining (MRC).
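
As a rough illustration of the first step, the following minimal NumPy sketch builds an aggregated channel and applies a ZF detector to it. The symbol names (D, H1, H2, theta), the number of users K, and the i.i.d. Rayleigh channel draws are assumptions made only for this example and are not taken from the letter; the sketch shows the generic ZF property that inter-user interference is cancelled, not the letter's rate analysis or its phase-shift optimization.

```python
import numpy as np


def crandn(rng, *shape):
    """Draw i.i.d. CN(0, 1) entries (illustrative channel model, an assumption)."""
    return (rng.standard_normal(shape) + 1j * rng.standard_normal(shape)) / np.sqrt(2)


def aggregated_channel(D, H2, H1, theta):
    """Aggregated channel Q = D + H2 diag(e^{j theta}) H1 (direct + cascaded)."""
    Phi = np.diag(np.exp(1j * theta))  # unit-modulus RIS phase shifts
    return D + H2 @ Phi @ H1


def zf_detector(Q):
    """ZF detector A = Q (Q^H Q)^{-1}, which satisfies A^H Q = I_K."""
    return Q @ np.linalg.inv(Q.conj().T @ Q)


rng = np.random.default_rng(0)
M, N, K = 16, 32, 4            # BS antennas, RIS elements, users (K is hypothetical)
D = crandn(rng, M, K)          # direct user-BS channel
H1 = crandn(rng, N, K)         # user-RIS channel
H2 = crandn(rng, M, N)         # RIS-BS channel
theta = rng.uniform(0.0, 2.0 * np.pi, N)

Q = aggregated_channel(D, H2, H1, theta)
A = zf_detector(Q)

x = crandn(rng, K)             # transmitted symbols
y = Q @ x                      # noise-free received signal, for clarity
x_hat = A.conj().T @ y         # ZF estimate

# Distance of A^H Q from the identity: near zero means interference is removed.
residual = np.linalg.norm(A.conj().T @ Q - np.eye(K))
```

In a two-timescale setting, the detector would be recomputed from Q in every coherence interval, whereas theta would be updated only when the channel statistics change; here theta is simply drawn at random.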