To address the large model complexity, low accuracy, and poor real-time performance of existing video crowd counting networks, a lightweight counting method based on spatial shuffling and chained residual enhancement is proposed. The model consists of an encoder, a decoder, and a prediction network. In the encoder, a multi-scale depthwise separable inverted residual block is first designed to extract crowd features at different resolutions together with temporal information between adjacent frames, keeping the model lightweight. A spatial shuffling module is then embedded in the encoding backbone to strengthen feature extraction for people at different scales. In the decoder, an enhanced fusion module and a chained residual module combine the multi-resolution features produced by the encoder layer by layer, reducing the loss of detail features. Finally, the prediction network regresses a crowd density map from the decoder output, and the counting result is obtained by summing the density map pixel by pixel. The proposed method was evaluated on the Mall, UCSD, FDST, and ShanghaiTech crowd datasets, where it outperformed the compared algorithms in detection frame rate and parameter count. For example, on the Mall dataset, the mean absolute error (MAE) and mean squared error (MSE) were reduced by 43.75% and 72.71%, respectively, compared with the ConvLSTM crowd counting algorithm, demonstrating higher accuracy and better real-time performance for crowd counting in videos of different scenes.
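The abstract's core building blocks can be illustrated with a minimal PyTorch sketch. The block below is an assumption based on common lightweight designs (a MobileNetV2-style depthwise separable inverted residual block, a ShuffleNet-style channel shuffle standing in for the paper's spatial shuffling module, whose exact form is not given here), and the class and function names (`InvertedResidual`, `channel_shuffle`) are hypothetical, not the authors' implementation. It also shows how a regressed density map is summed pixel by pixel to produce the count.

```python
import torch
import torch.nn as nn


def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    # ShuffleNet-style shuffle, used here only as a stand-in for the
    # paper's spatial shuffling module (not specified in the abstract).
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)


class InvertedResidual(nn.Module):
    """Assumed depthwise separable inverted residual block (MobileNetV2-style)."""

    def __init__(self, channels: int, expansion: int = 4, dilation: int = 1):
        super().__init__()
        hidden = channels * expansion
        self.block = nn.Sequential(
            # 1x1 pointwise expansion
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 3x3 depthwise convolution; varying the dilation changes the
            # receptive field to capture crowd features at different scales
            nn.Conv2d(hidden, hidden, 3, padding=dilation, dilation=dilation,
                      groups=hidden, bias=False),
            nn.BatchNorm2d(hidden),
            nn.ReLU6(inplace=True),
            # 1x1 pointwise projection back to the input width
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.block(x)  # residual connection


if __name__ == "__main__":
    feats = torch.randn(2, 32, 96, 128)            # dummy per-frame features
    feats = InvertedResidual(32)(feats)
    feats = channel_shuffle(feats, groups=4)

    # A 1x1 head regresses a single-channel density map; the crowd count
    # is the pixel-wise sum of that map, as described in the abstract.
    density = torch.relu(nn.Conv2d(32, 1, 1)(feats))
    counts = density.sum(dim=(1, 2, 3))
    print(counts.shape)                            # torch.Size([2])
```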