Crowd Density Estimation

Guru Prasad Natarajan
Feb 12, 2019



Poor crowd management can lead to problems such as crowd crushes and blockages, so there is a growing need for computational models that can analyze highly dense crowds using video feeds from surveillance cameras. Crowd counting is a crucial component of such an automated crowd analysis system. It involves estimating the number of people in the crowd, as well as the distribution of crowd density over the entire area of the gathering. Identifying regions where the crowd density exceeds the safety limit makes it possible to issue early warnings and can prevent potential crowd crushes. Estimating the crowd count also helps quantify the significance of the event and supports better handling of logistics and infrastructure for the gathering.
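To make the two outputs concrete, here is a minimal sketch of how a density map relates to the count and to safety warnings. It assumes the density map is a grid whose cells hold the estimated number of people in that region; the grid values, the `limit` threshold, and the function names are all made up for illustration.

```python
def crowd_count(density_map):
    """Total count is the sum of the density values over the whole map."""
    return sum(sum(row) for row in density_map)


def overcrowded_cells(density_map, limit):
    """Return (row, col) positions whose density exceeds the safety limit."""
    return [
        (r, c)
        for r, row in enumerate(density_map)
        for c, value in enumerate(row)
        if value > limit
    ]


# Hypothetical 2x3 density map (people per cell), not real data.
density = [
    [0.5, 1.0, 0.0],
    [2.5, 4.0, 0.5],
]

print(crowd_count(density))             # 8.5
print(overcrowded_cells(density, 2.0))  # [(1, 0), (1, 1)]
```

In a real deployment the safety limit would come from the venue's guidelines and the cell size on the ground, neither of which is covered here.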


We use a deep convolutional network to estimate both the crowd density and the crowd count from still images captured by live-stream cameras at a variety of angles. Instead of hand-crafted image features such as SIFT or HOG, we rely on features learned by a fully convolutional neural network (CNN). We tackle scale variation in the images by combining a shallow and a deep convolutional network.

Architecture overview:

Figure: Architecture for crowd counting

Crowd images are often captured from varying viewpoints, resulting in a wide variety of perspectives and scale variations. People near the camera are often captured in great detail, i.e., their faces and at times their entire bodies are visible. However, people far from the camera, or people in images captured from an aerial viewpoint, are each represented only as a head blob. Detecting people reliably in both scenarios requires the model to operate at a highly semantic level (face/body detectors) while also recognizing the low-level head-blob patterns.

An overview of the proposed architecture is shown above. Let’s briefly look at what each component does:

Deep Network:

The deep network captures the high-level semantics required for crowd counting using an architecture similar to the well-known VGG-16 network, which has been widely used for image classification and recognition tasks. Although VGG-16 was originally trained for object classification, we fine-tune its filters for the problem of crowd counting.

Shallow Network:

We use a shallow convolutional network to recognize the low-level head-blob patterns that arise from people far from the camera. Since blob detection does not require high-level semantics, this network can be kept shallow; its main job is to detect small head blobs.
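One way to see why the two networks complement each other is to compare their receptive fields, i.e. how many input pixels a single output value can "see". The sketch below is only illustrative: the deep stack follows the standard VGG-16 layout of 3x3 convolutions and 2x2 poolings up to its last convolution, while the three 5x5 layers used for the shallow network are a hypothetical choice, since the exact layers are not given here.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a stack of conv/pool layers.

    layers: list of (kernel_size, stride) pairs, ordered input to output.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the view by (k-1) steps
        jump *= stride              # strides make later steps cover more pixels
    return rf


CONV3 = (3, 1)  # 3x3 convolution, stride 1
POOL2 = (2, 2)  # 2x2 pooling, stride 2

# VGG-16 convolutional stack up to its last convolution (2-2-3-3-3 convs).
deep_layers = (
    ([CONV3] * 2 + [POOL2]) * 2
    + ([CONV3] * 3 + [POOL2]) * 2
    + [CONV3] * 3
)

# Assumption: a shallow network made of three 5x5 convolutions, no pooling.
shallow_layers = [(5, 1)] * 3

print(receptive_field(deep_layers))     # 196
print(receptive_field(shallow_layers))  # 13
```

Under these assumptions the deep branch looks at a window large enough to cover a whole face or body, while the shallow branch looks at a patch on the order of a distant head blob.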

Crowd density prediction:

We concatenate the predictions of the two networks and apply bilinear interpolation to obtain the final crowd density prediction.
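The fusion step can be sketched in plain Python as follows. Two details are assumptions made for illustration: the two concatenated channels are merged with a fixed per-channel weight (in a trained model this weighting would be learned), and the interpolation uses a corner-aligned convention. The function names and the toy maps are hypothetical.

```python
def fuse_predictions(deep_map, shallow_map, weights=(0.5, 0.5)):
    """Stack the two predictions as channels, then mix them per pixel.

    Assumption: fixed mixing weights stand in for a learned combination.
    """
    fused = []
    for deep_row, shallow_row in zip(deep_map, shallow_map):
        fused_row = []
        for d, s in zip(deep_row, shallow_row):
            channels = (d, s)  # concatenation along the channel axis
            fused_row.append(sum(w * c for w, c in zip(weights, channels)))
        fused.append(fused_row)
    return fused


def bilinear_resize(grid, out_h, out_w):
    """Resize a 2D grid with bilinear interpolation (corners aligned)."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bottom = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out


# Toy 2x2 predictions from the two branches (made-up values).
deep_pred = [[0.0, 4.0], [8.0, 12.0]]
shallow_pred = [[0.0, 0.0], [0.0, 0.0]]

fused = fuse_predictions(deep_pred, shallow_pred)
density = bilinear_resize(fused, 3, 3)
print(density[1][1])  # 3.0
```

Note that interpolation changes the sum of the map, so a count read off the resized map would need to be rescaled accordingly.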

We use Mean Absolute Error (MAE) to quantify the performance of our method. MAE is the mean, over all images in the dataset, of the absolute difference between the actual count and the predicted count.
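As a quick illustration of the metric, here is a minimal sketch of MAE over per-image counts. The function name and the sample counts are made up for the example.

```python
def mean_absolute_error(actual_counts, predicted_counts):
    """Mean of |actual - predicted| over all images."""
    if len(actual_counts) != len(predicted_counts):
        raise ValueError("both lists must have one count per image")
    if not actual_counts:
        raise ValueError("at least one image is required")
    total = sum(abs(a - p) for a, p in zip(actual_counts, predicted_counts))
    return total / len(actual_counts)


# Hypothetical ground-truth and predicted counts for three images.
actual = [10, 20, 30]
predicted = [12, 18, 33]

# Absolute errors are 2, 2 and 3, so the MAE is 7 / 3.
print(mean_absolute_error(actual, predicted))
```

A lower MAE means the predicted counts are closer to the true counts on average; it is expressed in the same unit as the counts (people per image).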


This model is applicable wherever automated crowd-monitoring techniques are needed, such as estimating a crowd’s density, tracking its movement, and observing its behavior.