Introduction to Crowd Density Estimation

Katnoria · Published in Analytics Vidhya · Jun 10, 2020

Reproduced from the original post

In this post, we are going to build an object counting model based on a simple network architecture. Although we use a crowd dataset here, a similar solution can be applied to arguably more useful applications such as counting cells, crops, fruit, trees, cattle, or even endangered species in the wild.

There are different ways to count the objects in a given image. One could use an R-CNN based model for object detection (the original post shows an example), and that would work just fine. But what do you do when you have a lot more people, as in Figure 1? Will the same assumptions hold true? Do we even have access to labelled datasets in the format used by R-CNN and its variants?

Figure 1. Source: ShanghaiTech Dataset

In this post, we are going to build models that attempt to solve this using a pre-trained ConvNet as the backbone and a regression head for counting the crowd.

The network architecture is simple enough that I think this could be called the “Hello World” of the crowd density estimation task (pardon my ignorance if you know of simpler ways).

High-Level Flow

If we overlay the output, i.e., the density map, over the image, we can see that the head of each person is highlighted. These highlighted points are what we want our model to learn to estimate. And to get the total count, we simply sum the values of the map.

Sample image and density map from ShanghaiTech Dataset
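To make the counting step concrete, here is a minimal sketch (the tensor below is dummy data, not a real prediction): a density map is just a 2-D grid of non-negative values, and the estimated count is its sum.

```python
import torch

# A density map assigns each pixel a fraction of a "head"; summing
# over all pixels yields the estimated crowd count.
density_map = torch.rand(1, 1, 112, 112)  # (batch, channel, H, W), dummy data
estimated_count = density_map.sum(dim=(1, 2, 3))  # one count per image
print(estimated_count)
```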

Dataset

We use the crowd counting dataset introduced in this paper, known as the “ShanghaiTech Crowd Counting Dataset”; it contains images of arbitrary crowd density along with target labels. We train our model on Part A of the dataset. However, instead of using the density maps provided by the dataset, we use the processed maps generated by the C3 Framework for convenience. The C3 Framework is an excellent resource that covers multiple network architectures and their performance on different datasets. I encourage you to have a look at the paper and their repo.

Histogram of train set

Below we show a few sample images from the dataset. We also show the associated density map below each image. Annotating the dataset must’ve been a difficult task.

Dataset sample

Pre-processing

Throughout the implementation, we follow the guidelines and techniques used by the C3 Framework. The C3 Framework applies the following augmentations/transformations in PyTorch:

CenterCrop (to 224) → RandomFlip → ScaleDown → LabelNormalize (100) → ToTensor → Normalize

Image transformations in PyTorch
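The original post embeds the exact transform code; the sketch below is my own minimal re-creation of the paired image/density-map pipeline, so the helper names and the scale factor of 8 are assumptions rather than the C3 Framework's actual API.

```python
import random
import numpy as np
import torch
import torchvision.transforms.functional as TF

LOG_PARA = 100.0  # LabelNormalize factor from the pipeline above

def center_crop_pair(img, den, size=224):
    """Center-crop a PIL image and its numpy density map together."""
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    return img.crop((left, top, left + size, top + size)), den[top:top + size, left:left + size]

def random_flip_pair(img, den):
    """Horizontally flip image and density map together, half the time."""
    if random.random() < 0.5:
        img, den = TF.hflip(img), np.ascontiguousarray(den[:, ::-1])
    return img, den

def scale_down(den, factor=8):
    """Shrink the density map to the network's output resolution while
    preserving the total count (the factor depends on the backbone stride)."""
    h, w = den.shape
    return den.reshape(h // factor, factor, w // factor, factor).sum(axis=(1, 3))

def to_tensors(img, den):
    """ToTensor + Normalize for the image, LabelNormalize for the map.
    ImageNet statistics are a placeholder here; the framework may use
    dataset-specific values."""
    img = TF.normalize(TF.to_tensor(img), mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    return img, torch.from_numpy(den.astype(np.float32)) * LOG_PARA
```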

Models

We will use VGG16 as the backbone for our models in this post. Once we have the full training and evaluation infra ready, we can easily add more powerful models and compare their performance against the baseline models.

Baseline Model

As our baseline, we will use a pre-trained VGG16 network followed by two Conv layers and an upsampling layer to match the target density map (m × n). Recall that we scale down the input image and target by a scaling factor, so the final layer needs to take that into account.

Code on GitHub: models.py
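The exact architecture lives in models.py; as a rough sketch of the idea (the layer sizes and cut-off point are my assumptions, not the repo's code), the baseline could look like this:

```python
import torch.nn as nn
from torchvision import models

class VGG16Baseline(nn.Module):
    """Pre-trained VGG16 features, two extra Conv layers, and an
    upsampling layer. A sketch of the idea, not the exact models.py code."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(pretrained=True)
        # Convolutional stages up to conv5_3 (output stride 16, 512 channels)
        self.backbone = nn.Sequential(*list(vgg.features.children())[:30])
        self.head = nn.Sequential(
            nn.Conv2d(512, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 1, kernel_size=1),
            # Upsample stride-16 features to stride 8, assuming the target
            # density maps were scaled down by a factor of 8.
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
        )

    def forward(self, x):
        return self.head(self.backbone(x))  # (B, 1, H/8, W/8) density map
```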

Evaluation

We will first check how well the model is able to overfit the training data. We visualize its performance by comparing the actual vs. predicted crowd count. Each tab shows the predictions on images of a given input size. The better the model, the more points lie close to the diagonal line. As you can see in the plots below, the model does comparatively better on images with a crowd count ≤ 1000. However, its performance begins to suffer as the crowd count increases. Why is that? I leave it for you to find out (hint: it has something to do with the input size 😉)
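To produce the actual-vs-predicted points for such a plot, we sum each density map into a count, remembering to undo the LabelNormalize scaling. A minimal sketch, assuming the loader yields (image, density map) batches:

```python
import torch

@torch.no_grad()
def predicted_vs_actual(model, loader, log_para=100.0, device="cpu"):
    """Collect (actual, predicted) crowd counts, e.g. for a scatter plot."""
    model.to(device).eval()
    actual, predicted = [], []
    for imgs, dens in loader:
        preds = model(imgs.to(device))
        # Undo LabelNormalize (x100) before summing maps into counts
        predicted += (preds.sum(dim=(1, 2, 3)) / log_para).tolist()
        actual += (dens.sum(dim=(1, 2)) / log_para).tolist()
    return actual, predicted
```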

See the original article to check out the plots from the three different tabs

Let us now review the images our model gets right and the ones where it struggles. Here, we display images from both the train and test sets.

Two things stand out to me: 1) the images with better predictions contain only people, and 2) the orientation of the heads in the image matters. Another crucial insight is that the model performs better as the input image size increases; review the plots in both the 224x224 and 448x448 tabs to confirm this.

The samples from the test set also seem to confirm that images containing many different objects, such as trees along with people, make things difficult for the model. The last image is difficult even for the human eye.

VGG16 with Decoder

We now move to another simple yet more powerful model that also uses pre-trained VGG16 as its backbone. We make use of Conv and ConvTranspose layers; the C3 paper refers to these layers as the decoder.

Code on GitHub: models.py
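Again, models.py has the exact code; below is a hypothetical sketch of a small Conv/ConvTranspose decoder on top of the same backbone, with layer sizes that are my assumptions:

```python
import torch.nn as nn
from torchvision import models

class VGG16Decoder(nn.Module):
    """Pre-trained VGG16 backbone plus a small Conv/ConvTranspose decoder.
    A sketch of the idea, not the exact models.py code."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(pretrained=True)
        # Convolutional stages up to conv5_3 (output stride 16, 512 channels)
        self.backbone = nn.Sequential(*list(vgg.features.children())[:30])
        self.decoder = nn.Sequential(
            nn.Conv2d(512, 256, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            # ConvTranspose doubles the resolution (stride 16 -> stride 8)
            nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 1, kernel_size=1),
        )

    def forward(self, x):
        return self.decoder(self.backbone(x))  # (B, 1, H/8, W/8) density map
```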

We train this model with the same hyperparameters as the baseline model for 400 epochs.
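As a reference point, a bare-bones training loop with a pixel-wise MSE loss over density maps (a common choice for this task; the learning rate below is a placeholder, not the value actually used) might look like this:

```python
import torch
import torch.nn as nn

def train(model, loader, epochs=400, lr=1e-5, device="cpu"):
    """Bare-bones training loop; lr is a placeholder value."""
    model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.MSELoss()  # pixel-wise loss between density maps
    for epoch in range(epochs):
        running = 0.0
        for imgs, dens in loader:
            imgs, dens = imgs.to(device), dens.to(device)
            preds = model(imgs).squeeze(1)  # (B, H, W) to match the targets
            loss = criterion(preds, dens)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            running += loss.item()
        print(f"epoch {epoch + 1}: mean loss {running / len(loader):.4f}")
```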

See the original article to check out the plots from the different tabs

According to the C3 Framework, both models should have comparable performance, but VGG16 with Decoder will generate more precise density maps. We can see this in the table and examples below. Our numbers are nowhere near the ones reported by the C3 Framework, which I think is mainly because they train their models at a higher input size.

And finally, we overlay the density maps generated by both models on a given image. We see that the VGG16 baseline is spot on ✅ in terms of the actual count, but VGG16 + Decoder generates tighter density maps.

Model Prediction
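Producing such an overlay is straightforward with matplotlib; a small sketch, assuming the density map has already been resized to the image's spatial size:

```python
import matplotlib.pyplot as plt

def overlay(image, density):
    """Overlay a density map on its source image.
    `image` is an HxWx3 array; `density` a 2-D map (resize it to the
    image's spatial size first if the model outputs a smaller map)."""
    plt.imshow(image)
    plt.imshow(density, cmap="jet", alpha=0.5)  # translucent heat map on top
    plt.axis("off")
    plt.show()
```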

You can try increasing the input size and play around with the models yourself. If you can get a good enough model, you can perhaps help answer the question of whose rally had more people 🤣. The code is available in my GitHub repo. In case you haven't noticed, if you hover over the cover image you will see the density map generated by VGG16 + Decoder.

For a more thorough evaluation of different models, check out my supplementary post here.

Source: https://www.katnoria.com/crowd-density-eval/

What Next

You can try tuning the hyperparameters, finding the right learning rate, and/or changing the model architecture to get better performance. Here is the list of things I would try next if I were to make it useful:

  1. Add regularisation
  2. Use a more powerful backbone such as ResNet variants
  3. Use other models from the C3 Framework
  4. Given that we have only 300 samples (very few by deep learning standards), you could try U-Net, which is known to perform well on tasks such as cell segmentation
  5. Spatially divide the image into sub-regions called closed sets and train the model on them, as suggested by the paper “From Open Set to Closed Set: Counting Objects by Spatial Divide-and-Conquer”. The authors claim that this approach generalises well and can achieve state-of-the-art performance on a few crowd counting datasets
  6. Use the encoder-decoder approach highlighted in the paper “Encoder-Decoder Based Convolutional Neural Networks with Multi-Scale-Aware Modules for Crowd Counting”. The authors claim their model performs well on dense as well as sparse crowds.

References & Links

  1. C3 Framework: I learned a great deal from this paper and their code [PAPER | GITHUB]
  2. A Survey of Recent Advances in CNN-based Single Image Crowd Counting and Density Estimation [LINK]
  3. U-Net: Convolutional Networks for Biomedical Image Segmentation [LINK]
  4. From Open Set to Closed Set: Counting Objects by Spatial Divide-and-Conquer [LINK]
  5. Encoder-Decoder Based Convolutional Neural Networks with Multi-Scale-Aware Modules for Crowd Counting [LINK]
