Comparing different crowd counting methods

Jun 13, 2018

I was tasked with coming up with a crowd counting solution for an event called Makerfaire 2018 by flying a drone and taking still images of the crowd. I had no knowledge of crowd counting and was equipped only with the basic machine learning techniques I had learned during my uni days. I had to go through many articles and research papers just to understand the crowd counting landscape. This article is a brief summary of what I have learnt, and will hopefully be useful to someone else out there who is struggling just as I was.

Evolution of Crowd Counting Methods

Reference: http://personal.ie.cuhk.edu.hk/~ccloy/files/crowd_2013.pdf

There are basically 4 types of counting methods, listed from oldest to most recent: Detection Based, Cluster Based, Feature Regression, and Neural Network.

1) Detection Based (Supervised)

A classifier is trained on a labelled set of training data that usually consists of full-body shots of people. The classifier is then scanned over each test image to find patterns of intensities (such as the shapes of people) that are consistent with the training data.

Examples of classifiers: RBF SVMs, Random Forests.
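The scanning step can be sketched in a few lines of plain Python. This is only a minimal illustration: `sliding_window_detect` and `mean_intensity` are hypothetical names, and the brightness score is a toy stand-in for a trained classifier such as an SVM. A real detector would also need to merge overlapping hits (non-maximum suppression), which is left out here.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]  # grayscale image as rows of pixel intensities


def sliding_window_detect(
    image: Image,
    window: Tuple[int, int],
    stride: int,
    score_fn: Callable[[Image], float],
    threshold: float,
) -> List[Tuple[int, int]]:
    """Return the top-left (row, col) of every window scoring above threshold."""
    win_h, win_w = window
    height, width = len(image), len(image[0])
    hits = []
    for top in range(0, height - win_h + 1, stride):
        for left in range(0, width - win_w + 1, stride):
            patch = [row[left:left + win_w] for row in image[top:top + win_h]]
            if score_fn(patch) > threshold:
                hits.append((top, left))
    return hits


def mean_intensity(patch: Image) -> float:
    """Toy stand-in for a trained classifier: average brightness of the patch."""
    total = sum(sum(row) for row in patch)
    return total / (len(patch) * len(patch[0]))


# Made-up 6x6 dark image containing two bright 2x2 "people".
image = [[0.0] * 6 for _ in range(6)]
for r, c in [(0, 0), (4, 4)]:
    for dr in range(2):
        for dc in range(2):
            image[r + dr][c + dc] = 1.0

detections = sliding_window_detect(image, (2, 2), 2, mean_intensity, 0.9)
print(len(detections), "detections at", detections)
```

The crowd count is simply the number of detections, which is why occlusion hurts this approach so much: a person who is half hidden never produces a window that looks like the training data.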

Evaluation
These systems work well for detecting faces, but less well for pedestrians because images of pedestrians vary widely (due to changes in body pose and clothing). They also suffer in crowded scenes, where occlusion and scene clutter are inevitable, and perform even worse in surveillance applications, where image resolution is very low.

References
http://www.merl.com/publications/docs/TR2003-90.pdf

2) Cluster Based (Unsupervised)

It assumes that the way an individual moves, and other visual features (such as hand-carried bags or clothing), are relatively constant and unique. Based on the trajectories of these features, we can cluster them into groups that represent independently moving entities.

Examples of techniques: Bayesian Clustering, KLT Tracker
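To make the idea concrete, here is a rough sketch of grouping feature tracks that move together. It is not Bayesian clustering or the KLT tracker itself: it assumes the per-frame positions of each feature point are already available (for example from a tracker), and it uses a deliberately simple rule, "two points belong to the same entity if they stay within `radius` of each other in every frame". The names `stays_close` and `count_entities` and the sample tracks are made up for illustration.

```python
from math import hypot
from typing import List, Tuple

Track = List[Tuple[float, float]]  # one feature point's (x, y) position per frame


def stays_close(a: Track, b: Track, radius: float) -> bool:
    """True if the two tracked points are within radius of each other in every frame."""
    return all(hypot(ax - bx, ay - by) <= radius for (ax, ay), (bx, by) in zip(a, b))


def count_entities(tracks: List[Track], radius: float) -> int:
    """Group tracks that move together (union-find) and count the groups."""
    parent = list(range(len(tracks)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(tracks)):
        for j in range(i + 1, len(tracks)):
            if stays_close(tracks[i], tracks[j], radius):
                parent[find(i)] = find(j)
    return len({find(i) for i in range(len(tracks))})


# Made-up tracks over 3 frames: two points moving right together,
# and two points moving down together somewhere else in the scene.
tracks = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],
    [(10.0, 0.0), (10.0, 1.0), (10.0, 2.0)],
    [(11.0, 0.0), (11.0, 1.0), (11.0, 2.0)],
]

entity_count = count_entities(tracks, radius=2.0)
print(entity_count)
```

Note that nothing here needs labels, but the rule only makes sense when there are several frames to compare, and two people walking side by side would be merged into one entity.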

Evaluation
The training data does not have to be labelled, and the method works well on sparse crowds. However, it requires the targets to be in continuous motion, and inaccuracies can arise if people remain static in a scene or if two objects share a common trajectory over time. This method will not work on a single still image; it needs a sequence of video frames.

3) Feature Regression (Supervised)

This is probably the most researched and most widely used method so far. It requires you to identify the perspective map of the region of interest and then extract low-level features from that region, such as the foreground pixels or edges. Properties such as the foreground area and the total number of edges are then derived, and a regression model is used to establish a direct mapping between these properties and the number of people in the image.

(Credits to the authors from the paper referenced above) A typical pipeline of counting by regression

In the pipeline shown in that figure, the region of interest and its perspective map are defined first. Low-level features such as the foreground pixels are then extracted from the input image and passed as inputs to a regression model.

Examples of Regression Models: Linear Regression, Kernel Ridge Regression, Support Vector Regression, Gaussian Process Regression, etc.
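The pipeline can be sketched with a single feature and the simplest of the models listed, linear regression. Everything below is illustrative: the function names, the tiny mask, the perspective weights, and the training pairs are all made up, and a real system would use several features and far more training frames.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]        # 1 = foreground pixel, 0 = background
Weights = List[List[float]]   # perspective map: larger weight for pixels farther away


def weighted_foreground_area(mask: Mask, perspective: Weights) -> float:
    """Sum the foreground pixels, each scaled by its perspective weight."""
    return sum(
        m * w
        for mask_row, weight_row in zip(mask, perspective)
        for m, w in zip(mask_row, weight_row)
    )


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical training data: weighted foreground area of each frame
# and the hand-counted number of people in that frame.
train_areas = [4.0, 8.0, 12.0]
train_counts = [1.0, 2.0, 3.0]
slope, intercept = fit_line(train_areas, train_counts)

# A new frame from the SAME scene, so the same perspective map applies.
mask = [[1, 0],
        [1, 1]]
perspective = [[2.0, 2.0],   # far row: each pixel covers more of a person
               [1.0, 1.0]]   # near row
area = weighted_foreground_area(mask, perspective)
predicted_count = slope * area + intercept
print(predicted_count)
```

The perspective weights are baked into the feature, so the fitted slope and intercept are only meaningful for the scene they were trained on.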

Evaluation
The trained model depends on the perspective map. If it is used on another scene with a different perspective map, its results will be much less accurate. However, if it is tested on an entirely new frame of the same scene, it performs decently.

4) Neural Networks (Supervised)

Deep learning has been in the spotlight in recent years, and it is natural that recent research on crowd counting has shifted to deep learning methods, which seem to yield promising results. Due to the nature of neural networks, meaningful features of the image that help achieve the end goal of counting people are autonomously “discovered” during training. This means that patterns we do not intuitively see, or that are hard to handle (such as features that exist at many scales), can also be found and taken into account during training, which allows the image to be better represented in the network than with the other methods mentioned above.

One example is a convolutional neural network whose input is an image and whose output is a density map of the crowd. The number of people is obtained by integrating the density map.
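To show what "integrating the density map" means in practice, here is a small sketch that builds a density map by hand and sums it. It does not run a neural network: the map is constructed from made-up head positions by placing a normalised Gaussian blob at each one, as a stand-in for what a network would output. The function names and the fixed `sigma` are assumptions for illustration.

```python
from math import exp
from typing import List, Tuple

DensityMap = List[List[float]]


def gaussian_density_map(
    height: int, width: int, heads: List[Tuple[int, int]], sigma: float
) -> DensityMap:
    """Place one Gaussian blob per head, each normalised to sum to exactly 1."""
    density = [[0.0] * width for _ in range(height)]
    for head_r, head_c in heads:
        kernel = [
            [exp(-((r - head_r) ** 2 + (c - head_c) ** 2) / (2 * sigma ** 2))
             for c in range(width)]
            for r in range(height)
        ]
        total = sum(sum(row) for row in kernel)
        for r in range(height):
            for c in range(width):
                density[r][c] += kernel[r][c] / total
    return density


def count_from_density(density: DensityMap) -> float:
    """Integrating a discrete density map is just summing all of its pixels."""
    return sum(sum(row) for row in density)


# Made-up head annotations (row, col) in a 10x10 image.
heads = [(2, 2), (5, 7), (8, 3)]
density = gaussian_density_map(10, 10, heads, sigma=1.5)
estimated_count = count_from_density(density)
print(round(estimated_count, 6))
```

Because each person contributes a total mass of 1 spread over nearby pixels, the sum still comes out right when blobs overlap, which is what makes density maps attractive for dense crowds.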

(Credits to the authors from the paper referenced below) The structure of the proposed multi-column convolutional neural network for crowd density map estimation

Evaluation
The trained model can be transferred easily to another scene with a different perspective and different scales of people. In other words, it can be used more generally across scenes, unlike the models discussed above. Typically, fine-tuning the last layers of the model for the specific scene boosts accuracy, but the model should still give reasonable results without it.

Reference
https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/Zhang_Single-Image_Crowd_Counting_CVPR_2016_paper.pdf

Which methods should I use?

In my context, with only a basic understanding of machine learning concepts and limited time, I had neither the expertise nor the time to train my own crowd counting model. I had to rely on pre-trained models for my project.

These were the factors I considered when choosing a method:

Generality of model (can it be used in different scenes of different crowd sizes?)
Ease of Use (how easily can I serve it over an API so others can access it?)
Documentation (is there enough info on how to use it?)
Source Code (is the code readily available?)
Accuracy (are the errors reasonable enough to accept?)

After considering these factors, I decided to use pre-trained convolutional neural network (CNN) models for my project, mainly because of their generality and the fact that the source code for various CNNs is open source and available on GitHub, such as the one here by uestcchicken. Thanks, open source!