Building a Crowd Counting model using Deep Learning
The year 2020 feels like it has been swallowed up by the COVID-19 pandemic. The virus seems to be getting more infectious by the day, and countries are seeing multiple recurrent waves with spikes in active cases. Even countries considered developed, or those with superior health infrastructure, have not yet managed to get a complete grip on it.
One of the most effective ways to control its spread has been social distancing, which breaks the chain of transmission. Yet it seems that not everyone is ready to follow such basic prevention measures. The need for crowd monitoring is especially pressing in a country like India, where the population density is already high, which is also reflected in the total number of COVID-19 cases: 8.73 million as of now.
Keeping this in mind, we can use our knowledge of deep learning to try to help the concerned health authorities.
What is Crowd Counting?
Crowd counting is a technique for estimating the number of people in an image or a video stream. Visual counting is an open-set problem, i.e., the number of people present can fall anywhere in the range [0, +infinity). For example, look at the image below and try to predict how many people are in the picture (even though I know most of you will just be making a wild guess xD).
In this picture, the density of people is very high, which makes accurately predicting the number of people a huge and taxing task for our brain. If we started counting from, say, the top left and moved progressively to the right, we would most likely lose track somewhere in the middle. But a machine can do it. Just feed it the data and it will learn from the occurrences (that's basically what Machine Learning is) and give us a nearly precise count.
Crowd Counting — Methods and Techniques
Since the problem was first posed, several techniques have been used to tackle it. Using basic Machine Learning and Computer Vision approaches such as object detection, regression, and density estimation, computer scientists developed solutions to predict crowd density. But these had to deal with challenges like variations in scale, non-uniform density, etc. Later, with the advent of Convolutional Neural Networks (CNNs), these challenges were largely overcome and attention shifted towards them.
Let's discuss the major methods and techniques employed to get the closest estimate of the number of people in a crowd.
1. Detection based methods
In this method, we slide a window-like detector over an image or video frame to identify people using different classifiers, and then count the detections. Well-trained classifiers are required so that they can extract low-level features (which include edges and blobs).
These detection based methods work well for detecting faces, but they do not give satisfactory results when a dense crowd is present in a picture or video, because the target features are not clearly distinguishable and/or visible in a dense crowd.
2. Regression based methods
Counting by detection does not perform well when the crowd is dense and the background clutter is high. Regression-based methods can overcome those challenges, since they work from low-level features rather than from individual detections.
Patches are cropped from an image, and low-level features such as edge values, foreground pixels, etc. are extracted from those patches. Regression methods then map the input directly to a scalar count. The problem with these methods is that they do not capture how the crowd is distributed across the image. Density based methods overcome this by performing pixel-wise regression, which gives better model performance.
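To make the regression idea concrete, here is a minimal sketch, assuming a toy setup that is not from any real dataset: every "person" is one isolated bright pixel, the features are a foreground pixel count and a crude edge strength, and a plain least-squares fit maps them to a count. All function names here (`patch_features`, `fit_count_regressor`, `predict_count`, `make_toy_patch`) are hypothetical and only for illustration.

```python
import numpy as np

def patch_features(patch, fg_threshold=0.5):
    """Low-level features for one grayscale patch with values in [0, 1]."""
    foreground = float((patch > fg_threshold).sum())  # number of foreground pixels
    # Crude edge strength: total absolute difference between neighbouring pixels
    edges = float(np.abs(np.diff(patch, axis=0)).sum()
                  + np.abs(np.diff(patch, axis=1)).sum())
    return np.array([foreground, edges, 1.0])  # trailing 1.0 acts as a bias term

def fit_count_regressor(patches, counts):
    """Least-squares fit from patch features to a scalar people count."""
    features = np.stack([patch_features(p) for p in patches])
    targets = np.asarray(counts, dtype=float)
    weights, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return weights

def predict_count(patch, weights):
    """Map one patch directly to a scalar count."""
    return float(patch_features(patch) @ weights)

def make_toy_patch(num_people, size=16):
    """Hypothetical toy data: each 'person' is one isolated bright pixel."""
    patch = np.zeros((size, size))
    spots = [(r, c) for r in range(1, size - 1, 2) for c in range(1, size - 1, 2)]
    for r, c in spots[:num_people]:
        patch[r, c] = 1.0
    return patch

train_counts = [1, 2, 3, 4, 6]
weights = fit_count_regressor([make_toy_patch(n) for n in train_counts], train_counts)
estimate = predict_count(make_toy_patch(5), weights)
print(f"estimated count: {estimate:.2f}")
```

Notice that the regressor only ever outputs one number per patch. It has no idea where in the patch the people are, which is exactly the limitation that density based methods address.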
3. Density estimation based methods
Density estimation based methods are able to localise the crowd. They do not focus on explicitly detecting each individual.
First of all, density maps are created for the objects. The approach focuses on the density and localisation of the crowd: it traverses the images and learns the mapping between local features and object density maps. The full density map of an image is obtained by concatenating the density maps of its patches. A random forest regressor can be used to learn the non-linear mapping.
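As a rough illustration of what a density map is, the sketch below places a normalised Gaussian blob on every annotated head position, so that the map sums to the number of people. It assumes a fixed kernel size and sigma chosen purely for illustration; real pipelines may pick these differently (for example, adapting them to how crowded the neighbourhood is). The names `gaussian_kernel` and `density_map` are hypothetical.

```python
import numpy as np

def gaussian_kernel(size=15, sigma=4.0):
    """Square Gaussian kernel normalised so that its values sum to 1."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return kernel / kernel.sum()

def density_map(shape, head_points, size=15, sigma=4.0):
    """Build a density map of the given (height, width) from (row, col) head points."""
    half = size // 2
    kernel = gaussian_kernel(size, sigma)
    # Pad so that kernels near the border can be added without index errors
    padded = np.zeros((shape[0] + 2 * half, shape[1] + 2 * half))
    for r, c in head_points:
        # In padded coordinates this window is centred on the head position
        padded[r:r + size, c:c + size] += kernel
    # Crop back to the image size (blobs cut off at the border lose some mass)
    return padded[half:-half, half:-half]

heads = [(20, 20), (30, 40), (50, 10)]  # made-up annotations
dmap = density_map((64, 64), heads)
print(dmap.shape, dmap.sum())
```

Because each blob integrates to 1, summing the map gives the count, while the position of the blobs tells us where the crowd is.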
4. CNN based methods
Since the arrival of CNNs, these have been the most reliable methods, achieving better accuracy than the conventional approaches discussed above. In computer vision, many CNNs have been designed specifically to deal with the crowd density problem. Let's divide them into groups for proper understanding:
- Basic CNNs: These models need only introductory Deep Learning knowledge, comprising basic convolutional layers, kernels, and pooling layers.
- Scale-aware models: These are more robust and powerful CNNs that use multi-column or multi-resolution architectures.
- Multi-task CNN frameworks: These not only give the count of people but can also perform tasks like crowd-velocity estimation, foreground-background subtraction, etc.
CrowdNet is another popular CNN that captures both the low-level and high-level features of an image. It is a combination of a deep and a shallow convolutional network. The dataset is augmented so that the network learns scale-invariant representations, which overcomes challenges faced by other counting methods. It captures the high-level semantics needed for dense crowd counting and returns density maps.
CSRNet, the deep convolutional network that we are going to implement here, is one of the most widely used models for counting problems. It is capable of extracting high-level features and generating high-quality density maps without expanding the network complexity, as shown in the image below.
CSRNet Architecture:
CSRNet uses VGG-16 as its front end because of its strong transfer-learning ability. The output of the VGG front end is 1/8th the size of the original input. Dilated convolutional layers are used in the back end of CSRNet.
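To see where the shrinkage in output size comes from, here is a small arithmetic sketch. It assumes the front end keeps three 2x2 max-pooling layers with stride 2 (my reading of the CSRNet design, not something verified against the code here), each of which halves the height and width. The helper name `front_end_output_size` is hypothetical.

```python
def front_end_output_size(height, width, num_pools=3):
    """Spatial size after `num_pools` stride-2 poolings (floor division, no padding)."""
    for _ in range(num_pools):
        height, width = height // 2, width // 2
    return height, width

out_h, out_w = front_end_output_size(768, 1024)
print(out_h, out_w)
```

Under that assumption, a 768 x 1024 image produces a 96 x 128 feature map, i.e. one eighth of the input along each side.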
Now, obviously one would ask, "What exactly are these dilated convolutional layers?" Well, to explain that, look at the above image.
Dilated convolutions are used to enlarge the area covered by the kernel without increasing the number of parameters. If the dilation rate is 1, we take the kernel as it is and convolve it over the image, exactly like a standard convolution. If the dilation rate is changed to 2, the kernel spreads out during convolution, leaving a gap of one pixel between its elements, as shown in the above figure.
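The effect of the dilation rate can be sketched in plain NumPy. This is only an illustrative, unoptimised version (square kernel, no padding, stride 1) and not how a deep learning framework implements it; `effective_kernel_size` and `dilated_conv2d` are hypothetical names.

```python
import numpy as np

def effective_kernel_size(k, dilation):
    """Side length of the area a k x k kernel covers at a given dilation rate."""
    return k + (k - 1) * (dilation - 1)

def dilated_conv2d(image, kernel, dilation=1):
    """Slide a square kernel over the image, sampling every `dilation`-th pixel."""
    k = kernel.shape[0]
    span = effective_kernel_size(k, dilation)
    out_h = image.shape[0] - span + 1
    out_w = image.shape[1] - span + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Strided slice picks k x k pixels spread over a span x span area
            window = image[i:i + span:dilation, j:j + span:dilation]
            out[i, j] = (window * kernel).sum()
    return out

image = np.arange(49, dtype=float).reshape(7, 7)
kernel = np.ones((3, 3))  # still only 9 weights, whatever the dilation rate
standard = dilated_conv2d(image, kernel, dilation=1)
dilated = dilated_conv2d(image, kernel, dilation=2)
print(standard.shape, dilated.shape)
```

With a 3 x 3 kernel, dilation rate 1 covers a 3 x 3 area and rate 2 covers a 5 x 5 area, yet both use the same 9 parameters.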
Building our Crowd Counting Model
So now the most interesting part awaits us, i.e., building the model from scratch. Without further ado, let's get started!!
We are going to implement CSRNet on the ShanghaiTech dataset, which is popular for crowd-related tasks. It has 1198 annotated images of people in crowds. You can download ShanghaiTech Dataset from here.
Also, make sure you have CUDA and PyTorch installed on your system, as they are prerequisites for working with the model.
git clone https://github.com/leeyeehoo/CSRNet-pytorch.git
Use the command above to clone the CSRNet-pytorch repository. Then move the dataset into the cloned repository, unzip it, and change the root path to point to it.
Generating density maps for each image in part A of the dataset takes a considerable amount of time, and a more powerful machine is needed to finish it faster. So this part may be a bit frustrating for you…arghhhh!!
Now let us take a sample image from the dataset and look at it before generating its ground-truth heatmap.
plt.imshow(Image.open(img_paths[0]))
Next, the block of code below gives us the density heatmap of our sample image, with the output as shown below.
import h5py
import numpy as np
from matplotlib import cm as CM
gt_file = h5py.File(img_paths[0].replace('.jpg', '.h5').replace('image', 'ground-truth'), 'r')
groundtruth = np.asarray(gt_file['density'])
plt.imshow(groundtruth, cmap=CM.jet)
To count the number of people in the sample image, write this code:
np.sum(groundtruth)
We get the output: 270.32568
Similarly, we can generate the values for part_B.
We are done generating the ground-truth values for the images. Now let's train the model!!
cd CSRNet-pytorch
python train.py part_A_train.json part_A_val.json
The training is going to take a lot of time. From my experience, if you don't want to wait that long, I suggest running 100 epochs instead of 1000, and maybe also reducing the size of the dataset. This can decrease the accuracy of the model, but if you don't have a stronger GPU, these steps will help.
Finally, it's time to check our model's performance on unseen data (testing the model).
To check the MAE (Mean Absolute Error), i.e., our model evaluation metric, we average the absolute difference between the predicted and actual counts over the test images. We achieve an MAE of around 75, which is pretty good and shows that CSRNet works well for crowd counting.
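For reference, the metric itself is simple to compute. The snippet below is a minimal sketch of the MAE formula only, not the evaluation script from the repository, and the counts in it are made-up numbers for illustration. In practice each predicted count would come from summing the predicted density map of a test image.

```python
import numpy as np

def mean_absolute_error(predicted_counts, true_counts):
    """Average absolute difference between predicted and actual people counts."""
    predicted = np.asarray(predicted_counts, dtype=float)
    truth = np.asarray(true_counts, dtype=float)
    return float(np.abs(predicted - truth).mean())

# Made-up counts for three test images, purely for illustration
predicted = [100.0, 250.0, 40.0]
actual = [110.0, 240.0, 45.0]
mae = mean_absolute_error(predicted, actual)
print(f"MAE: {mae:.2f}")
```

A lower MAE means the predicted counts are, on average, closer to the real ones.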
Once this is done, we will finally be able to predict the number of people in an image quite close to the actual number.
As we can see, our prediction is quite close to the actual observed value, which shows that the model's performance is good. The model is finally implemented successfully!!
End notes
As we have seen, crowd counting has many diverse applications, and such software can be used to alert the health authorities when a dense crowd forms, either to maintain social distancing or to prevent a stampede. The significance of these models has increased a lot during 2020, and more research is going to happen in this area to make better predictions.
I hope you guys understood the implementation of the CSRNet based model and are motivated to work in Deep Learning and Machine Learning…