Social Distancing Detector using Deep Learning and Depth Perception

Osama Fawad · Published in Analytics Vidhya · 9 min read · Jul 3, 2020
Figure 1: Output image highlighting in red those persons violating the safe distance recommended by WHO

Abstract

The world is going through a tough time due to the highly contagious COVID-19 pandemic. Although this novel member of the coronavirus family has no vaccine as of now, the most effective defence lies beyond any pharmaceutical intervention: the centuries-old practice of social distancing. Since the virus spreads through exhaled air, everyone in close proximity to an infected person is highly vulnerable to infection. Keeping a safe distance from everyone in public is therefore the easiest way to stay protected against the virus.

This software aims to monitor the distance between every pair of people in a given frame, and to calculate and highlight the number of people not maintaining the safe distance recommended by the World Health Organization (WHO). It could be deployed in any public area, such as roads, streets, marketplaces, offices, malls, and mosques, to ensure that precautionary measures are strictly followed so that life can return to normalcy.

There is a general misconception that the only soldiers in this war are the health workers. The AI community can fight from the front line too, and there are countless examples where AI has been deployed: predictive models that raise awareness, Reinforcement Learning used to search for optimal protein structures in drug development, and computer-vision-based diagnosis systems. The proposed software is a vision-based monitoring system that works on a live camera feed and displays the status of violations in real time.

Introduction

The need for social distancing has never been as great as it is today. It is a practice that affects not only the individual observing it, but the whole world, through the chain reactions this small act sets off. This ancient remedy has been applied over the centuries in many viral pandemics. With less globalization and more observable symptoms, the spread was comparatively easier to control back then; today, in this age of globalization, no place is safe. There is, however, one positive side: technology. Never before could people and governments rely on technology to enforce such measures, and that is precisely the purpose of this project.

This computer-vision-based detection system highlights, in a given video, the people who are violating the safe distance advised by the World Health Organization. Apart from reporting the overall status, for example the number of people following or violating the guidelines, it highlights each violating individual on screen so the authorities can take action. The system can record data at the place where it is deployed, for example a mall, so that management and authorities can make decisions based on it; the same data could also feed other analytics projects. This project is a proof of concept: without some hardware calibration it is not possible to measure exact distances, so the results will carry a small error, which can be tolerated for this purpose.

Violating social distancing and other precautionary measures can be fatal, yet such violations are very common in almost all public places; despite the gravity of the situation, few have turned to technology to tackle them. To the author's knowledge, there are currently no commercially deployed computer-vision systems that monitor and enforce social distancing. The problem is not difficult, nor are the pipelines used here new to the computer vision community. However, because the disease is so recent, little related work has been published, and this lack of literature makes it difficult to know which problems computer vision scientists are currently facing in solving the issue. This project aims to give back to the community in the form of a prototype that future researchers can improve upon.

Methodology

Object detection algorithms are used to detect and localize persons within the image frame. Faster R-CNN, SSD (Single Shot Detector), and YOLO (You Only Look Once) are the three most common architectures for object detection problems. Each has its own advantages, but the one used in this project is SSD, which is the least computationally expensive and the fastest to process: it was the only one that ran flawlessly and smoothly on a CPU in almost real time. Among the many CNN backbones available, the one used here is MobileNet, developed by Google, which stands out for being very light and fast.

The COCO dataset is a publicly available dataset with over 80 classes, and the MobileNet SSD used here was trained on it. Since only one class, 'person', was needed, all other classes were filtered out. Rather than training from scratch, the pre-trained weights and architecture were imported via transfer learning. The implementation was done in Python, thanks to its easy-to-use libraries such as TensorFlow, Keras, and OpenCV.
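The loading and filtering step can be sketched with OpenCV's dnn module. The model file names below are assumptions (a COCO-trained TensorFlow MobileNet-SSD and its OpenCV config), not the exact files used by the project, and in COCO-trained TensorFlow SSD models the 'person' class has ID 1:

```python
# Minimal sketch: load a COCO-trained MobileNet-SSD via OpenCV's dnn
# module and keep only 'person' detections. File names are assumed.
import cv2
import numpy as np

net = cv2.dnn.readNetFromTensorflow("frozen_inference_graph.pb",
                                    "ssd_mobilenet_v2_coco.pbtxt")

PERSON_CLASS_ID = 1   # 'person' in the COCO label map of TF SSD models
CONF_THRESHOLD = 0.5

def detect_persons(frame):
    """Return a list of (x1, y1, x2, y2) boxes for detected persons."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, size=(300, 300),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    detections = net.forward()  # shape: (1, 1, N, 7) per detection row:
                                # [batchId, classId, score, x1, y1, x2, y2]
    boxes = []
    for det in detections[0, 0]:
        class_id, score = int(det[1]), float(det[2])
        if class_id == PERSON_CLASS_ID and score >= CONF_THRESHOLD:
            # Box coordinates are normalized; scale back to pixels.
            x1, y1, x2, y2 = (det[3:7] * np.array([w, h, w, h])).astype(int)
            boxes.append((x1, y1, x2, y2))
    return boxes
```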

Once the persons in a frame are detected, the distance of each person from every other is computed as a Euclidean distance. A threshold is set, and everyone closer to someone else than the threshold is highlighted with a red bounding box, with the total number of violators displayed on screen. The threshold, i.e. the minimum safe distance, is 1 metre, as recommended by the WHO.
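The violation check itself is a pairwise distance computation. The sketch below assumes the per-person points are already in metric 3-D coordinates; the depth estimation described next is what makes that conversion possible:

```python
# Sketch of the violation check: pairwise Euclidean distances between
# person coordinates, flagged against a 1-metre threshold.
import numpy as np
from scipy.spatial.distance import cdist

MIN_DISTANCE = 1.0  # metres, per WHO guidance

def find_violators(points):
    """points: (N, 3) array of per-person 3-D coordinates in metres.
    Returns the indices of persons closer than MIN_DISTANCE to anyone."""
    dists = cdist(points, points)    # (N, N) pairwise distance matrix
    np.fill_diagonal(dists, np.inf)  # ignore each person's self-distance
    return set(np.where((dists < MIN_DISTANCE).any(axis=1))[0])
```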

The best way to obtain distance would be to fuse the vision pipeline with depth sensors such as LiDAR, radar, or sonar, but since this is a purely software-based project that option is out of scope. Camera calibration is required wherever the system is deployed, and the calibration will vary from camera to camera. For depth, a pinhole camera model and the similar-triangles approach were used to estimate each person's depth from the camera, which resulted in more accurate distance calculations between persons.

Figure 2: Depth using camera
Figure 3: The equations for Focal Length, Depth and Height

Figure 2 shows the diagram on which the similar-triangles approach is based, and Figure 3 gives the equations used to find each person's depth from the camera. One assumption had to be made: the focal length. Physical calibration was not possible because the test video did not come with camera specifications or other external values; anyone applying this in their own environment can measure those values and calibrate properly. The estimated depths were used to compute 3-D coordinates for each person, and the Euclidean distance between those 3-D coordinates gave the distance between each pair of detected persons. This added feature produced much better depth estimates and, consequently, more accurate inter-person distances.
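A sketch of the similar-triangles depth estimate follows. The focal length and average person height are assumed values, consistent with the assumption described above; with calibration, the focal length can be recovered as F = (P × D) / H from a person of known height H standing at a known distance D with pixel height P:

```python
# Similar-triangles depth (Figures 2-3) and pinhole back-projection.
# FOCAL_LENGTH_PX and AVG_HEIGHT_M are assumptions, not calibrated values.
FOCAL_LENGTH_PX = 615.0  # assumed; calibrate as F = (P * D) / H
AVG_HEIGHT_M = 1.7       # assumed average person height in metres

def person_3d_point(box, frame_w, frame_h):
    """Map a bounding box to an approximate 3-D point (X, Y, Z) in metres."""
    x1, y1, x2, y2 = box
    pixel_height = max(y2 - y1, 1)
    # Depth from similar triangles: Z = (H * F) / P
    z = (AVG_HEIGHT_M * FOCAL_LENGTH_PX) / pixel_height
    # Back-project the box centre through the pinhole model.
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    x = (cx - frame_w / 2.0) * z / FOCAL_LENGTH_PX
    y = (cy - frame_h / 2.0) * z / FOCAL_LENGTH_PX
    return (x, y, z)
```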

Results

The following image shows the final result after detection and distance estimation. People with a red bounding box are violating the social distancing guidelines laid out by the WHO, while those with a green bounding box are at a safe distance from everyone else. People with no bounding box were simply not detected by the model. The overall status is shown in the top-left corner of the screen.

Figure 4: Result on an image

A summary of all detections, including the midpoint of each detected person, each person's depth, and the overall social distancing status, is printed at the end. Figure 5 shows this printed output; it could be used to build another dataset for different kinds of analysis.
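One hypothetical way to turn that printed summary into a reusable dataset is to log it to CSV; the field names below are illustrative, not taken from the original code:

```python
# Hypothetical logger for the per-frame summary, so the output can be
# reused as a dataset for other analyses. Field names are illustrative.
import csv

def log_frame(writer, frame_idx, boxes, points, violators):
    for i, ((x1, y1, x2, y2), (x, y, z)) in enumerate(zip(boxes, points)):
        writer.writerow([frame_idx, i, (x1 + x2) // 2, (y1 + y2) // 2,
                         round(z, 2), i in violators])

with open("distancing_log.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["frame", "person_id", "mid_x", "mid_y",
                     "depth_m", "violating"])
    # call log_frame(writer, ...) inside the per-frame detection loop
```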

Figure 5: Status of all the findings are printed

The detections were then run on a video, revealing a trade-off between speed (FPS, frames per second) and detection accuracy. The figures below show how greater accuracy came at the cost of lower FPS and vice versa. This depends on the input dimensions fed to the network: the larger the input image, the more computation the network performs per frame, which yields greater accuracy but lower speed, as sketched after the figures below.

Figure 6: Output on a video with larger input dimensions
Figure 7: Frames per second for the output video with larger input dimensions
Figure 8: Output on a video with smaller input dimensions
Figure 9: Frames per second for the video with smaller input dimensions
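The trade-off can be measured directly by timing the same network (the `net` from the earlier sketch) at two blob sizes; the sizes below are examples, and the figure counts only inference time, not drawing or I/O:

```python
# Sketch of the accuracy/FPS trade-off: time the same network at two
# input sizes. Larger inputs cost proportionally more compute per frame.
import time

def time_inference(frame, size, runs=50):
    blob = cv2.dnn.blobFromImage(frame, size=size, swapRB=True, crop=False)
    start = time.time()
    for _ in range(runs):
        net.setInput(blob)
        net.forward()
    return runs / (time.time() - start)  # inference-only frames per second

# e.g. compare time_inference(frame, (300, 300))
#      with    time_inference(frame, (512, 512))
```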

For video, each frame was treated individually, and the status printed on screen was per frame, since detections varied from frame to frame. As for the distances, there were some errors caused by optical illusions; these were minimized as far as possible but could not be eliminated entirely.
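A minimal per-frame loop tying together the earlier sketches (`detect_persons`, `person_3d_point`, `find_violators`) might look like this; "test.mp4" is a placeholder path, and the colors follow the red/green convention described above:

```python
# Per-frame video loop reusing the helper sketches defined earlier.
cap = cv2.VideoCapture("test.mp4")  # placeholder video path
while True:
    ok, frame = cap.read()
    if not ok:
        break
    h, w = frame.shape[:2]
    boxes = detect_persons(frame)
    points = np.array([person_3d_point(b, w, h) for b in boxes])
    violators = find_violators(points) if len(points) else set()
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        color = (0, 0, 255) if i in violators else (0, 255, 0)  # red/green
        cv2.rectangle(frame, (x1, y1), (x2, y2), color, 2)
    cv2.putText(frame, f"Violations: {len(violators)}", (10, 30),
                cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 0, 255), 2)
    cv2.imshow("Social Distancing", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()
```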

The full code with a detailed explanation will soon be available on my GitHub. The repository is here: https://github.com/osama-AI/Social-Distancing-AI

Conclusion and Future Improvements

This project was a proof of concept: without some initial calibration, which depends on external parameters, it cannot be error-free, yet despite the assumptions that had to be made the results were close to accurate. Some extra features that could have increased performance were left out because of these limitations. Even without those parameters the system showed strong results, with the potential to serve as a framework for future work of this kind.

The following are some recommendations for future work that could make the system even more accurate:

1. Use a more sophisticated model for object detection, such as Faster R-CNN or YOLO. The CNN backbone used here was MobileNet; heavier backbones such as VGG or Inception have millions more parameters and can extract richer features. The only reason they were not used is their computational cost: these sophisticated models cannot run in real time on a CPU.

2. To get an accurate camera-to-person depth, the ideal technique would be to fuse the vision pipeline with depth sensors such as LiDAR or radar. These sensors are the most accurate and easiest way of measuring depth.

3. To avoid the optical illusions that can make depth estimates inaccurate under different camera perspectives, a perspective transformation can be applied to convert the scene into a bird's-eye view; all persons are then observed from a top view at roughly the same height from the camera. The catch is that although this made computing the distances more accurate, it made the detections less accurate. (A sketch of this idea follows the list.)

4. The Deep SORT algorithm for detection and tracking would make working on video easier: instead of detecting every person anew in each frame, it assigns each detected person a unique ID that persists across all frames. This is specifically useful for tracking applications.

5. The depth estimation method requires one external value, against which all the other distances are calibrated. Since that value was not available, an assumption was made for the focal length; anyone deploying this system is advised to actually measure the focal length of their camera, which differs from camera to camera. The assumption was made so the project could proceed: it is meant as a proof of concept rather than an exact result, and it will give more accurate results if calibrated before use. A stereo camera could also be used to obtain more accurate depths through disparity matching.
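The bird's-eye-view idea from point 3 can be sketched as follows; the four ground-plane points are assumed values that would be chosen manually for each camera, not part of the original code:

```python
# Sketch of the bird's-eye-view transform from point 3. The source
# points are assumed ground-plane corners, picked by hand per camera.
import cv2
import numpy as np

src = np.float32([[480, 300], [820, 300], [1200, 720], [100, 720]])
dst = np.float32([[0, 0], [400, 0], [400, 600], [0, 600]])
M = cv2.getPerspectiveTransform(src, dst)

def to_birds_eye(foot_points):
    """foot_points: (N, 2) image coordinates of each person's feet.
    Returns the same points re-projected into the top-down view."""
    pts = np.float32(foot_points).reshape(-1, 1, 2)
    return cv2.perspectiveTransform(pts, M).reshape(-1, 2)
```

In the top-down view, pixel distances between foot points are roughly proportional to real ground distances, which is why this approach can make the distance computation more accurate even though, as noted above, it can hurt detection quality.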
