Enforcing COVID-19 Regulations through Monitoring Social Distancing: A Computer Vision Approach

Detect, Visualize, and Minimize Risk!

Salwa Al Khatib
Zaka
Apr 21, 2021 · 9 min read


We, at Zaka, hope this blog post becomes irrelevant soon. The sooner the better!

In the meantime, while we’re still in the middle of a global pandemic, let’s examine how we can efficiently enforce COVID-19 regulations, specifically social distancing, to minimize risk and help save lives!

According to the CDC, social distancing is a safety practice that slows the spread of the disease. The recommended distance to keep between yourself and other individuals who are not from your household is 6 feet, whether indoors or outdoors. Social distancing is critical in reducing the spread of COVID-19, since transmission is mainly caused by people coming into close contact with each other. The end goal is “flattening the curve,” i.e. reducing the rate of infection transmission among individuals to alleviate some of the pressure on the healthcare system.

Figure 1. Flattening the curve of COVID-19: how not to overwhelm the healthcare system

Thus, automating the process of monitoring social distancing would be pivotal in enforcing COVID-19 guidelines and ultimately curbing the COVID-19 pandemic. As is the case with other daily life problems, Deep Learning and Computer Vision present a viable and efficient automated solution to this problem. In this blog post, we are going to explore the different aspects of this problem and the ways we went about solving them. Before jumping into the “how”, let’s check out the “what”, i.e. the result we are aiming for.

Figure 2. Our end goal!

So without further ado, let’s get started!

Outline:

  1. Problem Definition
  2. Our Methodology
  3. Results
  4. Conclusion and Future Work

1. Problem Definition

The aim of this computer vision module is to detect social distancing violations from a video feed. To achieve that, we need to estimate the interpersonal distance between pedestrians and compare it to the minimum allowed distance between individuals, which is 6 feet. This tool is especially useful now that cities all around the world are gradually easing lockdowns while still maintaining COVID-19 regulations.

2. Our Methodology

We built a 3-stage pipeline that addresses all the aforementioned subproblems. This pipeline utilizes OpenCV and NVIDIA TLT’s PeopleNet model.

2.1 People Detection and Tracking

The first stage of our pipeline is composed of a people detector and a tracker. Our model was trained on a dataset of 24,000 images with around 500,000 annotated people. We made sure that our data is diverse and covers people from all walks of life to ensure robust, inclusive detection. At Zaka, we use NVIDIA TLT’s purpose-built models and open model architectures as baselines when developing AI solutions specific to our use cases. For this module, we used PeopleNet, which has DetectNet_v2-ResNet18 as its network architecture. The model was trained, pruned to around one-fourth of its original size, retrained, and lastly quantized to INT8.

After building a robust people detector, the next step is to establish tracking and ID assignment so that detected individuals can be followed across the sequential frames of our video feed. To do that, we used NVIDIA’s NvDCF, a multi-object tracker based on discriminative correlation filters (DCF). It assigns a unique ID to each newly detected object and keeps tracking all previously detected ones, defining a search region around each target’s last known location that is large enough for the same target to be found within it in the next frame.
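NvDCF itself ships with NVIDIA’s DeepStream stack, so its internals are out of scope here, but the core idea of assigning IDs by matching detections within a search region can be illustrated with a deliberately simplified, hypothetical centroid matcher (a toy sketch, not NvDCF):

```python
import numpy as np

class CentroidTracker:
    """Toy ID assigner: match each detection to the nearest tracked centroid
    within a fixed search radius; otherwise start a new track."""

    def __init__(self, max_dist=75.0):
        self.next_id = 0          # next unused ID
        self.tracks = {}          # track ID -> last known centroid
        self.max_dist = max_dist  # search radius in pixels

    def update(self, centroids):
        assigned = {}
        unmatched = list(self.tracks.items())
        for c in centroids:
            c = np.asarray(c, dtype=float)
            if unmatched:
                # Nearest previously tracked centroid to this detection.
                k = min(range(len(unmatched)),
                        key=lambda i: np.linalg.norm(unmatched[i][1] - c))
                tid, prev = unmatched[k]
                if np.linalg.norm(prev - c) < self.max_dist:
                    unmatched.pop(k)   # this track is now claimed
                    assigned[tid] = c
                    continue
            # No existing track close enough: a new person entered the scene.
            assigned[self.next_id] = c
            self.next_id += 1
        self.tracks = assigned
        return assigned  # ID -> centroid for the current frame
```

Unmatched tracks simply drop out of this toy version; NvDCF is far more robust, maintaining a correlation-filter appearance model per target.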

2.2 Perspective Transform

Consider the following approach: after detecting all the individuals in a frame using the People Detection model, we simply compute the Euclidean Distance between the centroids of each pair of individuals and compare that to the minimum allowed distance. That might sound like the right way to go about it; however, if we examine this a bit further, we can see that this approach is actually flawed.

We could take this approach only if the center of projection, in other words the camera origin, were equidistant from all the points on the plane we are monitoring. This is almost never the case. Usually, objects (in our case, people) that are closer to the camera appear bigger in the image, whereas those farther away appear much smaller. For instance, two people close to the camera may appear farther apart in the image than they actually are in real life, and the opposite holds for two people far from the camera. This ‘depth’ effect is therefore likely to bias our results and cause a lot of false positives and false negatives.

To overcome this obstacle, we must find a way to warp the view we have of our scene to a top-view perspective, otherwise referred to as bird’s-eye view. This is a technique that is often used in developing autonomous driving technologies. Here, we need to introduce the concept of the homography matrix, which is going to be instrumental in accomplishing this transformation.

A homography matrix is a transformation matrix that relates points on a projective plane seen from one camera view to the same points seen from another camera view. The relative rotation and translation between the two viewpoints can be used to compute this homography.

Figure 3. A transformation between two image planes (source)
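In homogeneous coordinates (standard notation, not from the figure), this relation takes a compact form: a point p = (x, y, 1)ᵀ in one view maps to its counterpart in the other view through a 3×3 matrix H, defined up to scale:

p′ ≃ H · p,  with (x′, y′) = (p′₁ / p′₃, p′₂ / p′₃)

so the warped pixel coordinates are recovered by dividing the first two components of p′ by the third.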

In our case, we are going to use Inverse Perspective Mapping to produce a bird’s-eye view out of our frontal view frame. We can easily perform this using OpenCV:

  • cv2.getPerspectiveTransform: to which we pass the coordinates of the quadrangle vertices of the region of interest (ROI) within our frame, along with the coordinates of the corresponding quadrangle vertices in our output image. This method outputs M, the transformation matrix that maps the input frame to the output frame. It is worth noting that the vertices of the ROI should form a quadrilateral with at least two parallel sides. The best practice for specifying this ROI is to follow the orientation of the image plane, e.g. follow edge or lane lines when dealing with roads.
Figure 4. An example of perspective transformation, where (a) shows the recommended assignment of the ROI and (b) shows the output of the transformation
  • cv2.perspectiveTransform: which takes as input a matrix of points and the transformation matrix M, and performs the bird’s-eye view perspective transformation for us. In this step, we pass the centroids of the detected people in the form of a matrix, as these are the only points we need to transform to proceed (see the sketch below).
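Here is a minimal sketch of the two calls, assuming a hypothetical ROI and a handful of made-up centroids (the coordinates below are placeholders, not values from our deployment):

```python
import cv2
import numpy as np

# Hypothetical ROI: four corners of the monitored ground region in the input
# frame, ordered top-left, top-right, bottom-right, bottom-left.
src = np.float32([[420, 180], [860, 180], [1260, 700], [40, 700]])

# Corresponding corners of the bird's-eye view output image.
out_w, out_h = 400, 600
dst = np.float32([[0, 0], [out_w, 0], [out_w, out_h], [0, out_h]])

# M is the 3x3 matrix mapping the frontal view onto the top view.
M = cv2.getPerspectiveTransform(src, dst)

# Centroids of detected people, shaped (1, N, 2) as perspectiveTransform expects.
centroids = np.float32([[[500, 400], [700, 420], [900, 650]]])
top_view = cv2.perspectiveTransform(centroids, M)[0]  # (N, 2) bird's-eye points
```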

2.3 Interpersonal Distance Estimation

Next, we have to set a minimum allowed distance between individuals that we can monitor. However, setting this distance in pixels is not meaningful and would vary from use case to use case. Thus, we have to automatically estimate, without human intervention, the number of pixels that correspond to 6 feet in the specific scene at hand.

Figure 5. Images showing an example of a certain number of pixels representing widely different distances, where in (a) 800 pixels might accurately represent 6 feet and in (b) 800 pixels is obviously much more than 6 feet
  • People Detection: The first method we suggest is to deduce the height of detected individuals from the height of the bounding boxes produced by the People Detection model. This is the easy, straightforward option. However, the detector can also pick up people whose bodies aren’t entirely visible in the frame, which would skew the scale, and detected individuals standing much closer to the camera than others would skew it as well. While still using this approach, the best remedy is to keep updating the scale with every frame, so that over time we attain a robust estimate that is not dominated by outliers (a minimal sketch of such a running estimate follows this list).
Figure 6. Images showing the People Detection Model in action, where (a) shows a near ideal situation for scale estimation and (b) shows an edge case in which the scale might be skewed
  • Pose Estimation: Another approach that can compensate for the shortcomings of the previous method is to estimate a person’s height in pixels by computing the distances between their body joints, as determined by a pose estimator. The advantage of this method is that it reveals how much of the person’s body is visible in the frame, e.g. if the lowest detected joint belongs to the torso, we know the legs are cut off and can discard that sample.
Figure 7. 2D multi-person pose estimation using OpenPose (source)
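To make the first option concrete, here is a minimal sketch of such a running scale estimate; the class name and the 5.5-foot average-height prior are our illustrative assumptions, not fixed parts of the pipeline:

```python
import numpy as np

AVG_PERSON_HEIGHT_FT = 5.5  # assumed average height; a rough prior

class ScaleEstimator:
    """Running estimate of pixels-per-foot from person bounding boxes."""

    def __init__(self):
        self.samples = []

    def update(self, boxes):
        # boxes: list of (x1, y1, x2, y2) tuples from the people detector.
        for x1, y1, x2, y2 in boxes:
            self.samples.append((y2 - y1) / AVG_PERSON_HEIGHT_FT)

    @property
    def pixels_per_foot(self):
        # The median is robust to outliers such as truncated boxes or
        # people standing unusually close to the camera.
        return float(np.median(self.samples)) if self.samples else None
```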

The perspective transformation we applied beforehand has paved the way for this step of the pipeline. Now that we have a top-view perspective of the centroids in our ROI, we can compute the Euclidean distance between the centroids of each pair of detected individuals. Here, we need to enumerate the unique pairs of centroids to avoid duplicates, which Python’s itertools.combinations() does for us, as sketched below. Finally, we can easily point out which people are violating social distancing regulations.
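Putting the pieces together, a minimal sketch of the final check might look like this (find_violations is a hypothetical helper name; it assumes the top-view centroids and the pixels-per-foot scale from the previous steps):

```python
from itertools import combinations

import numpy as np

MIN_DISTANCE_FT = 6.0  # CDC-recommended minimum distance

def find_violations(top_view_centroids, pixels_per_foot):
    """Return index pairs of people standing closer than the minimum
    allowed distance, measured in the bird's-eye view."""
    min_dist_px = MIN_DISTANCE_FT * pixels_per_foot
    violations = []
    # combinations() yields each unordered pair exactly once, avoiding duplicates.
    for i, j in combinations(range(len(top_view_centroids)), 2):
        d = np.linalg.norm(np.asarray(top_view_centroids[i], dtype=float)
                           - np.asarray(top_view_centroids[j], dtype=float))
        if d < min_dist_px:
            violations.append((i, j))
    return violations
```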

3. Results

Now that our 3-stage pipeline is ready, we can visualize the results!

Here are some frames in which social distancing violations were detected inside our specified ROI; the red bounding boxes indicate people violating social distancing regulations.

Figure 8. Sample Detection (Original Video)
Figure 9. Sample Detection (Original Video)
Figure 10. Sample Detection (Original Video)

Check out the following video for a full demo!

4. Conclusion and Future Work

Let’s quickly summarize all the steps we took so far:

  • Built a people detection model with a tracker to detect all the people in the frame and determine their centroids.
  • Warped the perspective we have of the determined centroids in the ROI using cv2.getPerspectiveTransform and cv2.perspectiveTransform.
  • Estimated the minimum allowed distance in pixels either through the height of bounding boxes or the distance between body joints.
  • Determined the different combinations of centroids and computed the distance between them to finally analyze which people are in danger and which are not.

As you might have noticed, monitoring compliance with COVID-19 health guidelines should be more nuanced than just looking at social distancing practices. Thus, the following directions could be taken in the future:

  • Since social distancing should be practiced along with other preventative measures, a possible addition could be monitoring whether the detected pedestrians are wearing face masks or not. Different types of alerts that reflect different levels of danger can be introduced, whereby more intrusive alerts would be raised for pedestrians violating social distancing and face masks regulations simultaneously.
  • Another possible addition that could increase the accuracy of our scale estimation strategy is to incorporate the gender of detected individuals: females can be assumed to be 5 feet 4 inches (162 cm) tall and males 5 feet 7 inches (171 cm).
  • Lastly, people counting and crowd density estimation can be important additions to this tool to make it more holistic.

Don’t forget to support with a clap!

Do you have a cool project that you need to implement? Reach out and let us know.

To discover Zaka, visit www.zaka.ai

Subscribe to our newsletter and follow us on our social media accounts to stay up to date with our news and activities:

LinkedIn | Instagram | Facebook | Twitter | Medium
