Large Scale COVID19 Contact Tracing using AI and Vision powered GeoLocalization — A.Eye-Vision

Presented here is a smart video surveillance platform that placed among the top 3 award winners of “Supercomputing using AI, ML, Healthcare Analytics based Research for combating COVID19”, a national hackathon conducted under the National Supercomputing Mission by NVIDIA and the Centre for Development of Advanced Computing (CDAC), Govt. of India. Link to the announced results: https://samhar-covid19hackathon.cdac.in/#Result

Team Members — Souham Biswas, Dr. Sanjay Boddhu, Dr. Landis Huffman, A Naveen Kumar. Special thanks to Srimukh for the support.

Multi-stream, real-time identification + tracking + geo-localization of coughing/sneezing people from camera input alone.

The novel coronavirus is highly contagious, with a reproduction (growth) factor of around 2–3. It spreads through coughing, sneezing, and close proximity to infected persons. An example is shown below -

Example showing how droplets can easily spread across people even in conversation. Courtesy of experiments conducted by NHK World-Japan and other researchers

Given this easy spread mechanism, it is important to track and geo-trace affected individuals.

Contact tracing apps (like Aarogya Setu in India) rely on documented patient data to identify the infected. They use Bluetooth to share anonymized IDs with neighboring devices running the same app. If an app instance spots an “infected” ID in the vicinity, the user is notified and their contacts are in turn recursively traced.

Contact tracing this way yields only a loose sense of infection hotspots over long time periods. Given the 14–15 day viral incubation period, the framework also lags behind by that much time.

This motivates a system that traces prominent symptoms directly rather than the confirmed disease (by which point it is already too late); symptoms like coughing and sneezing begin to show very early after the onset of the disease and can be tracked.

The core methodology powering A.Eye-Vision involves real-time processing of multiple video streams to detect, localize, and identify “infected” persons (those showing symptoms like coughing/sneezing), and to track them across streams while performing contact tracing.
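To make the flow concrete, here is a minimal Python sketch of what one per-stream processing loop could look like, assuming OpenCV for frame capture; the detector, symptom classifier, geo-localizer, tracker, and dashboard are injected as placeholders for the proprietary components described in the sections that follow, so this is a structural sketch rather than the actual implementation.

```python
import cv2

def process_stream(stream_url, detector, symptom_model, localizer, tracker, dashboard):
    """Per-stream loop mirroring the stages described above.

    Every component is injected as a callable/object; they stand in for the
    proprietary detector, symptom classifier, geo-localizer, tracker, and
    dashboard discussed in the following sections.
    """
    cap = cv2.VideoCapture(stream_url)
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        for box in detector(frame):                  # pedestrian bounding boxes
            symptom = symptom_model(frame, box)      # "cough" / "sneeze" / "none"
            lat_lon = localizer(frame, box)          # camera-relative -> GPS
            track_id = tracker.update(frame, box, lat_lon)
            dashboard.update(track_id, lat_lon, symptom)
    cap.release()
```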

The pipeline can process up to 6 simultaneous 720p RGB video streams at 10–15 frames/second on a typical laptop GPU, with a maximum localization error of 1 meter and a symptom recognition accuracy of 94%.

Screenshot of the running pipeline in debug mode. Here, it is handling 6 simultaneous 1080p streams.

Deployment scenario -

These video streams are fused in real-time on the server and visualized in the form of an augmented map of the region on the dashboard.

The augmentation includes an overlaid heatmap denoting the probability of infection at any point on the map, along with moving blips on the map (their transparency varying with how many times they have been observed coughing/sneezing) representing individuals currently being tracked by A.Eye-Vision.
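As an illustration only, the kind of accumulation behind such a heatmap can be sketched in a few lines of numpy: each cough/sneeze observation splats a Gaussian bump onto a map-aligned grid, with repeat observations carrying more weight. The grid size, kernel width, and weighting scheme here are assumptions, not the production dashboard logic.

```python
import numpy as np

def accumulate_heatmap(observations, grid_shape=(512, 512), sigma=8.0):
    """Build an infection-risk heatmap from (row, col, weight) observations.

    Each observation is an already map-projected pixel location of a
    cough/sneeze event; weight can grow with repeat observations of the
    same individual. Values are normalized to [0, 1] for map overlay.
    """
    heat = np.zeros(grid_shape, dtype=np.float32)
    ys, xs = np.mgrid[0:grid_shape[0], 0:grid_shape[1]]
    for row, col, weight in observations:
        heat += weight * np.exp(-((ys - row) ** 2 + (xs - col) ** 2) / (2 * sigma ** 2))
    if heat.max() > 0:
        heat /= heat.max()
    return heat

# Example: two events, the second person observed coughing twice (weight 2).
risk = accumulate_heatmap([(100, 120, 1.0), (300, 310, 2.0)])
```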

There is also a provision to retain the last known locations of individuals who move out of tracking regions.

An initial visualization of the dashboard which is visible to the operator/end-user looks something like this -

A portion of the user dashboard UI

These visualizations may be used to assess the risk of a given region and accordingly perform targeted deployment of measures for containment and/or aid delivery.

Localization -

Localization of the tracked persons is done using depth-perception algorithms devised in-house at HERE Technologies. Our algorithms use a combination of machine learning, traditional computer vision, and geometry to perform dense 3D perception from monocular image inputs, followed by aggregation.

An example visualization of this 3D perception algorithm titled “SegSnap” in action can be found in the video below -

SegSnap3D — fast depth-perception algorithm for 3D perception + localization from a single monocular image; devised in-house at HERE Technologies (US Patent applied; authors — Souham Biswas, Dr. Sanjay Boddhu)

Demonstrated in this video is the 3D point-cloud reconstruction of the road and detected traffic signs from a single image. Each point here is also semantically labeled and is instance-aware. This algorithm is very lightweight and highly parallelizable as its implementation is heavily vectorized.
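SegSnap itself is proprietary, but the final step of turning a dense per-pixel depth map into a camera-frame 3D point cloud is standard pinhole-camera geometry and, like the algorithm described above, it vectorizes cleanly. Below is a minimal numpy sketch of that back-projection with assumed camera intrinsics.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a dense depth map (metres) into camera-frame 3D points.

    Pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy.
    Returns an (H*W, 3) array of [X, Y, Z] points; fully vectorized.
    """
    h, w = depth.shape
    vs, us = np.mgrid[0:h, 0:w]
    x = (us - cx) * depth / fx
    y = (vs - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

# Example with assumed 720p intrinsics (focal length ~1000 px).
depth = np.full((720, 1280), 5.0, dtype=np.float32)   # flat scene 5 m away
cloud = depth_to_point_cloud(depth, fx=1000, fy=1000, cx=640, cy=360)
```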

Since our algorithm localizes a target with respect to the camera's reference frame, knowing the GPS latitude-longitude of the camera means the target can also be assigned a latitude-longitude: the camera's GPS position and the target's camera-relative offset are combined to compute the target's final GPS coordinates, which can then be visualized on a map and tracked.
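A minimal sketch of that conversion is shown below: given the camera's latitude/longitude and compass heading, plus the target's depthwise/lateral offset in meters, a local flat-earth approximation is accurate enough at CCTV ranges. The heading convention and function name are assumptions, not the exact production math.

```python
import math

EARTH_RADIUS_M = 6_371_000.0

def camera_offset_to_latlon(cam_lat, cam_lon, cam_heading_deg, depth_m, lateral_m):
    """Convert a camera-relative offset to GPS coordinates.

    depth_m is along the camera's optical axis, lateral_m is to its right.
    cam_heading_deg is the compass bearing of the optical axis (0 = north).
    Uses a flat-earth approximation, fine for offsets of a few tens of metres.
    """
    heading = math.radians(cam_heading_deg)
    # Offset expressed in a local east/north frame.
    east = depth_m * math.sin(heading) + lateral_m * math.cos(heading)
    north = depth_m * math.cos(heading) - lateral_m * math.sin(heading)
    dlat = math.degrees(north / EARTH_RADIUS_M)
    dlon = math.degrees(east / (EARTH_RADIUS_M * math.cos(math.radians(cam_lat))))
    return cam_lat + dlat, cam_lon + dlon

# Example: a pedestrian 12 m ahead and 3 m to the right of a north-facing camera.
lat, lon = camera_offset_to_latlon(28.6139, 77.2090, 0.0, 12.0, 3.0)
```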

This algorithm can be used for localizing pedestrians from typical CCTV and drone viewpoints. The CCTV pipeline is ready.

We currently have an initial real-time version of the pedestrian detection + localization system deployed on the NVIDIA Jetson Xavier platform (https://www.nvidia.com/en-in/autonomous-machines/embedded-systems/jetson-agx-xavier/). A video is shown below.

The footage in this video was captured from a car.

In this video, the two numbers beneath each pedestrian bounding box denote the depthwise distance (along the axis originating at the camera and pointing outward) and the lateral distance (sideways offset from that depthwise axis) of the pedestrian, in meters. The opacity of the box depends on the distance from the camera. This runs at 15 frames/second on the Xavier.
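For reference, here is a hedged OpenCV sketch of how such an overlay could be drawn: the straight-line range is simply the hypotenuse of the depthwise and lateral components, and the box is faded with distance. The fade curve and maximum range are assumptions, not the values used in the deployed system.

```python
import math
import cv2

def draw_pedestrian_box(frame, box, depth_m, lateral_m, max_range_m=40.0):
    """Draw a bounding box whose opacity decreases with distance from the camera."""
    x1, y1, x2, y2 = box
    distance = math.hypot(depth_m, lateral_m)              # straight-line range
    alpha = max(0.2, 1.0 - distance / max_range_m)         # nearer -> more opaque
    overlay = frame.copy()
    cv2.rectangle(overlay, (x1, y1), (x2, y2), (0, 255, 0), 2)
    cv2.putText(overlay, f"{depth_m:.1f}m / {lateral_m:.1f}m", (x1, y2 + 18),
                cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 255, 0), 1)
    cv2.addWeighted(overlay, alpha, frame, 1 - alpha, 0, dst=frame)
    return frame
```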

Detection -

Pedestrian detection is performed by a HERE-proprietary object detection neural net based on the one powering the HERE LiveSense Mobile AI SDK (https://developer.here.com/products/live-sense-sdk).

The detector is extremely fast and is a major factor behind the benchmark results presented above. It is inspired by models like MobileNet and SSDLite, and has a mAP (mean average precision) score of 56%, compared with 54% for the current state of the art.

Since the detector was designed to be deployable on mobile phones, it works brilliantly for the A.Eye-Vision use case, where full GPU hardware is available.
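The HERE detector itself is proprietary, so purely as an illustration, a comparable off-the-shelf SSDLite + MobileNetV3 detector from torchvision (0.13 or later) could be wired into the same slot roughly as follows; COCO label 1 is "person".

```python
import torch
import torchvision

# Off-the-shelf stand-in for the proprietary HERE detector (illustration only).
model = torchvision.models.detection.ssdlite320_mobilenet_v3_large(weights="DEFAULT")
model.eval()

def detect_pedestrians(frame_rgb, score_thresh=0.5):
    """Return [x1, y1, x2, y2] person boxes from an HxWx3 uint8 RGB frame."""
    tensor = torch.from_numpy(frame_rgb).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        out = model([tensor])[0]
    keep = (out["labels"] == 1) & (out["scores"] >= score_thresh)  # COCO 'person'
    return out["boxes"][keep].int().tolist()
```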

Symptom Recognition -

Symptom recognition of coughing and sneezing actions is done using gait recognition. A few points about the model are presented below -

  • Takes in a 5-channel 70x140 image crop of a pedestrian (sourced from the detector). Each channel is a grayscale image of the person at one of the previous 5 time steps (typically 1–1.5 seconds apart).
  • Simple, lightweight architecture with convolutional layers and residual skip connections, followed by dense layers [Conv->Conv->Conv->Conv->Dense->Dense].
  • Our model achieves 94% classification accuracy over a class-balanced validation set of 1300 action-sequence crops.
  • Real-time performance -> 60–70 crops/second on a laptop.

The model is a typical convolutional neural net with 3 classification categories (a minimal architecture sketch follows the list below), namely:

  1. Cough
  2. Sneeze
  3. None (When the person behaves normally)
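A rough PyTorch sketch of the architecture described above (4 conv layers with a residual skip connection, then 2 dense layers, over a 5-channel crop stack) might look like the following; the channel widths, kernel sizes, pooling, and placement of the skip connection are assumptions, since only the layer sequence is specified.

```python
import torch
import torch.nn as nn

class SymptomNet(nn.Module):
    """Cough / sneeze / none classifier over a stack of 5 grayscale crops."""

    def __init__(self, num_classes=3):
        super().__init__()
        self.conv1 = nn.Sequential(nn.Conv2d(5, 32, 3, stride=2, padding=1), nn.ReLU())
        self.conv2 = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.conv3 = nn.Sequential(nn.Conv2d(32, 32, 3, padding=1), nn.ReLU())
        self.conv4 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool2d((4, 4))   # assumption: fixes the head size
        self.fc1 = nn.Linear(64 * 4 * 4, 128)
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.conv1(x)
        x = self.conv3(self.conv2(x)) + x          # residual skip connection
        x = self.conv4(x)
        x = self.pool(x).flatten(1)
        return self.fc2(torch.relu(self.fc1(x)))

# Example: one 5-frame, 70x140 crop stack -> [cough, sneeze, none] logits.
logits = SymptomNet()(torch.randn(1, 5, 70, 140))
```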

The dataset used to design, build, and train the classifier is the BIISC dataset (https://web.bii.a-star.edu.sg/~chengli/FluRecognition.htm).

Dataset Preparation -

The dataset is organized as sets of videos of different persons performing different actions (one person-action per video) in a 16:9 frame, binned into different action classes. The challenge here is that the person occupies a very small fraction of the frame, like the one shown below.

Sample frame taken from one of the video files sourced from the BIISC dataset. Notice how the person occupies such a small fraction of the image.

Since we will be performing gait recognition on the bounding box crops coming from the detector, we have to prepare a similar dataset from these videos.

This means the bounding box crops surrounding the person need to be extracted.

We devised a simple and fast method to do this, leveraging the difference of consecutive images.

Extraction of person crop from the input frame using difference of images.

Shown above is a visualization of the process. The following steps are followed to extract the person bounding box crop (a minimal code sketch follows the list) -

  1. For a given frame as depicted in (1), we compute the difference of images between the current and the previous frame.
  2. Canny Edge detection is applied on the difference image to obtain image (3).
  3. Using the edges obtained in the previous step, we perform nearest-neighbor clustering to obtain different interest points as shown in (2).
  4. Using these interest points, we compute the bounding box as denoted in (4).
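A simplified OpenCV sketch of this procedure is shown below; it collapses steps 3 and 4 into taking a margin-padded bounding box around all edge pixels, rather than the explicit nearest-neighbor clustering of interest points used in the original pipeline.

```python
import cv2
import numpy as np

def extract_person_crop(prev_frame, curr_frame, margin=10):
    """Crop the moving person using the difference of consecutive frames.

    Simplified version of the steps above: difference image -> Canny edges ->
    bounding box over the edge pixels (with a small margin).
    """
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    curr_gray = cv2.cvtColor(curr_frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(curr_gray, prev_gray)                 # step 1
    edges = cv2.Canny(diff, 50, 150)                         # step 2
    ys, xs = np.nonzero(edges)                               # steps 3-4 (simplified)
    if len(xs) == 0:
        return None                                          # no motion detected
    h, w = curr_gray.shape
    x1, x2 = max(xs.min() - margin, 0), min(xs.max() + margin, w)
    y1, y2 = max(ys.min() - margin, 0), min(ys.max() + margin, h)
    return curr_frame[y1:y2, x1:x2]
```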

Training Plot -

Progress of validation accuracy of symptom recognition with training

Long Term, Multi-View Tracking -

The tracker used in the pipeline allows for re-tracking of individuals who move out of camera-covered regions and later re-enter any camera-covered location.

This is a custom tracker devised at HERE Technologies, which combines the TLD and GOTURN trackers with the power of deep learning.

TLD provides long-term support for tracking the subject, while GOTURN provides the feature signature map, generating visual signatures for cross-frame referencing.
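The fusion logic itself is proprietary, but the two building blocks named above are available off the shelf in opencv-contrib-python (4.x); GOTURN additionally needs its goturn.prototxt and goturn.caffemodel files on disk. A minimal sketch of instantiating them, without the cross-camera re-identification layer, is shown below.

```python
import cv2

def make_trackers(first_frame, box):
    """Initialize the two off-the-shelf building blocks named above.

    box is an integer (x, y, w, h) tuple. TLD handles long-term re-detection;
    GOTURN is the deep, appearance-based tracker. Requires opencv-contrib-python
    and the GOTURN Caffe model files in the working directory.
    """
    tld = cv2.legacy.TrackerTLD_create()
    tld.init(first_frame, box)
    goturn = cv2.TrackerGOTURN_create()
    goturn.init(first_frame, box)
    return tld, goturn

def track(trackers, frame):
    """Advance both trackers by one frame; returns a (success, box) pair per tracker."""
    return [tracker.update(frame) for tracker in trackers]
```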

Conclusion -

  • This gave us interesting insights into how computer vision, AI, and supercomputing can help solve the most pressing human problems.
  • The competition was a good exercise in testing one's ability to apply ML in new domains.

Next Steps -

  • Expansion to support drone-based surveillance of locations without CCTV hardware.
  • Cloud-based scaling to serve larger areas.
  • Further usage of HERE location technologies and ~30 years' worth of historical location data to derive more insights.
