Deploying computer vision models on the edge

Benedict Evans, tech writer:

“For as long as most people can remember, the tech industry has had a new centre roughly every fifteen years. A model of computing sets the agenda, and the company or companies that win that model dominate the industry, and everyone is scared of them, and then a new model comes along, forms a new centre, and the old model stops mattering. Mainframes were followed by PCs, and then the web, and then smartphones.

Each of these new models started out looking limited and insignificant, but each of them unlocked a new market that was so much bigger that it pulled in all of the investment, innovation and company creation and so grew to overtake the old one.”

It’s been 15 years since AWS was founded. In that time, the cloud has transformed how businesses store and process data. However, while convenient and cheap, the cloud consumes enormous bandwidth and adds latency, since data has to be sent to the cloud for processing. This may not matter for many use cases, but for computer vision it can be an operational challenge. An emerging alternative is edge computing.

Edge computing performs computation where the data is collected. Because the raw data never has to be transferred to the cloud, network latency, bandwidth needs and cost all drop. This becomes increasingly significant as we deploy more IoT devices and AI models at the edge.

Take, for instance, a video analytics use case that requires real-time inference. Deploying the AI on an AWS EC2 p3dn.24xlarge instance costs around US/SGD 253k over 3 years (excluding network and data storage costs). The compute power of this instance is roughly equivalent to 240 NVIDIA Jetson Xavier NX edge devices, which cost around US/SGD 138k (~50% lower). On top of the hardware savings, the network for edge devices costs only about US$6 per device annually, so the overall cost savings can be large. The Jetson series has also been improving steadily since its launch in 2014, and besides NVIDIA, Intel, Huawei and Xilinx are also in the edge computing arms race, increasing the pace of development.
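As a back-of-the-envelope check on those figures (all numbers are taken from the estimates above and will vary with region, discounts and hardware pricing):

```python
# Rough 3-year cost comparison between a cloud GPU instance and a fleet of edge devices,
# using the indicative figures quoted above.
YEARS = 3
NUM_EDGE_DEVICES = 240

cloud_cost = 253_000                          # one EC2 p3dn.24xlarge for 3 years (compute only)
edge_hardware = 138_000                       # ~240 Jetson Xavier NX devices
edge_network = NUM_EDGE_DEVICES * 6 * YEARS   # ~US$6 per device per year

edge_total = edge_hardware + edge_network
print(f"Cloud: {cloud_cost:,}  Edge: {edge_total:,}  "
      f"Savings: {1 - edge_total / cloud_cost:.0%}")   # -> around 44% cheaper over 3 years
```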

That said, edge computing is not without drawbacks. Centralised cloud computers will always have a far higher performance ceiling than edge computers. Improvements in cloud infrastructure and the natural economies of centralisation mean that cloud costs will continue to fall and remain competitive. Meanwhile, edge computers that operate in the field need to be ruggedised (e.g. protective casing, cooling systems) and power-budgeted (e.g. battery packs that need to be swapped every few days), all of which adds operational complexity.

We shouldn’t see edge and cloud as competitors. Cloud and edge computing complement each other, specialising in different types of computational workloads and use cases. It is also getting easier to integrate edge compute devices with the cloud, facilitating cross-talk between the two. For example, AWS IoT is a cloud-based service for managing and integrating IoT and edge devices.

AI at the edge

A major use case of edge computing is AI. Traditionally, AI is deployed centrally, especially for large and complex models with high computation requirements. Over time, however, more efficient neural nets, coupled with the increasing power and falling cost of edge devices, have made running AI at the edge viable.

The VA team in GovTech’s Data Science and Artificial Intelligence Division saw the potential of edge AI deployment and has been working on such use cases since 2019. The first of these is Balefire, a multi-stage smoking detection AI algorithm. While its latency requirements and bandwidth costs are not as extreme as, say, autonomous vehicles, there is still a need to optimise the AI algorithm for the edge so that it can run inference on video at a high frame rate (frames per second, FPS).

Bale… what?

Balefire is a pipeline of AI models that takes in a video feed and detects and counts smoking instances for trend analysis. It was first built and field-tested in 2019 with a partner agency. That version of Balefire implemented the following 4-model inference pipeline (Fig 1):

  1. For each video frame, a person detector extracts the persons in the frame
  2. For each video frame, a tracker uses bounding box coordinates and image embeddings to determine whether each person is new or has appeared in previous frames
  3. For each detected person, a pose estimator locates the face on the person
  4. For each face, a classifier identifies whether the person is smoking or not
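For readers who prefer pseudocode, here is a minimal sketch of that per-frame loop. The wrapper classes and helper below (PersonDetector, Tracker, PoseEstimator, SmokingClassifier, crop) are illustrative placeholders, not the actual Balefire code.

```python
# Illustrative sketch of the original 4-stage, per-frame inference loop.
# The wrapper objects are hypothetical; they stand in for the real models.

def crop(frame, box):
    x1, y1, x2, y2 = box
    return frame[y1:y2, x1:x2]

def process_frame(frame, person_detector, tracker, pose_estimator, classifier):
    smoking_track_ids = []

    # 1. Detect all persons in the frame
    person_boxes = person_detector.detect(frame)

    # 2. Match detections to existing tracks using boxes + image embeddings
    tracks = tracker.update(frame, person_boxes)

    for track in tracks:
        # 3. Locate the face region on each tracked person via pose estimation
        face_box = pose_estimator.face_region(frame, track.box)
        if face_box is None:
            continue
        # 4. Classify the face crop as smoking / not smoking
        if classifier.is_smoking(crop(frame, face_box)):
            smoking_track_ids.append(track.id)

    return smoking_track_ids
```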

Why consider Balefire on the edge?

Balefire analyses videos to identify and count smoking instances. The initial version was meant to analyse pre-recorded video files, but users asked for Balefire to run on video in real time, which is untenable in the cloud due to latency and network bandwidth requirements. Hence, we pivoted to deploying Balefire on the edge. Latency is low since inference is performed at the point of data collection, and because we send only the inference results rather than the raw data to the cloud, the bandwidth required is only a few hundred bytes per second (as opposed to ~5 Mbps for a 1080p, 24 fps video stream).
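To put a rough number on that difference (assuming, say, a 300-byte payload of results per second):

```python
# Rough bandwidth comparison: streaming raw video vs sending only inference results.
raw_stream_bps = 5_000_000   # ~5 Mbps for a 1080p, 24 fps compressed video stream
results_bps = 300 * 8        # ~300 bytes of results per second (an assumption)

print(f"~{raw_stream_bps / results_bps:,.0f}x less bandwidth")   # on the order of 2,000x
```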

Adapting Balefire to the edge

We tested Balefire on an NVIDIA Jetson AGX Xavier with a test dataset of 2,400 15-second videos, each labelled ‘1’ if it contained a smoking instance and ‘0’ otherwise.

Balefire was originally designed to run on the AWS cloud, which has far more compute resources. Below is the per-model inference speed on the AGX Xavier. Note that the last two models have to be run for every person detected and tracked. Even with just one person in the frame, Balefire inference took more than 1 s per frame. Processing less than 1 frame per second would mean missing many smoking instances.

Rethinking the algorithm

Clearly, we could not simply port the original Balefire onto the edge device without modification. To make Balefire perform well on an edge device like the AGX Xavier, we needed a faster algorithm. To get there, we re-evaluated the necessity of each stage of the pipeline and experimented with new approaches to improve speed (and, along the way, accuracy).

1. Head detection via pose estimation is overkill. In the original Balefire, we assumed we needed person detection, followed by pose estimation on each detected person, to identify head regions. We found that using a face detector directly works just as well (and even better in some cases). Firstly, it is more robust to occlusion, as bodies are larger than heads and more prone to overlapping. Secondly, it handles unconventional poses like squatting and sitting, which pose estimators struggle with. Face detection is also much faster than person detection plus pose estimation. By switching to face detection alone, we saved time and improved detection and tracking accuracy.

2. Simplify entity tracking within a video sequence. The original Balefire tracker used image embeddings extracted from person bounding boxes to measure similarity between boxes across frames. We experimented with removing the embeddings and found that they increased latency without improving performance at all. Furthermore, tracking with face boxes is superior to tracking with person boxes because there is less occlusion. Removing the image embeddings greatly sped up tracking, and tracking with face boxes slightly improved tracking accuracy.

We thus simplified the 4-step algorithm to a 3-step one: face detection, followed by tracking, followed by smoking classification.
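In the same illustrative style as the earlier sketch (again, the wrapper classes are placeholders, not the real code):

```python
# Illustrative sketch of the simplified 3-stage, per-frame loop.
# FaceDetector, Tracker and SmokingClassifier are hypothetical wrappers;
# crop() is as defined in the earlier sketch.

def process_frame(frame, face_detector, tracker, classifier):
    smoking_track_ids = []

    # 1. Detect faces directly -- no person detection or pose estimation
    face_boxes = face_detector.detect(frame)

    # 2. Track faces across frames using bounding boxes only (no image embeddings)
    tracks = tracker.update(face_boxes)

    # 3. Classify each tracked face crop as smoking / not smoking
    for track in tracks:
        if classifier.is_smoking(crop(frame, track.box)):
            smoking_track_ids.append(track.id)

    return smoking_track_ids
```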

We also retrained our face classifier on a larger dataset, leading to an improvement of a few percentage points.

Model optimisations to improve speed

We found that the inputs to our models could be downsampled further without much loss of accuracy. We experimented with reduced input image dimensions and picked the one with the best speed/accuracy tradeoff to cut processing time.
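Below is a minimal sketch of the kind of resolution sweep involved, assuming a PyTorch model already loaded on the GPU (`model` is a placeholder; accuracy at each size would be measured separately on the labelled test videos):

```python
import time
import torch

# Hypothetical sweep over input resolutions to find a good speed/accuracy tradeoff.
candidate_sizes = [(1080, 1920), (720, 1280), (540, 960), (360, 640)]

with torch.no_grad():
    for h, w in candidate_sizes:
        x = torch.randn(1, 3, h, w, device="cuda")

        for _ in range(5):          # warm-up runs
            model(x)
        torch.cuda.synchronize()

        start = time.perf_counter()
        for _ in range(50):
            model(x)
        torch.cuda.synchronize()

        latency_ms = (time.perf_counter() - start) / 50 * 1000
        print(f"{h}x{w}: {latency_ms:.1f} ms per frame")
```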

Our original models were also in PyTorch. We converted the face detector and classifier to TensorRT engines, which are optimised for NVIDIA hardware. The TensorRT models run in FP16, which increases inference speed with minimal accuracy degradation (<1% for our models).
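There are several ways to do such a conversion; one common route, sketched below, is to export the PyTorch model to ONNX and then build an FP16 TensorRT engine with the trtexec tool that ships with TensorRT. The model object, input size and file names here are placeholders.

```python
import torch

# Export the (placeholder) face classifier to ONNX with a dynamic batch dimension.
model.eval()
dummy_input = torch.randn(1, 3, 224, 224, device="cuda")

torch.onnx.export(
    model,
    dummy_input,
    "face_classifier.onnx",
    input_names=["images"],
    output_names=["scores"],
    dynamic_axes={"images": {0: "batch"}},  # allow variable batch sizes later
    opset_version=13,
)

# Then, on the Jetson, build an FP16 engine with trtexec, e.g.:
#   trtexec --onnx=face_classifier.onnx --saveEngine=face_classifier_fp16.trt --fp16
```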

Lastly, since a single frame can contain multiple faces that all need to be classified, we ran batch inference and optimised the face classifier’s TensorRT engine for larger batch sizes. This was significantly faster: per-face latency dropped from 18.2 ms to 6.61 ms at a batch size of 10.
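A simplified version of the batching logic, assuming the face crops have already been resized and preprocessed, and that the TensorRT engine is wrapped in a hypothetical object exposing an `infer()` call:

```python
import numpy as np

BATCH_SIZE = 10  # the TensorRT engine was optimised for batches of up to 10 faces

def classify_faces(face_crops, trt_classifier):
    """Run the smoking classifier over all face crops in fixed-size batches.

    `face_crops` is a list of preprocessed arrays of shape (3, H, W);
    `trt_classifier` is a placeholder wrapper exposing an infer(batch) method.
    """
    scores = []
    for i in range(0, len(face_crops), BATCH_SIZE):
        batch = np.stack(face_crops[i:i + BATCH_SIZE])  # (N, 3, H, W), N <= BATCH_SIZE
        scores.extend(trt_classifier.infer(batch))
    return scores
```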

How much did the model improve?

The speed-up was significant:

Not only are the per-frame model components significantly faster, but, more importantly, the per-person components can now handle many people in a frame just as efficiently. Imagine a frame with 10 persons. The original Balefire would have taken 150 + 100 + 1000x10 = 10,250ms; the new version takes 20 + 28 + 6.61x10 = 114.1ms, a speed-up of almost 100x!

Full-stack solution pipeline

After detecting smokers, we need to store the incidents and notify end users via Telegram. AWS, Azure and Google Cloud all offer suitable services; we decided to test a simple pipeline using AWS IoT, Lambda and S3.

The edge device sends its insights to AWS IoT. Inside AWS IoT, we can write rules to process the incoming insights. We wrote one rule to route each insight to an S3 bucket for storage, and another to invoke an AWS Lambda function that sends a customised message to a Telegram channel, notifying users of smoking activity.
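As an illustration, the Lambda function for the Telegram notification could look something like the sketch below. The IoT rule topic, environment variables and message fields are all placeholders; the actual rule is configured separately in the AWS IoT console.

```python
import json
import os
import urllib.request

# Hypothetical Lambda handler invoked by an AWS IoT rule such as:
#   SELECT * FROM 'balefire/insights' WHERE smoking_count > 0
# TELEGRAM_BOT_TOKEN and TELEGRAM_CHAT_ID are placeholder environment variables.

TELEGRAM_API = "https://api.telegram.org/bot{token}/sendMessage"

def lambda_handler(event, context):
    message = (
        f"Smoking activity detected: {event.get('smoking_count', 0)} instance(s) "
        f"at {event.get('timestamp', 'unknown time')}"
    )

    payload = json.dumps({
        "chat_id": os.environ["TELEGRAM_CHAT_ID"],
        "text": message,
    }).encode("utf-8")

    request = urllib.request.Request(
        TELEGRAM_API.format(token=os.environ["TELEGRAM_BOT_TOKEN"]),
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return {"statusCode": response.status}
```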

Conclusion

Deploying complex AI at the edge is one of the many fun things we do in GovTech DSAID. It comes with different engineering considerations from the cloud: edge computers have less compute and cannot scale as easily, so AI algorithms often need to be optimised. In return, edge computing removes the need for large-scale data streaming in real-time video analytics.

We have exemplified one such edge deployment in this article. Being able to optimise AI models at the edge sets us up for similar future AI deployments that may require real-time responses.

Beyond the single-stage inference capabilities already embedded in smart cameras, AI at the edge is still at an early stage. Further use cases will be enabled as models become faster and more capable while edge hardware becomes cheaper and more powerful. With user demand for AI applications also expected to grow, we will likely see increasing use of edge computing in the near future.
