Edge AI

Ravi Hastantram
Engineering@Nauto
Feb 6, 2020

What does Edge AI really mean? I was asked this question several times over and decided to share my thoughts on the topic. Edge AI commonly refers to the components required to run an AI algorithm locally on a device; it is also referred to as on-device AI. Of late it has come to mean running deep learning algorithms on a device, and most articles tend to focus on only one component, i.e. inference. This article will shed some light on the other pieces of this puzzle.

Experimental Setup

Edge devices are very diverse in their cost and capabilities. To make the discussion more concrete, here is the experimental setup used in this series:

Qualcomm Snapdragon 855 Development Kit [4]
  • Qualcomm Snapdragon 855 Development Kit.
  • Object Detection as the deep learning task to run on an edge device. There are a lot of good articles describing the state of the art in object detection [survey paper]. We will use the Mobilenet SSD model for object detection in this series.
  • Tensorflowjs to quickly run the object detection model in a nodejs environment (a minimal loading sketch follows this list).
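
For reference, here is a minimal sketch of what loading and exercising such a detector with tensorflowjs in nodejs could look like. It assumes the @tensorflow/tfjs-node and @tensorflow-models/coco-ssd packages (the latter wraps a Mobilenet SSD backbone); the exact model hosted for the benchmarks below may differ.

// Minimal sketch: load a Mobilenet-SSD-style detector and run one detection.
const tf = require('@tensorflow/tfjs-node');
const cocoSsd = require('@tensorflow-models/coco-ssd');

async function main() {
  const model = await cocoSsd.load(); // downloads weights on first call

  // Stand-in for a captured frame: a 300x300 RGB image as an int32 tensor.
  const frame = tf.zeros([300, 300, 3], 'int32');

  const predictions = await model.detect(frame);
  // Each prediction looks like { bbox: [x, y, width, height], class, score }.
  console.log(predictions);

  frame.dispose();
}

main();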

Why run AI algorithms on Edge

Why can’t we rely on the cloud to run AI algorithms? After all, scaling resources to run an AI/deep learning model to match your performance needs is easier on the cloud. So why should one worry about running them on an edge device with compute and power constraints? To answer this question, let’s consider two scenarios:

a) Cloud based architecture, where inference happens on the cloud.

b) Edge based architecture, where inference happens locally on a device.

(To keep the comparison as fair as possible, in both cases a nodejs webserver along with tensorflowjs (CPU only) is used; the only difference is that in case a) the webserver runs on an EC2 instance and in case b) the webserver runs locally on an edge device. The goal here is NOT to have an optimized implementation for a platform (cloud or edge) but rather to have a framework for a fair comparison.)

Cloud based architecture

Here is what a cloud based setup would look like; it involves the steps detailed below:

Cloud only Architecture for Inference. (image references at end).

Step 1: Request with input image

There are two possible options here:

  • We can send the raw image (RGB or YUV) from the edge device as it is captured from the camera. Raw images are always bigger and take longer to send to the cloud.
  • We can encode the raw image to JPEG, PNG, or some other compressed format before sending it, then decode it back to a raw image on the cloud before running inference. This approach involves an additional step to decode the compressed image, as most deep learning models are trained with raw images (a small encode/decode sketch follows this list). We will cover more ground on different raw image formats in future articles in this series.
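
To make the trade-off of the second option concrete, here is a tiny sketch of the encode/decode round trip using the jpeg-js package (an assumption for illustration; jpeg-js works on RGBA buffers, so a captured RGB frame would first need an alpha channel added):

// Sketch of option two: JPEG round trip before inference (not used in this series).
const jpeg = require('jpeg-js'); // npm install jpeg-js

const width = 300;
const height = 300;

// jpeg-js expects RGBA buffers (4 bytes per pixel).
const rgba = Buffer.alloc(width * height * 4, 255); // stand-in for a captured frame

// On the edge device: compress before sending.
const encoded = jpeg.encode({ data: rgba, width, height }, 80); // quality 80
console.log('compressed size:', encoded.data.length, 'bytes');

// On the cloud: decode back to raw pixels before feeding the model.
const decoded = jpeg.decode(encoded.data, { useTArray: true });
console.log('decoded:', decoded.width, 'x', decoded.height);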

To keep the setup simple, the first approach (raw RGB image) is used. HTTP is used as the communication protocol to POST an image to a REST endpoint (http://<ip-address>:<port>/detect); a rough client sketch is shown below.
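
As an illustration (not the exact client used for the benchmarks), the POST in Step 1 could be issued from nodejs as follows. The raw frame is base64 encoded, matching what Apache Bench will send later; the field name image and the file name frame.rgb are assumptions made for this sketch.

// Sketch of Step 1: POST a base64-encoded raw RGB frame to the /detect endpoint.
const fs = require('fs');
const FormData = require('form-data'); // npm install form-data

const rawRgb = fs.readFileSync('frame.rgb');   // raw RGB bytes from the camera
const encoded = rawRgb.toString('base64');     // same encoding ab POSTs later

const form = new FormData();
form.append('image', encoded, { filename: 'frame.b64' });

// Replace <ip-address> and <port> with the EC2 instance (cloud case),
// or use localhost:3000 for the edge case.
form.submit('http://<ip-address>:<port>/detect', (err, res) => {
  if (err) throw err;
  let body = '';
  res.on('data', (chunk) => (body += chunk));
  res.on('end', () => console.log('detections:', body));
});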

Step 2: Run inference on cloud

  • tensorflowjs is used to run inference on an EC2 (t2.micro) instance; only a single nodejs worker instance is used (no load balancing, no failover, etc.).
  • The Mobilenet version used is hosted here.
  • Apache Bench (ab) is used to collect latency numbers for HTTP requests. In order to use ab, the RGB image is base64 encoded and POSTed to an endpoint. express-fileupload is used to handle the POSTed image (an approximate sketch of the handler follows this list).
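
Putting the two steps together, the /detect handler looks roughly like the sketch below. It assumes express with express-fileupload (as mentioned above) plus @tensorflow-models/coco-ssd, an upload field named image, and a fixed 300x300 frame; the actual implementation is not reproduced here, so treat this as an approximation.

// Approximate sketch of the /detect endpoint: express + express-fileupload + tfjs.
// The field name 'image' and the 300x300 frame size are assumptions.
const express = require('express');
const fileUpload = require('express-fileupload');
const tf = require('@tensorflow/tfjs-node');
const cocoSsd = require('@tensorflow-models/coco-ssd');

const app = express();
app.use(fileUpload());

const WIDTH = 300;
const HEIGHT = 300;
let model;

app.post('/detect', async (req, res) => {
  // express-fileupload exposes the uploaded part as a Buffer on req.files.
  const base64 = req.files.image.data.toString();
  const rawRgb = Buffer.from(base64, 'base64');

  // Rebuild the HxWx3 image tensor from the raw RGB bytes and run inference.
  const frame = tf.tensor3d(new Uint8Array(rawRgb), [HEIGHT, WIDTH, 3], 'int32');
  const predictions = await model.detect(frame);
  frame.dispose();

  res.json(predictions);
});

cocoSsd.load().then((m) => {
  model = m;
  app.listen(3000, () => console.log('listening on port 3000'));
});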

Total latency (RGB) = HTTP Request + Inference Time + HTTP Response

ab -k -c 1 -n 250 -g out_aws.tsv -p post_data.txt -T "multipart/form-data; boundary=1234567890" http://<ip-address>:<port>/detect

This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking <ip-address> (be patient)
Completed 100 requests
Completed 200 requests
Finished 250 requests
Server Software:
Server Hostname: <ip-address>
Server Port: <port>
Document Path: /detect
Document Length: 22610 bytes
Concurrency Level: 1
Time taken for tests: 170.875 seconds
Complete requests: 250
Failed requests: 0
Keep-Alive requests: 250
Total transferred: 5705000 bytes
Total body sent: 50267500
HTML transferred: 5652500 bytes
Requests per second: 1.46 [#/sec] (mean)
Time per request: 683.499 [ms] (mean)
Time per request: 683.499 [ms] (mean, across all concurrent requests)
Transfer rate: 32.60 [Kbytes/sec] received
287.28 kb/s sent
319.89 kb/s total
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 5.0 0 79
Processing: 530 683 258.2 606 2751
Waiting: 437 513 212.9 448 2512
Total: 530 683 260.7 606 2771
Percentage of the requests served within a certain time (ms)
50% 606
66% 614
75% 638
80% 678
90% 812
95% 1084
98% 1625
99% 1720
100% 2771 (longest request)
Histogram of end to end Inference Latencies for Cloud based architecture (bucket size of 1s). It shows the inference latencies for requests generated by Apache Bench (ab) in a given second.
End to End Inference Latencies for Cloud based architecture sorted by response time (ms). This article [5] explains the difference between the two plots.

As we can see, the 95th percentile request latency is around 1084 ms.

Edge based architecture

The web server (which runs tensorflowjs) now runs locally on an edge device (Qualcomm Snapdragon 855 Development Kit [4]). We repeat the same steps using Apache Bench (with HTTP requests to localhost this time instead of a remote server), and the results are as follows.

ab -k -c 1 -n 250 -g out_device.tsv -p post_data.txt -T "multipart/form-data; boundary=1234567890" http://localhost:3000/detect

This is ApacheBench, Version 2.3 <$Revision: 1843412 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Completed 100 requests
Completed 200 requests
Finished 250 requests
Server Software:
Server Hostname: localhost
Server Port: 3000
Document Path: /detect
Document Length: 22610 bytes
Concurrency Level: 1
Time taken for tests: 80.689 seconds
Complete requests: 250
Failed requests: 0
Keep-Alive requests: 250
Total transferred: 5705000 bytes
Total body sent: 50267750
HTML transferred: 5652500 bytes
Requests per second: 3.10 [#/sec] (mean)
Time per request: 322.755 [ms] (mean)
Time per request: 322.755 [ms] (mean, across all concurrent requests)
Transfer rate: 69.05 [Kbytes/sec] received
608.38 kb/s sent
677.43 kb/s total
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.1 0 2
Processing: 290 323 36.0 317 737
Waiting: 290 322 36.0 316 736
Total: 290 323 36.1 317 739
Percentage of the requests served within a certain time (ms)
50% 317
66% 323
75% 328
80% 331
90% 341
95% 357
98% 397
99% 473
100% 739 (longest request)
Histogram of end to end Inference Latencies for Edge based architecture (bucket size of 1s). It shows the inference latencies for requests generated by Apache Bench (ab) in a given second.
End to End Inference Latencies for Edge based architecture sorted by response time (ms). This article [5] explains the difference between the two plots.

As we can see, the 95th percentile request latency is around 357 ms. Below is a comparison of how the inference latencies stack up against each other.

Comparison of Inference Latency (Edge) vs Inference Latency (Cloud).

Optimization Opportunities

As you can see, the latency numbers are fairly high; the numbers obtained here are more like upper-bound latencies. There are many optimization opportunities, some of which are detailed below:

Cloud based architecture:

  • Have multiple nodejs worker instances and load balance between them (a minimal cluster sketch follows this list).
  • Have multiple deployments (us-east, us-west, etc.) and route the request to the closest deployment.
  • Batch multiple input images and run batched inference on the cloud.
  • Have a GPU-based EC2 instance and use @tensorflow/tfjs-node-gpu to accelerate inference.
  • Use a different communication protocol, like MQTT, geared more towards IoT/cloud connectivity to avoid the overheads of HTTP.
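
As an illustration of the first point, nodejs's built-in cluster module is one simple way to run multiple worker instances behind a single port. This is a minimal sketch, not the setup benchmarked above; ./server is a hypothetical module containing the /detect endpoint.

// Minimal sketch: multiple nodejs workers sharing one port via the cluster module.
const cluster = require('cluster');
const os = require('os');

if (cluster.isMaster) {
  // Fork one worker per CPU core; connections to the shared port are
  // distributed across the workers (round-robin by default on most platforms).
  for (let i = 0; i < os.cpus().length; i++) {
    cluster.fork();
  }
  cluster.on('exit', (worker) => {
    console.log(`worker ${worker.process.pid} exited, restarting`);
    cluster.fork();
  });
} else {
  // Each worker loads its own copy of the model and starts the same webserver.
  require('./server'); // hypothetical module containing the /detect endpoint
}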

Edge based architecture:

  • Have an optimized implementation for your edge device. In this case, for the Qualcomm Snapdragon 855 Development Kit [4], inference would be accelerated on the GPU, DSP, or NPU.
  • Most likely, the on-device implementation would depend on native libraries accessed through vendor frameworks like SNPE or tensorflow-lite.
  • Optimize the data path from camera image capture to feeding the deep learning model for inference.

Conclusion

We looked in detail at one of the factors in deciding whether you need an Edge-based solution. As we saw, if your application is tolerant of cloud latencies, then cloud-based inference would be the quickest way to get going. However, if your application is latency sensitive, then you can consider Edge-based solutions. Be sure to benchmark your particular use case to pick one over the other. In addition to latency, these are some of the other reasons to consider Edge-based solutions:

  • You already have an existing deployment of Edge devices and want to leverage it to save on cloud compute costs.
  • For devices that are not fully connected or have poor connectivity to the cloud, Edge-based solutions become inevitable.

References

[1] https://storage.googleapis.com/openimages/web/visualizer/index.html?set=train&type=detection&c=%2Fm%2F0k4j&id=101c3faac77e2e29 — Car Overlay image from Open Images Dataset V5

[2] https://c2.staticflickr.com/7/6021/6005573548_11b7b17c9b_o.jpg — Original Car image

[3] https://pixabay.com/illustrations/google-pixel-3-google-cell-phone-3738925/ — Pixel Phone image.

[4] https://www.intrinsyc.com/snapdragon-embedded-development-kits/snapdragon-855-hdk/ — Development Kit by Intrinsyc

[5] http://www.bradlanders.com/2013/04/15/apache-bench-and-gnuplot-youre-probably-doing-it-wrong/ — Apache Bench and Gnuplot: you're probably doing it wrong.
