Supercharge deep learning (AI) inferencing with Amazon Elastic Inference & Amazon SageMaker Neo (Part 2)

Girish
5 min read · Feb 15, 2019


Introduction: In the last article, we took a look at the Amazon Elastic Inference (EI) service. In this one, we will take a look at SageMaker Neo. Just like last time, we will run performance tests to measure the benefits SageMaker Neo delivers.

  1. With SageMaker Neo, inference for TensorFlow, Apache MXNet, PyTorch, ONNX, and XGBoost models runs up to twice as fast in many cases.
  2. Beyond speed, SageMaker Neo also provides a lightweight model runtime (around 1.5 MB in size), which enables deployment of sophisticated models to resource-constrained devices.

SageMaker Neo opens the door to many interesting possibilities for AI model deployment!

Image of an Amazon Prime Air drone, used for representational purposes only.

Why do we need a technology like SageMaker Neo?

When deep learning training finishes, the model is typically saved in a native format specific to the training framework. These days, some frameworks also let you export the trained model in the ONNX format, which aims to be the universal deep learning model format.

Problem 1:

However, these formats are hardware agnostic: they carry no information about the hardware platform the model will be deployed on. Significant performance gains can be achieved through tunings specific to that platform, and SageMaker Neo helps you take maximum advantage of the underlying hardware.

Problem 2:

Generic runtimes also carry a lot of functionality and code that is not needed for inference. This bloats their size and makes deployment on low-resource devices harder.

Working with SageMaker Neo:

Model compilation process:
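Under the hood, compilation is a single API call. Here is a minimal sketch of the CreateCompilationJob API as exposed through boto3; the job name, role ARN, S3 paths, and input shape are illustrative placeholders:

```python
import boto3

sm = boto3.client("sagemaker")

# Submit a compilation job: point Neo at a trained model artifact in S3,
# describe its input tensor, and name a target hardware platform.
sm.create_compilation_job(
    CompilationJobName="image-classification-neo",           # illustrative name
    RoleArn="arn:aws:iam::123456789012:role/SageMakerRole",   # hypothetical role
    InputConfig={
        "S3Uri": "s3://my-bucket/model/model.tar.gz",         # trained model artifact
        "DataInputConfig": '{"data": [1, 3, 224, 224]}',      # input name and shape
        "Framework": "MXNET",
    },
    OutputConfig={
        "S3OutputLocation": "s3://my-bucket/compiled/",
        "TargetDevice": "ml_c5",  # an instance family, or an edge target such as rasp3b
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)
```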

Model deployment process:
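Once compiled, the artifact is hosted like any other SageMaker model; the only difference is that the serving container must include the Neo runtime. A minimal sketch with the SageMaker Python SDK, where the image URI and paths are placeholders (SageMaker publishes per-region Neo inference images):

```python
from sagemaker.model import Model

# Host the compiled artifact behind a real-time endpoint. The image URI is a
# placeholder for the region-specific Neo inference container.
neo_model = Model(
    image_uri="<neo-inference-image-for-your-region>",
    model_data="s3://my-bucket/compiled/model-ml_c5.tar.gz",  # compilation job output
    role="arn:aws:iam::123456789012:role/SageMakerRole",      # hypothetical role
)
predictor = neo_model.deploy(initial_instance_count=1, instance_type="ml.c5.4xlarge")
```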

How exactly does SageMaker Neo perform its model optimizations?

Most machine learning frameworks represent a model as a computational graph: a vertex represents an operation on data arrays (tensors), and an edge represents a data dependency between operations.

The Amazon SageMaker Neo compiler exploits patterns in the computational graph to apply high-level optimizations, including:

  1. operator fusion, which fuses multiple small operations together;
  2. constant-folding, which statically pre-computes portions of the graph to save execution costs;
  3. a static memory planning pass, which pre-allocates memory to hold each intermediate tensor;
  4. and data layout transformations, which transform internal data layouts into hardware-friendly forms.

The compiler then produces efficient code for each operator. A toy sketch of the first two optimizations follows.
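Here is that sketch: a toy Python illustration (not Neo's actual implementation) of operator fusion and constant folding for y = relu(x·W·scale + b):

```python
import numpy as np

W = np.array([[2.0, 0.0], [0.0, 2.0]])   # constant weights
b = np.array([1.0, 1.0])                 # constant bias
scale = 0.5                              # constant scaling factor

# Unoptimized graph: four operator nodes, each materializing an
# intermediate tensor; (W * scale) is recomputed on every call.
def graph_naive(x):
    Ws = W * scale            # mul node (operates on constants only!)
    t1 = x @ Ws               # matmul node
    t2 = t1 + b               # add node
    return np.maximum(t2, 0)  # relu node

# Constant folding: (W * scale) depends only on constants, so the
# compiler evaluates it once ahead of time.
W_folded = W * scale

# Operator fusion: matmul + add + relu collapse into a single kernel,
# so intermediate tensors never round-trip through memory.
def graph_optimized(x):
    return np.maximum(x @ W_folded + b, 0)

x = np.array([1.0, -3.0])
assert np.allclose(graph_naive(x), graph_optimized(x))
```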

Demos

Demo 1: Using SageMaker Neo on SageMaker platform

While SageMaker Neo has been open-sourced, the Amazon SageMaker service makes it particularly easy to benefit from Neo.

Amazon SageMaker provides Neo container images for models generated by the Amazon SageMaker XGBoost and Image Classification algorithms, and it supports Amazon SageMaker-compatible containers for your own compiled models.

Please take a look at this notebook (link).

Step 1: For our testing, we train an image classification model with 18 layers. For this demo, we are not particularly concerned with model quality.
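Roughly, the training step looks like the sketch below, using the built-in Image Classification algorithm via the SageMaker Python SDK. The bucket, data channels, and hyperparameter values are illustrative; see the linked notebook for the exact setup.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

sess = sagemaker.Session()
role = sagemaker.get_execution_role()

# Built-in Image Classification algorithm container for this region.
training_image = image_uris.retrieve("image-classification", sess.boto_region_name)

ic = Estimator(
    training_image,
    role,
    instance_count=1,
    instance_type="ml.p3.2xlarge",
    output_path="s3://my-bucket/ic/output",  # hypothetical bucket
    sagemaker_session=sess,
)
ic.set_hyperparameters(
    num_layers=18,             # the 18-layer variant used in this demo
    image_shape="3,224,224",
    num_classes=257,           # illustrative; depends on the training set
    num_training_samples=15420,
    epochs=2,                  # model quality is not the point here
)
ic.fit({"train": "s3://my-bucket/ic/train",            # hypothetical channels
        "validation": "s3://my-bucket/ic/validation"})
```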

Step 2: Deployment of the vanilla model.
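Deploying the uncompiled model is one call on the estimator; a minimal sketch (the serializer makes the endpoint accept raw image bytes):

```python
from sagemaker.serializers import IdentitySerializer

# Host the vanilla (uncompiled) model behind a real-time endpoint.
vanilla_predictor = ic.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.4xlarge",
    serializer=IdentitySerializer("application/x-image"),  # send raw JPEG bytes
)
```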

Step 3: Model compilation using the SageMaker Neo compiler.

It is so easy to use the compiler:
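A minimal sketch with the SDK's compile_model helper; the shapes, paths, and framework version are illustrative:

```python
# Compile the trained model for the ml.c5 instance family.
compiled_model = ic.compile_model(
    target_instance_family="ml_c5",          # compile target: the ml.c5 family
    input_shape={"data": [1, 3, 224, 224]},  # batch, channels, height, width
    output_path="s3://my-bucket/ic/neo",     # hypothetical bucket
    framework="mxnet",                       # the built-in algorithm is MXNet-based
    framework_version="1.2.1",               # illustrative version
)
```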

Step 4: Deployment of the compiled model on a container that includes the SageMaker Neo runtime.
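Deployment mirrors step 2, except SageMaker selects a container with the Neo runtime; a sketch:

```python
from sagemaker.serializers import IdentitySerializer

# Host the Neo-compiled model behind a second real-time endpoint.
neo_predictor = compiled_model.deploy(
    initial_instance_count=1,
    instance_type="ml.c5.4xlarge",
    serializer=IdentitySerializer("application/x-image"),
)
```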

Step 5: The performance test involves sending an image to both inference endpoints at a very rapid pace and then plotting the results. One can clearly see a more than 2x performance boost with Neo. (A sketch of the test loop appears after the setup details below.)

  • Model: image classification with 18 layers
  • Instance type: ml.c5.4xlarge
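A minimal sketch of such a test harness, assuming the two predictors from steps 2 and 4 and a local test image:

```python
import time
import numpy as np

def measure(predictor, payload, n=1000):
    """Send n back-to-back requests and record client-side latency in ms."""
    latencies = []
    for _ in range(n):
        start = time.time()
        predictor.predict(payload)
        latencies.append((time.time() - start) * 1000.0)
    return np.array(latencies)

with open("test.jpg", "rb") as f:  # any test image of the expected size
    payload = f.read()

for name, predictor in [("vanilla", vanilla_predictor), ("neo", neo_predictor)]:
    lat = measure(predictor, payload)
    print(f"{name}: p50={np.percentile(lat, 50):.1f} ms  "
          f"p90={np.percentile(lat, 90):.1f} ms  mean={lat.mean():.1f} ms")
```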

Caution:

The performance tests for demo 1 were launched from the SageMaker notebook server, so the measured latency includes a small component of network transit time between the notebook server and the endpoint host.

Also, p50 (median latency) is typically a better metric than average latency, because the average can fluctuate a lot due to outliers.

Demo 2: Deploying a heavy model on Raspberry Pi using SageMaker Neo

Please take a look at the article by Julien Simon to understand the steps involved in compiling and deploying a model on a Raspberry Pi device. Julien shares the following observations from his performance test with a ResNet-50 model.

“Vanilla model running on MXNet runtime takes 6.5 seconds and requires about 306 MB of RAM.”

“(with SageMaker Neo) this prediction takes about 0.85 second and requires about 260MB of RAM: with Amazon SageMaker Neo, it’s now 5 times faster and 15% more RAM-efficient than with a vanilla model”
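On the device itself, inference with a compiled model goes through the open-source DLR runtime (the lightweight runtime mentioned earlier). A minimal sketch, where the model path and the input name "data" are illustrative:

```python
import numpy as np
from dlr import DLRModel  # pip install dlr

# Load the Neo-compiled ResNet-50 artifacts from a local directory.
model = DLRModel("/home/pi/resnet50-neo", dev_type="cpu")

# Stand-in for a real preprocessed image (NCHW float32).
image = np.random.rand(1, 3, 224, 224).astype("float32")

outputs = model.run({"data": image})  # returns a list of output arrays
print(int(outputs[0].argmax()))       # predicted class index
```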

NOTE: In this case, the performance boost is dramatic. Your mileage will vary from case to case, but performance gains are typically significant.

Does SageMaker Neo cause loss of accuracy?

SageMaker Neo optimizes the execution of model computation, taking into account the underlying hardware platform's features and capabilities. It does not cut any corners, so there is no loss of accuracy.

Conclusion

  • Compiled models run at up to twice the speed, with no loss of accuracy (as we saw for ourselves in the performance tests).
  • Sophisticated models can now run on virtually any resource-limited device, unlocking innovative use cases like autonomous vehicles, automated video security, and anomaly detection in manufacturing IoT applications.
  • Developers can run models on the target hardware without dependencies on the framework.

Train once, run anywhere with Amazon SageMaker Neo.

Related articles: SageMaker Neo announcement
