Supercharge deep learning (AI) inferencing with Amazon Elastic Inference & Amazon SageMaker Neo (Part — 1)

Girish
8 min read · Feb 14, 2019


Introduction:

This is the first article in this series that focuses on deep learning inference performance. We will not only review solutions from AWS, but also conduct performance tests to see results for ourselves (seeing is believing!).

At re:Invent 2018, AWS announced two new products that help you deploy your deep learning models in production with superior inference performance. They take two different approaches to improving performance.

  1. Amazon Elastic Inference (EI) is a hardware-based approach. AWS provides a way to attach GPU slices to EC2 servers as well as to SageMaker notebooks and hosts. GPU slices provide better performance without the cost overhead of using full GPUs.
  2. Amazon SageMaker Neo is a software-based approach that optimizes models for target platforms so they take full advantage of the underlying hardware capabilities. Additionally, SageMaker Neo has been open-sourced for the community to contribute to. This will hopefully make the product feature-rich and sustainable in the long run. There is so much innovation yet to happen in this area!

In this article series, we review both of these approaches. The first article focuses on Elastic Inference (EI). The next will focus on SageMaker Neo.

The notebooks developed for the performance tests are available on GitHub.

If you care about deep learning inference performance but are new to SageMaker, I recommend watching this short video.

Why does deep learning inference performance matter so much?

Imagine your Alexa took 2 seconds to respond to you. That would be a total non-starter, wouldn’t it?

Amazon Echo device image, used for illustration of the idea.

For most AI applications in the cloud, a lot of time is lost in the network round trip over the internet. There is not much you can do about that, so you must make your model inference systems as fast as possible.

Inference for deep learning models is characteristically slow compared to traditional ML models like random forests.

This is due to the millions, and sometimes billions, of mathematical operations that these models have to perform to arrive at a prediction. The table below lists the number of operations involved in some popular computer vision models optimized for mobile phone deployments. As you can see, as the number of multiply-accumulate operations (MACs) grows, latency increases.

More MACs (i.e. Operations) means more latency (source)

Microservices architectures put additional demands on performance.

Many companies want to deploy deep learning models in microservices environments where every single millisecond of latency counts. Below is a sample deployment environment diagram. As you can see, multiple services participate in answering any given request. Hence, it is extremely important for models deployed inside microservices to perform predictions at blazing speeds.

If your model is buried deep down a microservices stack, you have only a few milliseconds to perform inference.

Why does inference cost matter?

Most companies will typically end up spending as much as 90% of their AI resource budgets on deploying models to production and on ongoing inference. Hence, faster, more efficient inference means $$$ saved.

Life before Elastic Inference and SageMaker Neo

Before we dive deep into Elastic Inference and SageMaker Neo, let us quickly review other avenues to achieve faster inference performance.

The following types of options are sometimes available, but you cannot always count on them because they may not apply to your use case.

  1. Choosing the right architecture:

For example, MobileNetV2 is faster by design than MobileNetV1.

If you fix the accuracy expectation at 50%, you will notice that MobileNetV2 is almost twice as fast as MobileNetV1. (source)

2. Trading off accuracy by using smaller models (fewer layers, for example)

Review the results of the CNN benchmarks below. You will observe that within a model family (e.g. ResNet), a higher number of layers gives relatively better accuracy, but at the cost of slower performance. Hence, you may sometimes choose to sacrifice accuracy for better performance. A small timing sketch illustrating both points follows below.
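To make these trade-offs concrete, here is a minimal timing sketch (my own illustration, not from the original notebooks): it compares single-image CPU latency for MobileNetV1 vs. MobileNetV2 (point 1) and for ResNet50 vs. ResNet152 (point 2). It assumes a recent TensorFlow 2.x installation, and absolute numbers will depend entirely on your hardware.

```python
# Minimal sketch (illustration only): compare single-image latency across
# architectures. MobileNetV2 vs. MobileNetV1 shows "faster by design";
# ResNet152 vs. ResNet50 shows "more layers = better accuracy but slower".
import time

import numpy as np
import tensorflow as tf

def median_latency_ms(model, n_runs=20):
    """Time repeated single-image predictions and return the median in milliseconds."""
    dummy_image = np.random.rand(1, 224, 224, 3).astype("float32")
    model.predict(dummy_image)                      # warm-up call
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        model.predict(dummy_image)
        timings.append((time.perf_counter() - start) * 1000.0)
    return float(np.median(timings))

models = {
    "MobileNetV1": tf.keras.applications.MobileNet(weights=None),
    "MobileNetV2": tf.keras.applications.MobileNetV2(weights=None),
    "ResNet50":    tf.keras.applications.ResNet50(weights=None),
    "ResNet152":   tf.keras.applications.ResNet152(weights=None),
}
for name, model in models.items():
    print("%-12s %.1f ms per image" % (name, median_latency_ms(model)))
```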

Amazon Elastic Inference

Deep learning models can be notoriously slow without GPU acceleration.

Problem 1:

During training, samples are batched, taking full advantage of the parallelism that GPUs have to offer. During inference, however, you typically have a batch size of 1 (one inference at a time) or a much smaller batch size than in training. Thus, you end up wasting most of your GPU power.

At a batch size of 1, you waste up to 90% of your GPU power
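To see Problem 1 for yourself, here is a rough sketch (my own, not from the article's notebooks) that measures per-image throughput of a ResNet-18 at batch size 1 versus batch size 32 using MXNet Gluon. It assumes MXNet is installed (ideally a GPU build) and falls back to CPU otherwise; the gap is most dramatic on a GPU.

```python
# Rough sketch (illustration only): per-image throughput at batch size 1 vs. 32.
import time

import mxnet as mx
from mxnet.gluon.model_zoo import vision

ctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu()
net = vision.resnet18_v1(pretrained=False)
net.initialize(ctx=ctx)

def images_per_second(batch_size, n_batches=20):
    """Run n_batches forward passes and return processed images per second."""
    x = mx.nd.random.uniform(shape=(batch_size, 3, 224, 224), ctx=ctx)
    net(x)
    mx.nd.waitall()                  # warm-up; block until compute finishes
    start = time.perf_counter()
    for _ in range(n_batches):
        net(x)
    mx.nd.waitall()                  # MXNet is asynchronous: wait before stopping the clock
    return (batch_size * n_batches) / (time.perf_counter() - start)

print("batch=1 : %.0f images/sec" % images_per_second(1))
print("batch=32: %.0f images/sec" % images_per_second(32))
```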

Problem 2:

Also, in many cases CPUs saturate well before they can keep GPUs 100% loaded with work. This is due to all the feature processing that needs to be done on the CPU before inference can be performed on the GPU.

Hence, wasted GPU power is the rule rather than the exception in deep learning inference.

Amazon Elastic Inference solves this problem by providing “slices” of GPU power, referred to as “EI Accelerators”, that can be attached to an EC2 server or to SageMaker notebooks or hosts of your choice. This allows you to pay only for the amount of GPU power you can actually use, while benefiting from super-fast performance.

In many cases, EI Accelerators allow you to save as much as 75% of cost, while benefiting from superior performance of GPUs.

Architecture:

To benefit from EI, you need to use one of the following:

  1. Amazon EI enabled TensorFlow Serving
  2. Amazon EI enabled Apache MXNet
  3. ONNX models (applied using Apache MXNet)

These framework versions are available on the latest Deep Learning AMI, and you can use them easily on SageMaker through the available containers. Additionally, the SageMaker built-in Image Classification and Object Detection algorithms support EI.

If you are using an EI Accelerator from an EC2 instance, you need to create an AWS PrivateLink endpoint to the EI service. However, if you are using SageMaker, it is just a matter of a configuration setting to use an EI Accelerator with your notebook server or hosted endpoints (we will see a demo soon).

Deployment of EI with EC2. You need to set up a PrivateLink.
With SageMaker, it is just a matter of specifying the configuration option.
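For the EC2 path, the PrivateLink setup amounts to creating an interface VPC endpoint for the EI runtime service. Below is a rough boto3 sketch (my own, not from the article); the VPC, subnet, and security-group IDs are placeholders, and the service name reflects my understanding of the EI runtime endpoint, so verify it for your region in the VPC console before relying on it.

```python
# Rough sketch (illustration only): create the interface VPC endpoint (PrivateLink)
# that an EC2 instance needs in order to reach the Elastic Inference service.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.create_vpc_endpoint(
    VpcEndpointType="Interface",
    VpcId="vpc-0123456789abcdef0",                                   # placeholder
    ServiceName="com.amazonaws.us-east-1.elastic-inference.runtime",  # verify for your region
    SubnetIds=["subnet-0123456789abcdef0"],                          # placeholder
    SecurityGroupIds=["sg-0123456789abcdef0"],                       # placeholder
    PrivateDnsEnabled=True,
)
print(response["VpcEndpoint"]["VpcEndpointId"])
```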

EI Accelerator configurations:

AWS currently offers the EI Accelerator in three configurations. The service is available in the N. Virginia, Ohio, Oregon, Dublin, Tokyo, and Seoul regions.

Given that a p3.2xlarge has a Tesla V100 GPU, costs $3.06 per hour for Linux On-Demand, and provides 15.7 TFLOPS of FP32 performance, the prices of EI Accelerators start to look like a good bargain, all the more so because in inference use cases you most likely will not be able to use the full performance of a GPU.

Demo: Set-up of EI Accelerator on SageMaker & performance tests

In the notebook (link here), I set up two SageMaker endpoints for a ResNet-18 model: one without an EI Accelerator and one with an EI Accelerator. As you can see, adding an EI Accelerator is just a matter of adding a setting to the endpoint configuration.

SageMaker endpoint configuration without EI Accelerator
SageMaker endpoint configuration with EI Accelerator
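For reference, here is a minimal boto3 sketch of what those two endpoint configurations boil down to (illustrative names, not the article's exact notebook code): the only difference between them is the AcceleratorType field.

```python
# Minimal sketch (illustrative names): two endpoint configurations that differ
# only in whether an EI accelerator is attached to the hosting instance.
import boto3

sm = boto3.client("sagemaker")

base_variant = {
    "VariantName": "AllTraffic",
    "ModelName": "resnet-18-classifier",      # placeholder model name
    "InstanceType": "ml.m4.xlarge",
    "InitialInstanceCount": 1,
    "InitialVariantWeight": 1.0,
}

# Endpoint configuration WITHOUT an EI accelerator
sm.create_endpoint_config(
    EndpointConfigName="resnet18-config-no-ei",
    ProductionVariants=[base_variant],
)

# Endpoint configuration WITH an EI accelerator attached to the same instance type
sm.create_endpoint_config(
    EndpointConfigName="resnet18-config-ei",
    ProductionVariants=[{**base_variant, "AcceleratorType": "ml.eia1.large"}],
)
```

If you are using the SageMaker Python SDK instead of boto3, the equivalent is passing accelerator_type='ml.eia1.large' to the model's deploy() call.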

Simple Performance tests:

I then ran performance tests against both SageMaker endpoints and plotted the results. The test involves repeatedly sending an image to both endpoints for prediction at a very rapid pace. Please take a look at the notebook to review the code (git — link).

Across multiple bursts, we can see that latency with the EI Accelerator is consistently better.

  • Model type: Image classification with 18 layers
  • InstanceType: ml.m4.xlarge
  • AcceleratorType: ml.eia1.large
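For readers who want to reproduce this, the following is a minimal sketch of such a latency test (not the exact notebook code; the endpoint names and image path are placeholders): it repeatedly invokes both endpoints with the same image and reports the p50 and mean latency for each.

```python
# Minimal sketch (illustration only): repeatedly invoke two SageMaker endpoints
# with the same image and compare their latency distributions.
import time

import boto3
import numpy as np

runtime = boto3.client("sagemaker-runtime")

def measure_latencies_ms(endpoint_name, payload, n_requests=100):
    """Send the same image n_requests times and return per-call latencies in milliseconds."""
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        runtime.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="application/x-image",   # content type accepted by the built-in image classifier
            Body=payload,
        )
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies

with open("test.jpg", "rb") as f:                # placeholder test image
    image_bytes = f.read()

for name in ["resnet18-endpoint-no-ei", "resnet18-endpoint-ei"]:   # placeholder endpoint names
    results = measure_latencies_ms(name, image_bytes)
    print("%s  p50=%.1f ms  mean=%.1f ms"
          % (name, np.percentile(results, 50), np.mean(results)))
```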

Caution about these performance tests:

The performance tests were launched from a notebook server that I had created, so the latency numbers include a small component of network transit time between the notebook server and the hosting server.

Also, p50 is a better metric than average latency. This is because the average can fluctuate a lot due to outliers.

Practical considerations:

  1. What about the network round trip from the server to the EI Accelerator? Since both the server and the EI Accelerator are in the same AWS region, in most cases this round-trip latency is much smaller than the performance gain achieved by using GPU power through the EI Accelerator.
  2. In what cases might Elastic Inference not give better performance? Presumably, if the model is so small and fast that GPU offloading is not needed, you may see an observable impact from the network round-trip latency.

Is EI a solution for Deep Learning training?

No. It is a purpose-built solution for inference only. For deep learning training, consider using the P3 or P2 instance families, which have local GPUs. In training scenarios, if you design your pipeline well, you can utilize the full power of the local GPUs on the servers, thus avoiding resource wastage.

Conclusion:

Elastic Inference (EI) offers a great performance improvement for deep learning model inference (though the mileage will vary from case to case).

You have the freedom to attach an EI Accelerator to servers of your choice. Thus, you can ensure that your servers as well as your EI Accelerators are fully utilized.

Also, costs are significantly lower than with full GPUs (with full GPUs, most of the processing power would be wasted anyway in inference applications).

Deployment is also very easy.

In the next article in this series, we will review SageMaker Neo. You will find links to more notebooks with examples for TensorFlow and MXNet at the bottom of the page.

Amazon Elastic Inference looks very promising for getting cost-effective GPU power for deep learning inference needs.

