Faster, Cheaper, Leaner: improving real-time ML inference using Apache MXNet

Olivier Cruchant
Published in Apache MXNet
Jan 7, 2020

Machine learning models are data transformations learned over empirical examples. While a lot of research and tooling is now available for properly training models, deploying models for inference (the phase during which the trained model artifact is exposed and put to use) is a lesser-known domain that is developing fast as more and more use-cases ship to production. From a more holistic standpoint, inference can also account for up to 90% of the compute costs of a deployed machine learning application, since it is the phase during which the model is actually used by end users or downstream systems. In that regard, it deserves specific attention.

While reducing inference footprint and its cost is a paramount concern for model owners, inference speed is also a common priority in use-cases requiring real-time model answers over dynamic streams of data, such as fraud detection, advertising, transit time prediction or production line monitoring. In this post, we go over the possible optimization techniques that contribute to reducing model inference latency, hardware footprint and costs.

It is exciting to see the fast development of commercial services and open-source tooling supporting inference optimization. While the sheer volume of innovation in the space can sometimes feel overwhelming, a simple mental model for understanding those optimization techniques and tricks is to cluster them along three orthogonal dimensions: (1) hardware, (2) software and (3) the scientific logic of the deployed algorithm. Finally, we will also discuss architectural considerations that can contribute to further speed and cost optimizations, such as caching, serverless deployment and model locality.

The three dimensions of inference optimization. Hardware and software optimizations usually do not disturb model accuracy. On the other hand, changes to the model algorithm usually alter its accuracy, hence the dashed line in the figure above
1. Hardware optimizations

A usually straightforward technique to reduce model latency is to use an appropriate hardware back-end. You can turn this dial in two directions:

Horizontal scaling - There is usually a bidirectional causal relationship between server load and response latency: latency degrades as server load increases, and server load builds up when the model takes longer to run. To keep latency predictably low, increasing the number of server nodes is a first technique, which can be applied manually but also programmatically. For example, the Autoscaling feature of Amazon SageMaker endpoints enables you to configure scale-out and scale-in policies based on a metric of your choice, typically the pre-defined Amazon CloudWatch metric InvocationsPerInstance. Appropriate thresholds for the scaling policy can be determined via load testing, a best practice that can be part of the model deployment cycle. Note that some forms of load testing qualify for the Amazon EC2 Testing Policy. An AWS ML Blog post presents the details of SageMaker Autoscaling and endpoint load testing.
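
As an illustration, the sketch below uses boto3 to register a SageMaker endpoint production variant as a scalable target and attach a target-tracking policy on the pre-defined InvocationsPerInstance metric. The endpoint name, variant name, capacity bounds and target value are placeholder assumptions to replace with values derived from your own load tests.

```python
import boto3

# Hypothetical endpoint and variant names - replace with your own
endpoint_name = "my-mxnet-endpoint"
variant_name = "AllTraffic"
resource_id = f"endpoint/{endpoint_name}/variant/{variant_name}"

autoscaling = boto3.client("application-autoscaling")

# Register the endpoint variant as a scalable target (1 to 4 instances here)
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Target-tracking policy on the pre-defined InvocationsPerInstance metric;
# the target value (invocations per instance per minute) should come from load testing
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 1000.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleOutCooldown": 60,
        "ScaleInCooldown": 300,
    },
)
```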

Vertical scaling & hardware specialization - Another axis of hardware back-end improvement is to right-size the endpoint instances and choose an appropriate compute host type and AI accelerator, if relevant to the model. The term AI accelerator usually refers to the class of microprocessors designed for AI applications. NVIDIA GPUs are a common option for AI acceleration, most notably the recent Volta and Turing architectures that feature Tensor Cores, specialized units for mixed-precision matrix multiply-accumulate, found respectively in the P3 and G4 families of Amazon EC2 instances. Other AI accelerators are also being made available, such as Amazon Elastic Inference (EIA) and AWS Inferentia. Every model, ML framework and serving stack has its sweet spot when it comes to choosing the right mix of RAM, CPU, accelerator memory and accelerator compute capacity. Appropriate load testing will reveal which instance type is most appropriate for your use-case; the use of AI accelerators, such as GPUs or Amazon Elastic Inference, usually speeds up inference.

Model latency of identical Apache MXNet ResNet18 artifacts deployed in Amazon SageMaker in 3 different endpoints in region eu-west-1: in green, a single-instance ml.c5.2xlarge endpoint ($0.538/h); in orange, a single-instance ml.p3.2xlarge endpoint ($4.627/h); in blue, a single-instance ml.t2.medium + ml.eia1.medium Elastic Inference Accelerator device ($0.07/h + $0.196/h = $0.266/h). All endpoints receive identical traffic of identical pictures sent at a rate of 26 pictures/minute. Average latency on the EIA-accelerated endpoint is ~55ms, about 28% faster than the full-CPU endpoint, while 51% cheaper! A full GPU endpoint gives the best latency in that situation (~33ms), but costs 17 times the hourly price (+1,639%) of the EIA-accelerated endpoint.

An interesting comparison was recently made in the AWS ML Blog by an AWS customer, Curalate, comparing inference of various vision models on an AWS Lambda back-end (serverless, CPU) vs Amazon EC2 with Amazon Elastic Inference acceleration. Researchers from Curalate successfully ran Apache MXNet vision classifier inference in AWS Lambda, and found, for various models, traffic thresholds below which an AWS Lambda deployment would be cheaper than using always-on EC2. For example, they measured that in their setup, a ResNet 152 deployment on Lambda would be cheaper than an EC2 + Amazon Elastic Inference-based deployment for traffic up to 7.5MM calls/month. They also mention that leveraging Amazon Elastic Inference led to hosting cost savings between 35% and 65% compared to regular Amazon EC2 GPU instances.

2. Software optimizations

Another axis of improvement, separated from the hardware and algorithm considerations, is to work on the software stack, itself composed of two sub-parts — server and compilation.

The model server is tasked with loading one or several models into memory and exposing them through a web service. Different servers handle load at different levels of performance and provide various levels of convenience. A common trick to maximize hosting hardware utilization is to leverage batching: for example, in some situations, instead of calling your server 100 times with 100 individual requests, you can call your server once with a batch of 100 requests. Some ML problems, such as ad scoring or recommendation, naturally lend themselves to batching, since you can for example score the relevancy of all products in a catalog in one model call. Specialized deep learning servers such as TensorFlow Serving, TensorRT Inference Server or Multi Model Server (MMS) are able to accumulate incoming requests into batches to improve GPU parallelism, a practice usually called micro-batching. While micro-batching can improve inference economics, it usually comes at the expense of latency, since incoming requests wait a varying amount of time while the micro-batch is assembled. This logic is handled by the server, and servers with micro-batching enabled usually come with controls to tune the maximum acceptable latency and micro-batch size. While MMS is not specific to Apache MXNet, it was built with strong native integration with Apache MXNet and features a number of functionalities specific to ML model serving, such as logging, model pre-loading, micro-batching and multi-model serving.
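
As a rough local illustration of why batching helps, the sketch below compares 100 individual forward passes to a single batched forward pass on a pre-trained ResNet18 from the Gluon model zoo. The exact speed-up depends on hardware, model and MXNet version.

```python
import time
import mxnet as mx
from mxnet.gluon.model_zoo.vision import resnet18_v1

ctx = mx.cpu()
net = resnet18_v1(pretrained=True, ctx=ctx)
net.hybridize(static_alloc=True)

single = mx.nd.random.uniform(shape=(1, 3, 224, 224), ctx=ctx)
batch = mx.nd.random.uniform(shape=(100, 3, 224, 224), ctx=ctx)

# Warm up once so that graph construction and memory allocation are not measured
net(single).wait_to_read()

start = time.time()
for _ in range(100):
    net(single).wait_to_read()   # 100 individual requests
print("100 single calls: %.1f ms" % ((time.time() - start) * 1000))

start = time.time()
net(batch).wait_to_read()        # one batched request carrying 100 inputs
print("1 batch of 100:   %.1f ms" % ((time.time() - start) * 1000))
```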

Another axis of software-based improvement is model compilation, a post-training step that consists of creating a model artifact and a model execution runtime finely adapted to the underlying hardware. This leverages the fact that when you know where you are going to deploy (e.g., NVIDIA, Intel, ARM), you have an insider edge and can refine the model graph and inference runtime to better exploit available low-level specifics. This can reduce memory consumption and latency by double-digit percentages, and is an active area of ML research, as illustrated by the papers mentioned in the rest of this paragraph. A popular, flexible and high-performance option for compilation is the Tensor Virtual Machine (TVM, Chen et al, paper, documentation), which has demonstrated compilation from a large number of frameworks (MXNet, Keras, Caffe2, CoreML, TensorFlow,…). Other notable compilers are, for example, treelite (paper, doc) for decision tree compilation, and the Glow compilation engine for PyTorch. Interestingly, both treelite and TVM are dependencies of the AWS compiler Amazon SageMaker Neo, which provides a managed compilation experience.
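
For illustration, the sketch below compiles a pre-trained MXNet ResNet18 with TVM for a specific CPU target and runs it with TVM's graph runtime. It follows the Relay API available around TVM 0.6; exact function names and the compilation output format vary across TVM releases, and the CPU target string is an assumption to adapt to your hardware.

```python
import numpy as np
import mxnet as mx
import tvm
from tvm import relay
from tvm.contrib import graph_runtime
from mxnet.gluon.model_zoo.vision import resnet18_v1

# Import a pre-trained MXNet model into Relay, TVM's intermediate representation
block = resnet18_v1(pretrained=True)
shape_dict = {"data": (1, 3, 224, 224)}
mod, params = relay.frontend.from_mxnet(block, shape_dict)

# Compile for a specific CPU target (here an AVX-512 capable Intel Xeon)
target = "llvm -mcpu=skylake-avx512"
with relay.build_config(opt_level=3):
    graph, lib, params = relay.build(mod, target, params=params)

# Run the compiled artifact with TVM's lightweight graph runtime
module = graph_runtime.create(graph, lib, tvm.cpu())
module.set_input("data", np.random.uniform(size=(1, 3, 224, 224)).astype("float32"))
module.set_input(**params)
module.run()
prediction = module.get_output(0).asnumpy()
```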

CPU and GPU benchmarks of TVM. Source: “TVM: An Automated End-to-End Optimizing Compiler for Deep Learning”, Chen et al. https://arxiv.org/abs/1802.04799 Last revised 5 Oct 2018.

One benefit of using Amazon SageMaker Neo is that it is compatible with the most popular frameworks (MXNet, TensorFlow, XGBoost, PyTorch) and hardware back-ends (Intel, NVIDIA, ARM and others), and that it chooses on your behalf the underlying compiler tricks most appropriate for your combination of hardware target, framework and model.
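
As a hedged example of that managed experience, the snippet below launches a Neo compilation job through boto3 for an MXNet model stored in Amazon S3, targeting ml.c5 instances. All names, the IAM role ARN and the S3 paths are placeholders.

```python
import boto3

sm = boto3.client("sagemaker")

# All names, ARNs and S3 paths below are placeholders
sm.create_compilation_job(
    CompilationJobName="resnet18-neo-compilation",
    RoleArn="arn:aws:iam::123456789012:role/MySageMakerRole",
    InputConfig={
        "S3Uri": "s3://my-bucket/models/resnet18/model.tar.gz",
        "DataInputConfig": '{"data": [1, 3, 224, 224]}',
        "Framework": "MXNET",
    },
    OutputConfig={
        "S3OutputLocation": "s3://my-bucket/models/resnet18-compiled/",
        "TargetDevice": "ml_c5",
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)
```
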
Model quantization refers to reducing the number of bits representing model weights. It is usually available as part of a compilation stack or as a model framework component (examples of quantization functions in TVM, TensorFlow Lite, Apache MXNet or NVIDIA TensorRT). Quantization aims to reduce model storage footprint and computation time. Another benefit is that it can leverage specific low-precision optimizations, such as the Vector Neural Network Instructions (VNNI) on modern Intel Xeon Scalable CPUs. Quantization is most commonly run post-training, with an optional calibration step during which the model sees sample data so that the ranges to be quantized can be properly estimated. Quantization may also be run in-training, a concept named quantization-aware training that involves directly training a quantized representation. At the time of this writing, this still appears to be at an active research stage and is a less popular option than post-training quantization. Notably, in summer 2019 the TensorFlow team encouraged in a blog post the use of post-training quantization over quantization-aware training.
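
As a sketch of post-training quantization in Apache MXNet, the snippet below uses mxnet.contrib.quantization.quantize_model to produce an INT8 model with entropy calibration. The checkpoint prefix is a placeholder, the random calibration data is only there to keep the example self-contained (real, representative samples should be used), and the exact arguments of quantize_model vary between MXNet releases.

```python
import mxnet as mx
from mxnet.contrib.quantization import quantize_model

# Load a trained symbolic model (placeholder prefix and epoch)
sym, arg_params, aux_params = mx.model.load_checkpoint("resnet18", 0)

# Calibration data: should be real, representative samples; random data is
# used here only to keep the sketch self-contained
calib_data = mx.io.NDArrayIter(
    data=mx.nd.random.uniform(shape=(64, 3, 224, 224)),
    label=mx.nd.zeros((64,)),
    batch_size=8,
)

# Post-training INT8 quantization with entropy calibration
qsym, qarg_params, aux_params = quantize_model(
    sym=sym,
    arg_params=arg_params,
    aux_params=aux_params,
    ctx=mx.cpu(),
    calib_mode="entropy",
    calib_data=calib_data,
    num_calib_examples=64,
    quantized_dtype="int8",
)

# Save the quantized artifact for deployment
mx.model.save_checkpoint("resnet18-int8", 0, qsym, qarg_params, aux_params)
```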

3. Algorithm optimizations

On a given hardware and software platform, model inference latency and hardware footprint will vary significantly with the algorithmic logic of the model. If speed, throughput or costs are your priorities over accuracy, and if you have already exploited all possible tricks along the hardware and software axes above, consider using model architectures adapted to those goals, for example the lighter models featured in the GluonCV model zoos shown below.

Be aware that unlike the hardware and software optimization axes, changing the model algorithm can significantly alter accuracy.

The GluonCV Classification Zoo (top) and Detection Zoo (bottom) give a great display of the accuracy vs throughput trade-off. GluonCV (https://gluon-cv.mxnet.io/) is a compact Python toolkit providing access to state-of-the-art vision models and tools for a number of computer vision paradigms. (Guo et al)
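
For example, the short sketch below loads a heavier and a lighter architecture from the GluonCV model zoo and times single-image inference on CPU; the two model names are just illustrative picks from the zoo, and absolute numbers will depend on your hardware.

```python
import time
import mxnet as mx
from gluoncv.model_zoo import get_model

x = mx.nd.random.uniform(shape=(1, 3, 224, 224))

# Compare a heavier, more accurate backbone with a lighter, faster one,
# reflecting the accuracy vs throughput trade-off plotted above
for name in ["resnet50_v2", "mobilenetv2_0.25"]:
    net = get_model(name, pretrained=True)
    net.hybridize(static_alloc=True)
    net(x).wait_to_read()                     # warm-up pass
    start = time.time()
    for _ in range(20):
        net(x).wait_to_read()
    print("%s: %.1f ms/inference" % (name, (time.time() - start) / 20 * 1000))
```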

4. System Architecture Considerations

Beyond improving the model inference functionality, other architectural concepts can contribute to reducing latency and costs of an end-to-end real-time machine learning prediction pipeline:

  • Caching: if you expect a high volume of prediction requests with identical input, it is worth considering putting a cache layer in front of the model. A cache can store model input-prediction pairs for an appropriate time-to-live, so that a prediction request over an input that has already been seen can be fetched from the cache at a lower latency and cost than re-running the model prediction (see the sketch after this list). Typical options for this layer are fast key-value stores such as Amazon DynamoDB and Redis.
  • Locality: placing the model as close as possible to the point where the prediction is consumed can contribute to significant latency improvement. For example, running one model.predict() call on a vector of 127 records with the Scikit-Learn random forest regressor of the public SageMaker Scikit-Learn integration demo notebook takes about 9ms of local execution time on a single-machine ml.c5.large Amazon SageMaker endpoint, but takes 25ms when model serving, serialization/de-serialization and network communication time are taken into account (from the endpoint to a client instance making the call from within the same region and the same account; both numbers are averages over 10 invocations spaced by multiple seconds). The footprint optimizations mentioned in sections 1, 2 and 3 above are especially relevant when models are deployed on edge hardware with limited power supply and computing capacity.
  • Serverless deployment: for models facing unpredictable traffic and not requiring a GPU, a serverless compute engine - for example AWS Lambda - can be a relevant inference platform. The pay-per-inference pricing makes it economical for moderate amounts of traffic. It also comes with the benefit of simplifying the deployment stack by removing the model server layer. Several blog posts show its successful use for inference of deep MXNet models:
  • Serving deep learning at Curalate with Apache MXNet, AWS Lambda, and Amazon Elastic Inference
  • Build, test, and deploy your Amazon Sagemaker inference models to AWS Lambda
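
Below is a minimal in-memory sketch of the caching idea mentioned above; the PredictionCache class and its TTL handling are hypothetical illustrations, and a production deployment would typically back the store with Redis or Amazon DynamoDB rather than a Python dictionary.

```python
import hashlib
import json
import time

class PredictionCache:
    """Minimal in-memory TTL cache placed in front of a model's predict function."""

    def __init__(self, predict_fn, ttl_seconds=300):
        self.predict_fn = predict_fn
        self.ttl = ttl_seconds
        self.store = {}  # key -> (timestamp, prediction)

    def _key(self, model_input):
        # Hash a canonical JSON form of the input to get a compact cache key
        return hashlib.sha256(
            json.dumps(model_input, sort_keys=True).encode()
        ).hexdigest()

    def predict(self, model_input):
        key = self._key(model_input)
        cached = self.store.get(key)
        if cached and time.time() - cached[0] < self.ttl:
            return cached[1]                  # cache hit: skip the model call
        prediction = self.predict_fn(model_input)
        self.store[key] = (time.time(), prediction)
        return prediction

# Usage sketch: cache = PredictionCache(model.predict, ttl_seconds=60); cache.predict(x)
```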

In summary, there are multiple axes to explore to reduce the latency, cost and hardware footprint of your real-time model deployments, and many of those ideas are orthogonal and can be combined. Apache MXNet provides a growing list of tools supporting all flavors of software-based inference improvement, from model architectures to web serving, including compilation and quantization. Please do not hesitate to contribute to those projects and to reach out to the community on the forum at discuss.mxnet.io!

Many thanks to Jonathan Taws, Chris Fregly, Will Badr, Tanya Roostaeyan, Madhuri Peri, Frank Liu, Jonathan Chung and Nikolay Ulmasov, who provided helpful reviews on the topic.
