How to Choose the Right Deep Learning Framework for Your Deployment: A Guide

Cyrus Behroozi
Mar 31

Machine learning algorithms have become pervasive. Anywhere you turn, you will find some form of machine learning being used to solve complex problems: from big data analytics to weather prediction to face recognition. With this rise of machine learning, the number of deep learning frameworks (tools that abstract away the math and statistics involved in machine learning) available to developers has also increased. With so many options, choosing the right framework for your deployment needs can be difficult.

In this tutorial, we will begin by discussing the important metrics to consider when choosing an ML framework. Then we will dive into each framework and share the results of our own benchmark tests to help you decide which framework is best for your specific use case.

There are two main classes of frameworks: those that are feature-rich and optimized for training, and those that are fast, lightweight, and optimized for inference. Today, we will report measured benchmarks for several of the CPU inference-optimized frameworks. For the uninitiated, here is a brief summary of the difference between training and inference:

Training: Training refers to the process of teaching a model to learn from the data it sees.

Inference: Inference refers to the process of using a trained machine learning algorithm to make a prediction.

In a production deployment, the aim is to maximize inference speed while minimizing resource usage. The various inference frameworks have different backend implementations, so it is necessary to benchmark them to determine which is right for your deployment requirements.

Benchmark Metrics

With CPU inference frameworks, there are several metrics one must consider when selecting a framework:

Latency

Latency refers to the time taken to process one unit of data, provided only one unit of data is processed at a time. In simpler terms, latency is the time you have to wait to get the inference result. Low latency is critical when designing real-time systems. For example, a system processing a video stream at 30 FPS must have a latency of less than 33.3 ms to keep up in real time. This is often a challenge because more accurate models are generally larger and have more parameters, which increases the inference time. Since CPUs have low-latency cores and ultra-fast caches, we generally aim to minimize latency rather than maximize throughput on CPUs (more on throughput below).
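To make the real-time budget concrete, here is a minimal sketch in Python that times a single forward pass and compares it against the 33.3 ms frame budget of a 30 FPS stream. The run_inference function is a hypothetical stand-in for whichever framework you are evaluating.

```python
import time

import numpy as np


def run_inference(image: np.ndarray) -> np.ndarray:
    # Placeholder for a single forward pass through the framework under test
    # (ONNX Runtime, OpenVINO, ncnn, ...). Swap in your real inference call here.
    return image.mean(axis=(0, 1))


image = np.random.rand(112, 112, 3).astype(np.float32)  # one decoded RGB frame

start = time.perf_counter()
_ = run_inference(image)
latency_ms = (time.perf_counter() - start) * 1000.0

frame_budget_ms = 1000.0 / 30  # ~33.3 ms per frame at 30 FPS
print(f"latency: {latency_ms:.2f} ms, real-time capable: {latency_ms < frame_budget_ms}")
```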

Throughput

Throughput refers to the number of data units processed per unit of time. With respect to our video processing example, it would be the number of video frames we could process in a fixed amount of time with a batch size greater than 1. Batch size refers to the number of samples (or images, in our example) that can be processed at the same time. We generally optimize for throughput on GPUs because they are massively parallel devices: a GPU core on its own is quite slow, but modern GPUs have thousands of cores capable of running in parallel. Here is a good resource if you want to learn why we optimize for throughput on GPUs instead of CPUs. Despite this, there are still ways to optimize for throughput on CPU that do not rely on batching, including running multiple inference instances in parallel while reducing the number of CPU threads each individual instance uses. More on this below.

Memory Usage

Memory usage refers to the amount of memory or RAM used while running inference. We must be mindful of memory usage in runtime environments with limited memory, such as embedded devices. Additionally, we don't want our framework to hog all the memory; we need to ensure some is left for other tasks in our pipeline (e.g. video decoding). This metric is particularly important if we plan on running multiple inference instances in parallel, as each instance will generally not re-use the same memory.
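If you want to track this metric yourself, one rough approach (sketched below in Python for Linux, with the model loading left as a placeholder) is to report the process's peak resident set size, which is also how the memory numbers later in this post are defined.

```python
import resource

# ... load the model and run inference for a while here ...

# On Linux, ru_maxrss is the peak resident set size of this process in kilobytes.
peak_kb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print(f"peak memory usage: {peak_kb / 1024.0 ** 2:.2f} GB")
```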

Thread Count

Thread count is closely related to latency and throughput. Inference frameworks generally do not run inference using only a single thread. Instead, they launch several threads to run different operations in parallel, consequently reducing the latency. In general, the more threads we use for inference (particularly with heavy models such as a ResNet100), the lower the latency. However, there is a limit: eventually, the overhead of adding and managing new threads outweighs the speedup they provide. Below, I've added a latency vs. thread count chart for one of the machine learning pipelines we use at Trueface:

As can be seen, the curve follows a roughly exponential decay, with the greatest reduction in latency experienced when moving from 1 thread to 2 threads. The significance of this is that we can deliberately enforce a reduced thread count in order to increase CPU throughput. Consider an example where we have a CPU with 8 threads.

Scenario 1: Latency Optimized: In this scenario, we have 1 instance running inference using all 8 threads. Using the chart above, we can approximate the latency to be 75 ms: given a single input image, inference can be performed in 75 ms. This scheme therefore reduces latency as much as possible. However, if the instance is provided with 100 input images to process, it will take a total of 7.5 s (100 images × 75 ms) to run inference.

Scenario 2: Throughput Optimized: In this scenario, we have 8 instances running inference using only 1 thread each. Using the chart above, we can approximate the latency to be 400 ms: given a single input image, inference will be performed in 400 ms. Although this seems like a bad tradeoff compared to scenario 1, scenario 2 shines when we have many input samples. If the instances are provided with 100 input images to process, it will take a total of 5 s (100 images × 400 ms / 8 instances) to run inference.

By running more instances in parallel and reducing the number of threads per instance, scenario 2 increases the per-image latency but also increases the overall throughput. Inference frameworks allow the developer to dictate the number of threads used for inference, so thread count is an important parameter to track.
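As one concrete illustration of scenario 2, here is a hedged sketch using ONNX Runtime's Python API, where SessionOptions.intra_op_num_threads controls the per-instance thread count. The model path and the 8-way process split are assumptions for this example, not part of the benchmark code itself.

```python
import multiprocessing as mp

import numpy as np
import onnxruntime as ort

MODEL_PATH = "resnet100.onnx"  # hypothetical path to the model under test


def worker(images):
    # Each instance is restricted to a single intra-op thread, trading
    # per-image latency for overall throughput across instances.
    opts = ort.SessionOptions()
    opts.intra_op_num_threads = 1
    session = ort.InferenceSession(MODEL_PATH, opts)
    input_name = session.get_inputs()[0].name
    return [session.run(None, {input_name: img}) for img in images]


if __name__ == "__main__":
    batch = [np.random.rand(1, 3, 112, 112).astype(np.float32) for _ in range(100)]
    num_instances = 8
    # Split the 100 images across 8 single-threaded worker processes.
    chunks = [batch[i::num_instances] for i in range(num_instances)]
    with mp.Pool(num_instances) as pool:
        results = pool.map(worker, chunks)
```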

Ultimately, whether to optimize for latency or throughput depends on the use case. For one-off inference, such as using facial recognition for identity verification on a mobile device, you would want to optimize for latency so that the user gets feedback as quickly as possible. For a use case such as scanning hundreds of hours of recorded video footage to find an individual of interest, optimizing for throughput is preferred.

Dependencies

Dependencies are an important consideration if the software is to be shipped. At Trueface, we ship our inference code as part of a dependency-free compiled library that can run directly on the OS without requiring additional library installations. We must therefore be mindful of any dependencies incurred by adding an inference framework.

Benchmark Test Setup

The following benchmarks were run on an unloaded dual Intel Xeon CPU E5-2630 v4 @ 2.20GHz machine (for those familiar with NIST FRVT, this is the same CPU used for their timing tests), equipped with 128 GB of RAM. The model used for inference is a ResNet100 with 65.13 million parameters and 24.2 GFLOPs. The model input is a 112x112 RGB aligned face chip. The time it takes to read the image into memory and decode it to an RGB image buffer is not included in the reported inference time. Any preprocessing required after this point to convert the decoded buffer to the expected framework-specific input is included in the inference time (including HWC to CHW conversions). The first inference time is discarded to ensure all network weights and parameters have been loaded. The number of inferences used to generate the average reported time is on the order of 1000.
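For reference, a simplified sketch of this timing protocol is shown below. The run_inference callable is hypothetical, and the exact preprocessing will depend on the framework under test.

```python
import time

import numpy as np


def benchmark(run_inference, chip_hwc, iterations=1000):
    """Rough sketch of the timing methodology described above.

    `run_inference` is a hypothetical callable wrapping whichever framework
    is under test; `chip_hwc` is a decoded 112x112x3 RGB face chip.
    """
    timings = []
    for i in range(iterations + 1):
        start = time.perf_counter()
        # Preprocessing after decode (e.g. the HWC -> CHW transpose) is
        # counted as part of the inference time.
        chip_chw = np.transpose(chip_hwc, (2, 0, 1))[np.newaxis].astype(np.float32)
        run_inference(chip_chw)
        elapsed = time.perf_counter() - start
        if i == 0:
            continue  # discard the first run so weight loading / warm-up is excluded
        timings.append(elapsed)
    return 1000.0 * sum(timings) / len(timings)  # average latency in ms
```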

Frameworks

In this benchmark test, we will compare the performance of four popular inference frameworks: MXNet, ncnn, ONNX Runtime, and OpenVINO. Before diving into the results, it is worth spending time to understand the contending frameworks.

MXNet

MXNet is generally not considered a high-performance inference framework for deployment, but rather a training framework; the library is feature-rich for training but consequently carries a lot of bloat for a deployment setting. MXNet was included in this benchmark because the model used was trained with it, so it serves as a good baseline measurement. In the benchmark, we use the Intel Math Kernel Library for Deep Neural Networks (Intel MKL-DNN) backend, which takes advantage of Advanced Vector Extensions 2 (AVX2) to perform Single Instruction, Multiple Data (SIMD) operations.
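For context, loading and running such an MXNet symbol/params checkpoint on CPU might look roughly like the sketch below. The file names are placeholders, and an MKL-DNN-enabled build (for example, the mxnet-mkl pip package) is assumed.

```python
import mxnet as mx
import numpy as np
from mxnet.gluon import SymbolBlock

# Load a trained symbol/params checkpoint and bind it for CPU inference.
net = SymbolBlock.imports("model-symbol.json", ["data"], "model-0000.params", ctx=mx.cpu())

chip = mx.nd.array(np.random.rand(1, 3, 112, 112).astype(np.float32))
embedding = net(chip)
embedding.wait_to_read()  # MXNet is asynchronous; block until the result is ready
```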

ncnn

This small and lesser-known library is powerful because, unlike the other frameworks, it is built as a static library and is dependency-free (other than requiring OpenMP). This makes it extremely easy to ship as part of your software. Although ncnn has been optimized for mobile devices and ARM CPUs, it still boasts impressive results on x86_64 CPUs and keeps up with some of the larger frameworks. ncnn supports runtime CPU dispatching for AVX2 code paths.

ONNX Runtime

ONNX Runtime, developed by Microsoft, offers the most backend acceleration options of any framework, including MKL-DNN, OpenVINO, and TensorRT (NVIDIA's CUDA-based GPU acceleration framework). It also supports different levels of model optimization. For this benchmark, we use the default backend with the maximum graph optimization level enabled. With respect to deployment, the pre-compiled library can be conveniently downloaded and linked dynamically at runtime. To learn more about the benefits of ONNX, check out this blog post written by one of my colleagues, a machine learning engineer at Trueface.
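In ONNX Runtime's Python API, the configuration used here roughly corresponds to the following sketch (the model path is a placeholder):

```python
import onnxruntime as ort

# Default CPU execution provider with graph optimizations set to the maximum level.
opts = ort.SessionOptions()
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
session = ort.InferenceSession("resnet100.onnx", opts)
```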

OpenVINO

Developed by the Intel team, this framework specifically targets Intel hardware. It can run inference not only on CPUs, but also on specialized devices such as Intel GPUs, Intel VPUs (the Neural Compute Stick), and even Intel FPGAs. Although the library does have several runtime dependencies, they are easy enough to install on Ubuntu using the apt package manager and the bundled dependency installation scripts. Be mindful that in order to run inference with OpenVINO, the model must be converted to an Intermediate Representation (IR) format, which may cause a slight difference in accuracy from the original model. For this benchmark, we use the MKL-DNN backend plugin.
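A minimal sketch of running an IR model with the Inference Engine Python API follows. The file names are placeholders, and the conversion to IR via OpenVINO's Model Optimizer is assumed to have been done beforehand.

```python
import numpy as np
from openvino.inference_engine import IECore

# Load the converted IR model (model.xml / model.bin) onto the CPU plugin.
ie = IECore()
net = ie.read_network(model="model.xml", weights="model.bin")
input_name = next(iter(net.input_info))
exec_net = ie.load_network(network=net, device_name="CPU")

chip = np.random.rand(1, 3, 112, 112).astype(np.float32)  # NCHW face chip
result = exec_net.infer(inputs={input_name: chip})
```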

Results

You can find the full inference code, library build scripts, and comprehensive results in the following GitHub repository:

The following graph plots the latency against the number of threads used:

Latency vs number of threads

The following bar chart illustrates the latency of the contending frameworks when we focus on 8 threads:

Latency at 8 threads

Both charts above show that OpenVINO had the lowest latency at all thread counts. Additionally, OpenVINO could be used to achieve the highest throughput. With respect to our examples above, OpenVINO performed the best in both scenario 1 and scenario 2 and would thus be the correct choice of framework.

The following chart summarizes the memory usage, reported as the maximum resident set size:

Memory usage in GB

ncnn used significantly more memory than the other frameworks. At the time of writing this blog, I opened an issue on the ncnn GitHub page to see if the library developers had any comments on this. You can follow the progress of that issue here.

Limitations

  1. The benchmarks above were run on an Intel CPU. Unfortunately, we cannot assume that inference framework performance, or the relative rankings, will be the same on AMD CPUs. Intel has admitted to throttling some of its libraries, including the Math Kernel Library (MKL), on non-Intel CPUs. I ran a few tests on AMD CPUs (reported at the bottom of the linked GitHub repository), but further investigation on non-Intel CPUs must be performed.
  2. ONNX Runtime and OpenVINO both modify the model, so its output and accuracy differ slightly from the original. Although the loss in accuracy appeared to be minimal during my testing, a more comprehensive test with a large dataset must be performed to quantify the exact loss in accuracy (perhaps by generating a Detection Error Tradeoff curve).

Conclusion

We have defined the metrics that are critical to understand when deciding on an inference framework. As stated previously, in a production deployment the aim is to maximize inference speed while minimizing resource usage. We have also outlined the differences between latency and throughput and noted why each is important when choosing a framework. Finally, we covered the popular inference frameworks before putting them through our own benchmark testing.

Congratulations! If you made it this far, you are well equipped to prioritize certain metrics and choose an ML framework that is best for your use case.
