Choosing an AI Processor for Deployment Performance

Toby McClean · Published in The Startup · Jan 26, 2021 · 5 min read

The next decade promises to transform every aspect of our lives. For example, there is a high probability that children born in 2020 and beyond will never learn to drive because of autonomous vehicles. This transformation is happening because of the maturity of three converging technologies: Artificial Intelligence (AI), the Internet of Things (IoT), and 5G.

Together, these technologies allow us to access, process, analyze, store, and move unprecedented volumes of data. More and more of the devices surrounding us will become invisible as they evolve from smart devices into intelligent devices, interconnected from the edge (where the device lives) to the cloud and back.

But we still have to acknowledge the laws of physics: with the increasing volumes of data being produced, we won't be able to send it all to the cloud to be processed and analyzed. Instead, we will have to bring some of the processing and analysis to the device, or closer to it. This is where AI meets Edge Computing, a combination we refer to as Edge AI.

There already exists an ecosystem of innovation around Edge AI, which has shown significant progress in the areas of:

  • Hardware for processing AI models;
  • Software for running AI models; and
  • AI models optimized for the Edge.

But deploying AI models that can take advantage of these innovations is one of the biggest challenges for most industrial or operational solutions. And as the pace of innovation continues to accelerate, producing new hardware, software, and models, the challenge will only become more complex.

A deployed model can be defined as "any unit of code that is seamlessly integrated into a production environment and can take in an input and return an output" (https://towardsdatascience.com/why-is-machine-learning-deployment-hard-443af67493cd).
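To make that definition concrete, the sketch below shows a minimal deployed model: a small unit of code that loads trained weights and exposes a predict function. It assumes ONNX Runtime as the inference engine and a placeholder model file; both are illustrative choices, not a recommendation.

import numpy as np
import onnxruntime as ort  # one common inference engine; an illustrative choice

class DeployedModel:
    """A minimal 'unit of code' that takes an input and returns an output."""

    def __init__(self, model_path: str):
        # Load the trained model once at startup (the path is a placeholder).
        self.session = ort.InferenceSession(model_path)
        self.input_name = self.session.get_inputs()[0].name

    def predict(self, frame: np.ndarray) -> np.ndarray:
        # Score a single preprocessed frame and return the raw output tensor.
        return self.session.run(None, {self.input_name: frame})[0]

# Usage (hypothetical file name):
# model = DeployedModel("inception_v4.onnx")
# scores = model.predict(frame)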

This report focuses on one particular challenge: choosing the Edge AI hardware and software (specifically, the inference engine) that satisfies a solution's or system's requirements. Even with just these two dimensions, there is a large number of possible combinations when building a solution. The AI models are often very deep (especially in Computer Vision solutions), which means that using them for inference takes a lot of computing power. Usually, we want our algorithms to run fast while fitting into a power and cost budget; for many users, that can be an obstacle.

Determining which hardware and software combination is best can take significant time and resources. While we cannot wave a magic wand and declare which Edge AI hardware and software combination is optimal for a given solution, the tools described in this report provide a toolbox that guides the selection, narrowing the process so that the evaluation can focus on the most relevant combinations based on the solution's requirements.

The first tool in our toolbox helps with the selection by examining the performance requirements of the solution. Future tools will help with the selection by looking at the power and cost budget requirements.

There are two views on the performance of the different AI accelerators from Intel and NVIDIA. The first view looks at the maximum number of frames that can be scored per second (Inferred Frames per Second, or IFPS); this tells us the throughput we can achieve for a particular trained model on that processor. The second view looks at the time it takes to process one frame; this tells us the latency we can expect from scoring a frame, which matters because many real-time applications have a total time budget from when a frame is acquired to when action is taken based on what is seen in it.
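As a rough sketch of how these two numbers are obtained in practice, the snippet below times a generic predict callable (such as the wrapper above) over repeated runs; the warm-up and iteration counts are arbitrary choices.

import time
import numpy as np

def benchmark(predict, frame, warmup=10, iterations=100):
    """Measure throughput (IFPS) and mean per-frame latency for a predict callable."""
    for _ in range(warmup):
        predict(frame)  # warm-up runs exclude one-time setup cost from the timings

    latencies = []
    start = time.perf_counter()
    for _ in range(iterations):
        t0 = time.perf_counter()
        predict(frame)
        latencies.append(time.perf_counter() - t0)
    total = time.perf_counter() - start

    ifps = iterations / total               # view 1: inferred frames per second
    latency_ms = 1000 * np.mean(latencies)  # view 2: time to score one frame
    return ifps, latency_ms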

Depending on the solution, it may be throughput, latency, or both that are relevant. And understanding these requirements is an important first step in the selection process.

With the data provided, what questions can be answered to help with the hardware and software selection challenge? Below we walk through several scenarios; they are illustrative examples rather than an exhaustive list.

Scenario 1

The first scenario deals with a classification model, Inception v4, trained on an RTX 2080 Ti. The deployment solution requires that we retain the accuracy achieved during training with a throughput of at least 15 FPS.

Example of IFPS performance data for Inception v4

As you can see from the graphs above, quite a few platforms satisfy the minimum throughput of 15 FPS, which means that the power and cost dimensions must also be considered in this scenario to reduce the number of candidate platforms.
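To show what this narrowing step looks like in code, here is a sketch that filters a table of benchmark results down to the platforms meeting the requirement. The platform names and IFPS numbers are invented placeholders, not measurements.

# Hypothetical benchmark results: platform -> IFPS for Inception v4
results = {
    "Platform A": 42.0,
    "Platform B": 11.5,
    "Platform C": 19.3,
}

required_ifps = 15.0  # Scenario 1 throughput requirement

# Keep only the platforms that meet the throughput requirement; these
# candidates then move on to the power and cost evaluation.
candidates = {name: fps for name, fps in results.items() if fps >= required_ifps}
print(candidates)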

Scenario 2

The second scenario examines the throughput impact on MobileNet v2 (batch size of 1, FP32 precision) of switching the software from TensorFlow to NVIDIA TensorRT.

Example of IFPS performance data for MobileNet v2 using a batch size of 1 and FP32 precision

In summary, except for the TX1, switching from TensorFlow to TensorRT at least doubles the throughput for MobileNet v2.
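For reference, one common way to make this switch is NVIDIA's TF-TRT integration, which converts a TensorFlow SavedModel so that supported subgraphs run on TensorRT. The sketch below assumes the TF-TRT API as shipped around TensorFlow 2.4 and uses placeholder model directories.

from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Convert a TensorFlow SavedModel so that supported subgraphs run on TensorRT,
# keeping FP32 precision as in Scenario 2 (directory names are placeholders).
params = trt.DEFAULT_TRT_CONVERSION_PARAMS._replace(
    precision_mode=trt.TrtPrecisionMode.FP32)
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="mobilenet_v2_saved_model",
    conversion_params=params)
converter.convert()
converter.save("mobilenet_v2_trt")  # benchmark this model to measure the gain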

Scenario 3

The third scenario examines the impact of batch size on latency for AlexNet v2 on a P2000.

AlexNet v2 running on a P2000

By using a batch size of 8 instead of 1, we can achieve a 3.5 times reduction in per-frame latency.
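Here is a sketch of how this effect can be measured, again assuming a generic predict callable that accepts a batch of frames; the per-frame latency is simply the batch time divided by the batch size, and the input shape is a hypothetical one.

import time
import numpy as np

def per_frame_latency_ms(predict, batch, iterations=100):
    """Mean per-frame latency when scoring len(batch) frames at a time."""
    predict(batch)  # warm-up run
    t0 = time.perf_counter()
    for _ in range(iterations):
        predict(batch)
    elapsed = time.perf_counter() - t0
    return 1000 * elapsed / (iterations * len(batch))

# Hypothetical 224x224 RGB input: compare batch sizes 1 and 8.
single = np.random.rand(1, 224, 224, 3).astype(np.float32)
batch8 = np.random.rand(8, 224, 224, 3).astype(np.float32)
# print(per_frame_latency_ms(model.predict, single))
# print(per_frame_latency_ms(model.predict, batch8))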

As these scenarios show, the tools described here can quickly narrow the hardware and software combinations to the ones most relevant to your solution's requirements, leaving a focused evaluation of the remaining candidates.
