From Jupyter to Earth: An Example of a Real-World ML Project Using TensorRT
We often wonder how Machine Learning works when it is used on an industrial scale. Is it the same code that we run in our Jupyter notebook? Do we need to install Python and TensorFlow on every machine that is supposed to run this code? To find out, I decided to build a project that was industry ready: something that could simply be downloaded, fed data, and asked for a prediction, no matter the machine or its configuration.
The goal was to create a simple classification model and convert it into something that can be run on any computer without writing a single line of code. To do that, the model would need to be packaged as machine-level code, that is, an executable binary file.
Fortunately, Nvidia provides APIs for this through TensorRT. TensorRT is a high-performance neural network inference optimizer and runtime for production deployment. The optimized model it produces is called an engine, referred to here as the CUDA Engine. TensorRT optimizes the network by combining layers and optimizing kernel selection for improved latency, throughput, power efficiency, and memory consumption.
There are two possible ways to create a CUDA Engine: the Python API (not available on Windows at the time of writing) and the C++ API. Before choosing an API, we need to understand the difference between Python and C++. While Python is a very user-friendly language, it has one major drawback compared to C++: latency.
Latency is not given much importance when training a model in a Jupyter notebook. All that we care about is accuracy. But in the real world, latency is a critical factor. Take a simple example: a self-driving car. The car is driven on streets and has to make thousands of split-second decisions based on ever-changing parameters. In such cases accuracy is not the only concern; a slight delay in the model can have catastrophic results. A plain Python model is therefore not well suited to this task. Although C++ is not a very user-friendly language, it gives us the efficiency we need to address latency. So, to get the best of both worlds, we train a model in Python with TensorFlow and then create an engine based on that model in C++, which gives us the speed that we require.
1. Create a simple classification model:
First, I created a simple Deep Learning classification model using the standard Fashion-MNIST dataset. After training it, I achieved a testing accuracy of 90%, which was good enough for my end goal. Note the time the model takes to run the test dataset.
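As a rough illustration of this step, here is a minimal sketch of training a small classifier on Fashion-MNIST and timing its run over the test set. The architecture, the number of epochs, and the helper names `timed_call` and `train_and_evaluate` are assumptions made for illustration; the original model may well have looked different, and the accuracy you get will depend on these choices.

```python
import time

# Fashion-MNIST class names, indexed by label id (0-9).
CLASS_NAMES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]


def timed_call(fn, *args, **kwargs):
    """Run fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def train_and_evaluate(epochs=5):
    """Train an assumed small dense network and time the test-set evaluation."""
    import tensorflow as tf  # imported lazily so the helpers above work without it

    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
    x_train, x_test = x_train / 255.0, x_test / 255.0  # scale pixels to [0, 1]

    # Assumed architecture: one hidden layer, softmax over the 10 classes.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(28, 28)),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(len(CLASS_NAMES), activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    model.fit(x_train, y_train, epochs=epochs, verbose=0)

    # Time only the pass over the 10,000 test images.
    (loss, accuracy), seconds = timed_call(model.evaluate, x_test, y_test, verbose=0)
    return accuracy, seconds
```

Calling `train_and_evaluate()` returns the test accuracy together with the time taken by the evaluation pass, which is the number to compare against the engine later.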
We keep the test images and labels in a separate directory. The images are stored in the .pgm format for ease of handling.
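Part of what makes .pgm convenient is how simple it is: a short text header followed by raw pixel bytes. The sketch below writes and reads a binary (P5) PGM file using only the standard library. The function names `write_pgm` and `read_pgm` are hypothetical helpers, and the reader deliberately ignores optional header comments, so treat it as an illustration rather than a full parser.

```python
from pathlib import Path


def write_pgm(path, pixels, max_value=255):
    """Write a 2-D list of grayscale values (0-255) as a binary (P5) PGM file."""
    height = len(pixels)
    width = len(pixels[0])
    header = f"P5\n{width} {height}\n{max_value}\n".encode("ascii")
    body = bytes(value for row in pixels for value in row)
    Path(path).write_bytes(header + body)


def read_pgm(path):
    """Read a P5 PGM file produced by write_pgm (header comments unsupported)."""
    data = Path(path).read_bytes()
    # Split off exactly three header lines; the rest is raw pixel data,
    # which may itself contain newline bytes.
    magic, dims, max_value, body = data.split(b"\n", 3)
    if magic != b"P5":
        raise ValueError("not a binary PGM file")
    width, height = (int(n) for n in dims.split())
    rows = [list(body[r * width:(r + 1) * width]) for r in range(height)]
    return rows, int(max_value)
```

A 28x28 Fashion-MNIST test image would be passed to `write_pgm` as 28 rows of 28 integer pixel values.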
2. Save the model as a frozen inference graph and then convert it to a .uff file:
Save the model as a frozen inference graph, a ‘.pb’ file. To carry this model over from Python to C++, we convert it to a ‘.uff’ file (UFF, the Universal Framework Format). This is the format that TensorRT’s UFF parser reads, so the C++ program can load the network and its weights from it.
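The conversion itself can be sketched with the `uff` Python package that shipped with TensorRT releases supporting the UFF parser (later releases deprecated UFF in favor of ONNX). The file name and the output node name in the usage note are placeholders, and `default_uff_path` and `convert_frozen_graph_to_uff` are hypothetical helpers, so this is an outline under those assumptions rather than the exact script used here.

```python
from pathlib import Path


def default_uff_path(frozen_graph_path):
    """Derive the .uff output path from the frozen graph's .pb path."""
    return str(Path(frozen_graph_path).with_suffix(".uff"))


def convert_frozen_graph_to_uff(frozen_graph_path, output_nodes):
    """Convert a frozen TensorFlow graph (.pb) into a .uff file next to it.

    output_nodes is the list of output node names in the frozen graph;
    the right names depend on how the model was built and frozen.
    """
    import uff  # imported lazily; only available with a TensorRT install that includes it

    uff_path = default_uff_path(frozen_graph_path)
    uff.from_tensorflow_frozen_model(
        frozen_file=frozen_graph_path,
        output_nodes=output_nodes,
        output_filename=uff_path,
    )
    return uff_path
```

For example, `convert_frozen_graph_to_uff("fashion_mnist.pb", ["dense_1/Softmax"])` would write `fashion_mnist.uff`, assuming the graph's output node really has that name.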
3. Create a CUDA Engine using the TensorRT C++ API:
You can download TensorRT through this link. Using the C++ API involves three main stages: loading the pre-trained model, creating a CUDA Engine, and running the test dataset through it. The codebase is quite similar for most cases and can be created by referring to the sample programs shipped with TensorRT. After building the program, an .exe file is generated. This executable runs without Python or TensorFlow installed, on any machine with a supported NVIDIA GPU and the TensorRT runtime libraries.
Note: Check my GitHub for more information and step-by-step instructions.
4. Compare the accuracy and latency of the TensorFlow model and the CUDA Engine:
To compare the latency and accuracy of the TensorFlow model and the C++ CUDA Engine, both were tested on the Fashion-MNIST test dataset of 10,000 images.
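A comparison like this boils down to running the same images through each predictor and recording how many predictions are correct and how long the run takes. The sketch below shows one way to do that in a backend-agnostic manner. The `benchmark` function and the idea of passing a per-image `predict_fn` are assumptions for illustration; the actual measurements in this project were taken inside TensorFlow and inside the C++ program respectively.

```python
import time


def benchmark(predict_fn, images, labels):
    """Run predict_fn over every image and report accuracy and latency.

    predict_fn takes one image and returns a predicted label.
    """
    start = time.perf_counter()
    predictions = [predict_fn(image) for image in images]
    total_seconds = time.perf_counter() - start

    correct = sum(1 for predicted, actual in zip(predictions, labels) if predicted == actual)
    return {
        "accuracy": correct / len(labels),
        "total_seconds": total_seconds,
        "ms_per_image": 1000.0 * total_seconds / len(images),
    }
```

Running `benchmark` once per backend on the same images and labels gives two dictionaries that can be compared directly.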
The TensorFlow model achieved an accuracy of 90% with a latency of 10 seconds.
Building a CUDA Engine using TensorRT’s C++ API:
To run the same test in C++, a CUDA Engine was built, with the .uff file supplying the network and its weights.
After building the engine, we fed it the same 10,000 images of the Fashion-MNIST test dataset. It achieved an accuracy of 90% with a latency of 4.46 seconds.
As you can see, a CUDA Engine significantly reduces the latency of the program while keeping the accuracy almost the same. This is why C++ is used in production-level projects: it gives a robust system and efficient execution. Saving about five and a half seconds may not sound like much, but when the system is making hundreds of calculations each second, the difference compounds significantly.
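To put the two measurements in perspective, the arithmetic can be worked out directly from the figures reported above (10 seconds and 4.46 seconds over 10,000 images). The `summarize` function is just an illustrative helper.

```python
NUM_IMAGES = 10_000
TENSORFLOW_SECONDS = 10.0   # reported latency of the TensorFlow model
TENSORRT_SECONDS = 4.46     # reported latency of the CUDA Engine


def summarize(baseline_seconds, optimized_seconds, num_images):
    """Derive per-image latency, speedup, and time saved from two total timings."""
    return {
        "baseline_ms_per_image": 1000.0 * baseline_seconds / num_images,
        "optimized_ms_per_image": 1000.0 * optimized_seconds / num_images,
        "speedup": baseline_seconds / optimized_seconds,
        "seconds_saved": baseline_seconds - optimized_seconds,
        "percent_reduction": 100.0 * (baseline_seconds - optimized_seconds) / baseline_seconds,
    }


summary = summarize(TENSORFLOW_SECONDS, TENSORRT_SECONDS, NUM_IMAGES)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

This works out to roughly 1.0 ms per image for the TensorFlow model against about 0.446 ms for the engine, a speedup of about 2.24x, or a 55.4% reduction in total time.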
For a detailed explanation and the code base of the project, here is a link to my GitHub.