Deploy AutoML protobuf model on NVIDIA Jetson

Bastian Schoettner, REWE Digital
Aug 23, 2019

After training a classification model with AutoML Vision, Google offers multiple options to deploy it. At the time of writing, those are TF LITE for mobile devices, CONTAINER for x86_64 workstations, TF LITE quantized for Edge TPUs, Protobuf for workstations and GOOGLE CLOUD for online predictions. However, we planned to run the Protobuf model as a prototype on an NVIDIA Jetson platform and on local workstations with the same code base. Google’s container solution is not feasible here, since there is no container available for the Jetson’s architecture. We experimented with multiple solutions and concluded that a native Python application suits us best.

Google did not provide much documentation for running this specific model with TensorFlow at the time, so we created our own solution. We hope this helps others get prototypes running faster and understand the AutoML models better without going into too much detail. The implementation of the application can be found at https://github.com/r-i-rewe-digital/automl-tutorial. The code comes with an example model for cat and dog classification.

Set up the Environment

On a Jetson, not much setup is needed. We tested on a system with JetPack 4.2 and TensorFlow-GPU 1.13.1 (available here) installed. The only additional dependency you need is Flask. When running on your local workstation, you can use pip to install the dependencies.

cd <path to repo>/automl-tutorial
pip3 install -r requirements.txt

If you have a working NVIDIA driver and CUDA setup, you can also install

tensorflow-gpu==1.13.1

to speed up the model prediction. If you are not familiar with this setup, we strongly recommend skipping this step for now.
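A quick way to verify that TensorFlow actually sees the GPU is the following sanity check (not part of the application itself):

import tensorflow as tf

# Prints True if CUDA is set up correctly and a GPU device is visible
print(tf.test.is_gpu_available())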

Implement the Flask Application

We first create a minimal Flask application in prediction_controller.py, which provides two REST endpoints: one POST endpoint to make a prediction with the trained model and one GET endpoint to retrieve all string labels.

Let’s take a look at the POST part. All data required for the prediction is sent in the request body. The response provides the labels and the prediction scores, ordered by the highest score. This is a straightforward approach that does not contain any error handling.

Flask POST endpoint
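As a rough sketch, a minimal controller along these lines could look as follows. The /api/labels route and the PredictionService interface are illustrative assumptions and may differ from the repository code; only the /api/predict path and port 5002 are confirmed by the curl example later in this article.

# prediction_controller.py -- minimal sketch, no error handling
from flask import Flask, jsonify, request

from prediction_service import PredictionService  # assumed module name

app = Flask(__name__)
prediction_service = PredictionService()


@app.route("/api/predict", methods=["POST"])
def predict():
    body = request.get_json()
    result = prediction_service.predict(
        identifier=body["identifier"],
        base64_image=body["base64Image"],
        prediction_results=body.get("prediction_results", 10),
    )
    return jsonify(result)


@app.route("/api/labels", methods=["GET"])
def get_labels():
    return jsonify(prediction_service.labels)


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5002)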

The request body of the POST call contains an identifier, the base64-encoded image (as a string) and the optional parameter prediction_results to limit the number of returned results. This comes in handy if you have many labels and are only interested in the highest scores. The content type of the request is application/json.

{
    "identifier": "your-identifier",
    "prediction_results": 1,
    "base64Image": "base64encodedImage"
}

Implement the Prediction Service

To run a prediction within a TensorFlow session, you need to know the properties of the input and the output. If you have ever worked with a classification CNN before, you already have an idea of what the input and output will look like. It is common to use the channels of an RGB image matrix as the input format, so the shape would be something like a 608x608x3 tensor (ignoring the batch size). The output would be a float array containing the prediction scores. The export of the TF LITE model provides a label file and a metadata file that describes the model input and output. The input looks very much as expected.

{
    "batchSize": 1,
    "imageChannels": 3,
    "imageHeight": 224,
    "imageWidth": 224,
    "inferenceType": "QUANTIZED_UINT8",
    "inputTensor": "image",
    "inputType": "QUANTIZED_UINT8",
    "outputTensor": "scores",
    "supportedTfVersions": [
        "1.10",
        "1.11",
        "1.12",
        "1.13"
    ]
}

We can see the input tensor and the output tensor, and we get information about the input format. Next, we check the operations of the graph from the model that was exported as a protobuf file, not as TF LITE.

Print all operations in the graph
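Assuming the export is a frozen GraphDef, listing the operations could look roughly like this (the model path is a placeholder for wherever you stored the exported .pb file):

import tensorflow as tf

# Load the exported protobuf into a GraphDef and import it into a new graph
with tf.gfile.GFile("model/saved_model.pb", "rb") as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name="")

# Print every operation name and type in the graph
for op in graph.get_operations():
    print(op.name, op.type)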

Unfortunately, the input/output tensors described in the TF LITE metadata do not exist in the protobuf model. To figure out which operations of the graph are the correct input and output, we need to take a closer look at the model. We opt to analyze the graph with Tensorboard.

Visualize the TensorFlow session with Tensorboard
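One way to do this is to import the frozen graph as shown above and write it to a log directory that Tensorboard can read (log directory and model path are arbitrary placeholders):

import tensorflow as tf

# Load and import the frozen graph as before
with tf.gfile.GFile("model/saved_model.pb", "rb") as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name="")
    # Dump the graph so Tensorboard can visualize it
    tf.summary.FileWriter("logs/automl", graph).close()

# Then run: tensorboard --logdir logs/automl
# and open http://localhost:6006 in a browser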

At first glance it is hard to understand the session’s graph. To alleviate this, we go through the graph step by step to better understand the model.

Tensorboard visualization of the model

Let’s begin with the most important section of the graph, mnas_v4. This is the feature network designed by Google as a state-of-the-art classification network (https://ai.googleblog.com/2018/08/mnasnet-towards-automating-design-of.html). If you are familiar with classification models, you might have read about it already. Other alternatives are ResNet, VGG, AlexNet, Inception, etc. We now know which CNN is used as the feature network. We do not discuss the MnasNet architecture in detail here.

MnasNet feature_network

We need to find the input and output operations that we want to use in the session. The input tensor to the graph seems to be the ‘Placeholder’ operation, which takes a string as input format. This is very different from the expected 224x224x3 format. Instead of preprocessing the image and using the different channels as input, preprocessing is moved directly into the ML model.

The map block does the preprocessing

You simply input the whole image as bytes. Preprocessing is performed in the map node and should only be added to prediction models, not to the training model. During training, you expect that preprocessing has already been done to save resources. We do not take a closer look at the map block here; it provides internal loading and decoding of images. If you are interested in the details, you can run Tensorboard on the example code and check it yourself. For the moment, we are satisfied knowing why the input shape is different from what we expected.

Now we investigate the output format. The scores tensor from the TF LITE model is available, but the labels file is missing. So how do we know how to interpret the scores? It can be seen that the score tensor’s output is processed even further and tiled with something that comes from ‘Const_1’. Taking a closer look at the operation, it can be seen that the constant contains the labels as strings. This modification of the input and output format allows the model to be deployed all by itself, without any knowledge of input formats or output labels. You can easily replace the model file with an arbitrary model without having to change any configuration or code, which makes it very robust for production scenarios.

Const_1 holds all the labels as strings

We now know the input and output tensor and can start to build the prediction code.

Implement the Prediction

There is little left to do at this point. We create the session, load the protobuf file and define the input and output tensors. Because we need to retrieve the labels only once, we can run the session to get them at the beginning. This saves time when we do the prediction later.

Prediction Service class
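A sketch of the constructor could look like this. The tensor names ‘Placeholder:0’, ‘scores:0’ and ‘Const_1:0’ follow the analysis above, but you should verify them against your own export in Tensorboard:

# prediction_service.py -- constructor sketch: load the graph, create the
# session and cache the labels once
import tensorflow as tf


class PredictionService:

    def __init__(self, model_path="model/saved_model.pb"):
        graph_def = tf.GraphDef()
        with tf.gfile.GFile(model_path, "rb") as f:
            graph_def.ParseFromString(f.read())

        graph = tf.Graph()
        with graph.as_default():
            tf.import_graph_def(graph_def, name="")

        self.session = tf.Session(graph=graph)
        # Input: raw encoded image bytes, output: prediction scores
        self.input_tensor = graph.get_tensor_by_name("Placeholder:0")
        self.output_tensor = graph.get_tensor_by_name("scores:0")

        # The labels are baked into the graph as a constant, so a single run
        # without any input is enough to cache them as plain strings
        label_tensor = graph.get_tensor_by_name("Const_1:0")
        self.labels = [label.decode("utf-8")
                       for label in self.session.run(label_tensor)]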

The prediction itself is even simpler. The service constructor took care of all the model loading and session creation. All we need to do now is run the image through the TensorFlow session and return the prediction results.

run the prediction with your TensorFlow session
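The predict method of the class sketched above could then look like this, again as a sketch assuming the same tensor names:

import base64

import numpy as np


# Method of the PredictionService class sketched above
def predict(self, identifier, base64_image, prediction_results=10):
    # The graph expects the raw encoded image bytes as a batch of size one;
    # the map block inside the model handles decoding and resizing
    image_bytes = base64.b64decode(base64_image)
    scores = self.session.run(
        self.output_tensor,
        feed_dict={self.input_tensor: [image_bytes]})[0]

    # Sort by score, highest first, and cut off after the requested number
    order = np.argsort(scores)[::-1][:prediction_results]
    return {
        "identifier": identifier,
        "labels": [self.labels[i] for i in order],
        "scores": [float(scores[i]) for i in order],
    }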

Run the Application

Now that we have everything set up, we can run the application and start sending prediction requests. Use a REST client like Postman or curl to send requests. The repository contains an example request containing a dog.

curl -H "Content-Type: application/json" -X POST -d @example-request.json http://localhost:5002/api/predict

The result is returned in a human-readable form and can be processed by your other applications.

{
    "identifier": "cat-dog-example-01",
    "labels": [
        "dog"
    ],
    "scores": [
        0.8664201498031616
    ]
}

Conclusion

AutoML is a great tool for everyone who does not have practical experience with CNNs. It provides easy access to creating fast, high-quality classification prototypes. Especially the UI for managing your dataset is an enormously helpful tool, which on its own would make a great addition to the Google Cloud stack. While there are currently only a limited number of supported platforms, we expect fast growth once the product progresses further through the beta stage. Unfortunately, AutoML currently allows for very little customization, thus limiting its applicability to real production problems. Especially for classification, an input resolution higher than 224x224 might drastically boost the performance of your model. Nevertheless, we find that building a first prototype with AutoML and replacing it later is a suitable approach to tackling ML challenges.

We are excited to see where this product is heading in the near future.

Special thanks to Rutger Bezema and Joscha Foßel for proofreading this work.
