Running Pytorch models in production

Bartol Freškura
Styria.AI Tech Blog
5 min read · Jan 7, 2019

Pytorch and Tensorflow are two widely used frameworks that have become today’s standard in deep learning. Tensorflow, Google’s brainchild released in 2015, has been the most popular deep learning framework ever since, but now there is a new kid on the block. With its first stable version released a few weeks back, Facebook’s Pytorch is a deep learning framework that has won the hearts of many researchers and developers thanks to its simplicity, dynamic graphs, and an overall more natural development experience compared to Tensorflow. When it comes to preferences, some may prefer Tensorflow, some Pytorch; we’re not here to judge but to make the best out of both.

At Styria.ai, we have had a lot of experience with Tensorflow, using it not only to prototype solutions but also to deploy complete deep learning pipelines that can support thousands of requests per second. That experience taught us that Tensorflow’s deployment component, Tensorflow Serving, is the right solution for deploying our deep learning pipelines. Serving offers some neat features, including automatic model switching, an integrated API, and blazing fast execution, because it is all written in optimized C++.

During the past few months, we have been experimenting with Pytorch to train and evaluate our models. Everyone on our team who has tried developing in Pytorch found it superior to the Tensorflow development process. Pytorch code is more Pythonic and so simple to write that we managed to recreate one of our categorization pipelines in just two weeks, work that had taken months in Tensorflow!

Given the reasoning above, the conclusion was to use Pytorch for model development, training, and evaluation, while keeping Tensorflow in production (Pytorch has also become production-ready as of v1.0, but we haven’t tested it yet). The only remaining problem was the bridge between the two libraries, and here we opted for ONNX, a universal format for deep learning models. ONNX enabled us to convert the Pytorch model to a frozen Tensorflow graph compatible with Tensorflow Serving. The full conversion script is here:

The script converts the pre-trained AlexNet model to the Tensorflow Serving format.

The idea is to first convert the Pytorch model to an ONNX format, followed by the conversion from ONNX to Tensorflow Serving.

export_onnx is the function responsible for converting Pytorch models to the universal ONNX format. Most of the knowledge on the Pytorch -> ONNX conversion is here, so we won’t go into much detail, but we do want to mention one thing the Pytorch documentation leaves out. Pytorch still does not support exporting ONNX models with a dynamic batch size, so we used a workaround for this deficiency. Because Tensorflow itself supports dynamic batch sizes, the trick lies in replacing the batch dimension with an arbitrary string value. make_variable_batch_size iterates over the first num_inputs input nodes of the ONNX graph and replaces their first dimension with the string 'batch_size'. This instructs the ONNX -> Tensorflow converter to put ? as the batch dimension, resulting in dynamic batch support in the Tensorflow model. The Pytorch -> ONNX converter supports multiple inputs and outputs, so we have also included code that handles this use case.
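To make the workaround concrete, here is a minimal sketch of what these two functions can look like (the signatures are our illustration, not the exact code from the script; the input_img and confidences names match the log output below):

import onnx
import torch
import torchvision

def export_onnx(model, dummy_input, onnx_path, input_names, output_names):
    # Trace the Pytorch model with a dummy input and write the ONNX graph
    torch.onnx.export(model, dummy_input, onnx_path,
                      input_names=input_names, output_names=output_names)

def make_variable_batch_size(onnx_path, num_inputs):
    # Replace the fixed batch dimension of the first num_inputs graph inputs
    # with a symbolic name so the ONNX -> Tensorflow converter emits ?
    onnx_model = onnx.load(onnx_path)
    for graph_input in onnx_model.graph.input[:num_inputs]:
        graph_input.type.tensor_type.shape.dim[0].dim_param = 'batch_size'
    onnx.save(onnx_model, onnx_path)

model = torchvision.models.alexnet(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)
export_onnx(model, dummy_input, 'alexnet.onnx', ['input_img'], ['confidences'])
make_variable_batch_size('alexnet.onnx', num_inputs=1)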

To convert the ONNX model to a Tensorflow one, we will use the onnx-tensorflow library.
Going through the code in more detail, you’ll notice that the conversion has an extra step we haven’t mentioned yet. The script does not make a direct ONNX -> Tensorflow Serving conversion, but first converts ONNX to a Tensorflow proto file. This extra step is required because the onnx-tensorflow library cannot convert directly from ONNX to the Tensorflow Serving format. The final step is to generate the Tensorflow Serving format using the export_for_serving function.
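Again as a sketch rather than the exact script (Tensorflow 1.x API; the tensor names in the example call are taken from the log output below), the two remaining steps can look roughly like this:

import onnx
import tensorflow as tf
from onnx_tf.backend import prepare

def export_tf_proto(onnx_path, pb_path):
    # Convert the ONNX graph to a frozen Tensorflow GraphDef (.pb) file
    onnx_model = onnx.load(onnx_path)
    tf_rep = prepare(onnx_model)
    tf_rep.export_graph(pb_path)

def export_for_serving(pb_path, inputs, outputs, export_dir):
    # inputs/outputs map signature keys to tensor names in the converted graph,
    # e.g. {'input_img': 'input_img:0'} and {'confidences': 'add_8:0'}
    with tf.Graph().as_default() as graph:
        graph_def = tf.GraphDef()
        with tf.gfile.GFile(pb_path, 'rb') as f:
            graph_def.ParseFromString(f.read())
        tf.import_graph_def(graph_def, name='')
        with tf.Session(graph=graph) as sess:
            signature = tf.saved_model.signature_def_utils.predict_signature_def(
                inputs={key: graph.get_tensor_by_name(name) for key, name in inputs.items()},
                outputs={key: graph.get_tensor_by_name(name) for key, name in outputs.items()})
            builder = tf.saved_model.builder.SavedModelBuilder(export_dir)
            builder.add_meta_graph_and_variables(
                sess, [tf.saved_model.tag_constants.SERVING],
                signature_def_map={'predict_images': signature})
            builder.save()

export_tf_proto('alexnet.onnx', 'alexnet.pb')
export_for_serving('alexnet.pb',
                   inputs={'input_img': 'input_img:0'},
                   outputs={'confidences': 'add_8:0'},
                   export_dir='./tf_serving_model/1')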

If everything went successfully, you will see an output similar to this one:

INFO:__main__:Input info:
{'input_img': name: "input_img:0"
dtype: DT_FLOAT
tensor_shape {
  dim {
    size: -1
  }
  dim {
    size: 3
  }
  dim {
    size: 224
  }
  dim {
    size: 224
  }
}
}
INFO:__main__:Output info:
{'confidences': name: "add_8:0"
dtype: DT_FLOAT
tensor_shape {
  dim {
    size: -1
  }
  dim {
    size: 1000
  }
}
}

In this output, you can check whether the input and output dimensions match the expected values. The batch size dimension should be -1, indicating dynamic batch size support.

Model inference test

If the conversion was a success, you will find a newly created directory containing the Tensorflow Serving model. saved_model_cli is a handy script that ships with the Tensorflow library and can be used to run model inference on arbitrary input data. Our model expects a batch of normalized float32 images, stored as a Numpy array and saved to disk in the .npy format:

import numpy as np

# The model expects float32 input, so cast explicitly (randn returns float64)
x = np.random.randn(64, 3, 224, 224).astype(np.float32)
np.save('input_img.npy', x)

To test the model inference, run:

saved_model_cli run --tag_set serve --signature_def predict_images --dir /path/to/tf_serving_model_dir/ --inputs input_img=/path/to/input_img.npy

The output should be a list of confidences for all categories.

Gotchas

Although the conversion is possible, there are some limitations we ran into while producing the conversion script.

Supported operations

The Pytorch docs list many supported operations, stating that the list is sufficient to convert some well-known deep learning models such as ResNet, SuperResolution, and word_language_model. The range of operations is quite large; however, if you need an op that isn’t supported, Pytorch lets you implement a custom op if you have the time and skill.

Tensorflow limitations

We have discovered that some pooling operations do not support the NCHW input format when model inference runs on a CPU. The remedy for this issue is not simple because Pytorch currently does not support the NHWC input format in many operations.

Another limitation is tied to how Tensorflow implements padding in pooling and convolution operations. Tensorflow is limited to two padding modes, VALID and SAME. Pytorch, on the other hand, accepts arbitrary padding values, which results in an incompatibility between Pytorch and Tensorflow layers. There are two solutions to this issue:

  1. Carefully pick Pytorch padding values so they match the padding produced by one of the two supported Tensorflow modes
  2. Manually pad the feature map before the pooling/convolution operation. The padding op is supported in both ONNX and Tensorflow, and the trick is that Tensorflow’s padding op accepts arbitrary values. This lets you use any padding values in Pytorch while staying within Tensorflow’s padding limitations (see the sketch below).
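A toy example of the second option (our own illustration, not part of the conversion script): pad the feature map explicitly, then run the pooling with padding=0, so the standalone padding op survives the ONNX -> Tensorflow conversion with arbitrary values:

import torch.nn as nn

# Instead of passing an asymmetric padding to the pooling layer directly,
# pad the input explicitly (left, right, top, bottom) and pool with padding=0
pool = nn.Sequential(
    nn.ConstantPad2d((0, 1, 0, 1), value=0.0),
    nn.MaxPool2d(kernel_size=3, stride=2, padding=0),
)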
