Optimizing TensorFlow Models for Serving

In the world of machine learning, a lot of attention is paid to optimizing training. There is a lot less information out there on optimizing prediction. Yet serving models for prediction is where we make our money in ML! Serving performance can have a significant impact on the value of ML for your use case. Indeed, the cost of serving predictions may be a major factor in the total return on investment for an ML application. In this post we will show you some ways to optimize TensorFlow models for serving predictions, to help you reduce the cost and increase the performance of your ML solution. This work was done in collaboration with my colleague on the Google Cloud Solutions Architect team, Khalid Salama. The code for this post can be found here on GitHub.

Latency (and size) matters

When it comes to optimizing models for serving, we care primarily about three things:

  • Model size
  • Prediction speed
  • Prediction throughput

In serving ML, model size matters. Of course smaller models use less memory, less storage and network bandwidth, and they load faster. In some cases hardware memory constraints or service limitations may impose a limit on model size. For example the Machine Learning Engine service on Google Cloud sets a default size limit of 250MB for models. When we use hardware acceleration for prediction, we need to make sure our model fits within the memory of the acceleration device. Model size has a particular impact in situations where we are serving the model on an edge or mobile device with limited capabilities. We want the model to download as fast as possible, using the least amount of network bandwidth, and take up as little memory and storage footprint as possible.

Prediction speed is another metric we care about for serving. When we perform our predictions online, we typically want results to be returned as fast as possible. In many online applications, serving latency is critical to user experience and application requirements. But we care about prediction speed even when we process our predictions in batch. Prediction speed has a direct relationship to the cost of serving, since it is directly related to the amount of compute resources necessary to make a prediction. The time it takes to make a prediction will always be a critical variable in any formula that measures prediction throughput. Faster predictions mean more prediction throughput on the same hardware, which translates into reduced cost.

Prediction throughput is a measure of how many predictions our system can perform in a given slice of time. Apart from prediction speed as just mentioned, other system attributes come into play to determine throughput, including batching of predictions, hardware acceleration, load balancing and horizontal scaling of serving instances. We will not discuss techniques to optimize prediction throughput in this article beyond the optimization of the time of prediction for a single input example.
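
To make the relationship between prediction time and throughput concrete, here is a minimal sketch. The function name `throughput` and the assumption that replicas scale perfectly are ours, for illustration only; real systems rarely scale this cleanly.

```python
def throughput(batch_size, latency_seconds, replicas=1):
    """Predictions per second, assuming `replicas` identical serving
    instances that scale perfectly (an idealized assumption)."""
    if latency_seconds <= 0:
        raise ValueError("latency must be positive")
    return replicas * batch_size / latency_seconds


# Halving the time per batch doubles throughput on the same hardware.
print(throughput(batch_size=100, latency_seconds=0.5))   # 200.0
print(throughput(batch_size=100, latency_seconds=0.25))  # 400.0
```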

Model Formats in TensorFlow

In its relatively short life, TensorFlow has managed to accumulate several model serialization formats. Discussion of all these formats is beyond our current scope; this page in the TensorFlow documentation provides a nice summary of them. The most important ones to know about are the GraphDef and SavedModel formats. A GraphDef is a Protocol Buffers (protobuf) message, serialized in either text or binary form, that encodes the definition of a TensorFlow graph. A GraphDef can also include the weights of a trained model as we will see later, but it doesn’t have to: the weights can be stored as separate checkpoint files. The SavedModel format combines a GraphDef (actually a MetaGraphDef, which we’ll also discuss later) with checkpoint files that store weights, all collected in a folder. See this page in the TensorFlow docs for more information about SavedModels. In this post we will work with both GraphDef and SavedModel formats.

Tools and Techniques

There are several techniques in TensorFlow that allow you to shrink the size of a model and improve prediction latency. Here are some of them:

  • Freezing: Convert the variables stored in a checkpoint file of the SavedModel into constants stored directly in the model graph. This reduces the overall size of the model.
  • Pruning: Strip nodes that are not used in the prediction path or the outputs of the graph, merge duplicate nodes, and clean up other ops such as summary and identity nodes.
  • Constant folding: Look for any sub-graphs within the model that always evaluate to constant expressions, and replace them with those constants.
  • Folding batch norms: Fold the multiplications introduced in batch normalization into the weight multiplications of the previous layer.
  • Quantization: Convert weights from floating point to lower precision, such as 16 or 8 bits.
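
As an illustration of the constant folding idea, here is a toy sketch on a graph represented as a plain Python dictionary. This representation and the `fold_constants` helper are our own simplification, not TensorFlow's data structures; it only handles two-input `Add` and `Mul` nodes.

```python
import operator

# Toy op table; a real graph has many more op types.
OPS = {"Add": operator.add, "Mul": operator.mul}


def fold_constants(graph):
    """Return a copy of `graph` in which every Add/Mul node whose inputs
    are all constants has been replaced by a single Const node."""
    folded = dict(graph)
    changed = True
    while changed:  # repeat until no more nodes can be folded
        changed = False
        for name, node in list(folded.items()):
            if node["op"] not in OPS:
                continue
            inputs = [folded[i] for i in node["inputs"]]
            if all(n["op"] == "Const" for n in inputs):
                a, b = (n["value"] for n in inputs)
                folded[name] = {"op": "Const", "value": OPS[node["op"]](a, b)}
                changed = True
    return folded


graph = {
    "two": {"op": "Const", "value": 2.0},
    "three": {"op": "Const", "value": 3.0},
    "x": {"op": "Placeholder"},
    "scale": {"op": "Mul", "inputs": ["two", "three"]},  # always 6.0
    "y": {"op": "Mul", "inputs": ["x", "scale"]},        # depends on the input
}
folded = fold_constants(graph)
print(folded["scale"])    # {'op': 'Const', 'value': 6.0}
print(folded["y"]["op"])  # Mul
```

After folding, the `two` and `three` nodes are no longer referenced by anything; a subsequent pruning pass would remove them.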

In this post we will show how to perform each of the techniques listed above. The optimization process we discuss here will suffice to perform basic serving optimization for a broad array of models.

We don’t have the space here to treat any of these techniques in great depth. Quantization, in particular, is a large topic, worthy of several posts by itself. Of course for mobile deployment there is also TFLite, which performs 8 bit quantization on models for mobile. Other optimization techniques we will not discuss include fusing convolution and AOT compilation with tfcompile.

We will use the TensorFlow Graph Transform Tool, a C++ command-line tool, to perform many of the optimizations. We will show how to use Python APIs to employ the tool. The Graph Transform Tool is designed to work on models that are saved as GraphDef files in the protobuf format. However, the SavedModel format is the most modern, and the one most supported by other tools and services. For example, the model exported after training an Estimator is in SavedModel format. It is the only format supported by Cloud Machine Learning Engine for prediction. So after we optimize our model we will convert it back to the SavedModel format. We will also show how to use the saved_model_cli tool to output the MetaGraphDef definition in the SavedModel, and the Python API for the TensorFlow saved_model package to inspect GraphDefs.

The optimization steps we’ll follow here, along with the model format transitions, are:

  1. Freeze the SavedModel: SavedModel ⇒ GraphDef
  2. Optimize the frozen model: GraphDef ⇒ GraphDef
  3. Convert the optimized frozen model back to SavedModel: GraphDef ⇒ SavedModel

Generating the Model

Let’s get started. First we need a model. We will use one trained on the “Hello World” deep learning data set, MNIST. Here is the code for our model, a simple CNN classifier for MNIST. This code is available as a notebook and as a script in the GitHub repo.

Here is the code to train the model and export a SavedModel:

The model should take about a minute to train, as we set the max_steps value in the TrainSpec to just 50. We aren’t striving for accuracy here; we just want an example of an exported model. Run this code to perform training and export a SavedModel:

train_data, train_labels, eval_data, eval_labels = load_mnist_keras()
export_dir = train_and_export_model(train_data, train_labels)

Inspecting the SavedModel

Now that we have a SavedModel, let’s inspect the contents. Here is an image of the graph of the model from TensorBoard, where we set num_conv_layers= 3 and hidden_units=[512,512]:

We can use the saved_model_cli tool from the TensorFlow code base to output the MetaGraphDef from the SavedModel. What is a MetaGraphDef? It is a GraphDef with additional information about the “signature” of the model, namely the inputs and outputs. Let’s run the saved_model_cli tool on our just-exported SavedModel:

$ saved_models_base=models/mnist/cnn_classifier/export
$ saved_model_dir=${saved_models_base}/$(ls ${saved_models_base} | tail -n 1)
$ saved_model_cli show --dir=${saved_model_dir} --all

And here is the output, showing the signature of inputs and outputs to our graph. The MetaGraphDef format can contain multiple signature definitions, but the estimator.export_savedmodel method exports only a single signature labeled ‘serving_default’:

MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:
signature_def['serving_default']:
The given SavedModel SignatureDef contains the following input(s):
inputs['input_image'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 28, 28)
name: serving_input_image:0
The given SavedModel SignatureDef contains the following output(s):
outputs['softmax'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 10)
name: softmax/Softmax:0
Method name is: tensorflow/serving/predict

The ‘input_image’ input is the one we defined in the make_serving_input_receiver_fn above. The ‘softmax’ output is defined in the keras model function.

Let’s inspect the GraphDef portion of the MetaGraphDef structure in the SavedModel. We use the aforementioned Python APIs for the TensorFlow saved_model module to load the SavedModel and obtain the GraphDef from the MetaGraphDef:

The following code shows how to display information from the GraphDef. We can output lists of the names of various nodes in our graph, as well as counts of various node types.

And here is the output for our SavedModel:


Input Feature Nodes: [u'serving_input_image', u'input_image']
Unused Nodes: []
Output Nodes: [u'softmax/Softmax']
Quantization Nodes: []
Constant Count: 61
Variable Count: 97
Identity Count: 30
Total nodes: 308
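
A summary of this kind can be produced by counting node op types. The sketch below is our own simplified stand-in, not the describe_graph method from the sample code. It works on any sequence of objects with `name` and `op` attributes, which is how the nodes of a GraphDef present themselves; here we use `SimpleNamespace` objects so the example runs without TensorFlow.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_nodes(nodes):
    """Count common node types in a sequence of objects that expose
    `.name` and `.op` attributes."""
    nodes = list(nodes)
    counts = Counter(node.op for node in nodes)
    return {
        "inputs": [n.name for n in nodes if n.op == "Placeholder"],
        "constants": counts["Const"],
        "variables": counts["Variable"] + counts["VariableV2"],
        "identities": counts["Identity"],
        "total": len(nodes),
    }


# Hypothetical nodes, standing in for the nodes of a real graph.
nodes = [
    SimpleNamespace(name="serving_input_image", op="Placeholder"),
    SimpleNamespace(name="dense/kernel", op="VariableV2"),
    SimpleNamespace(name="dense/kernel/read", op="Identity"),
    SimpleNamespace(name="Reshape/shape", op="Const"),
    SimpleNamespace(name="softmax/Softmax", op="Softmax"),
]
summary = summarize_nodes(nodes)
print(summary["inputs"])  # ['serving_input_image']
print(summary["total"])   # 5
```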

Let’s take a look at the model size, using the following code:

The size of a SavedModel can be roughly divided into the size of the GraphDef and the size of the Variables (i.e. the weights of the model). Run the above code to obtain the size of our model before any optimization:

models/mnist/keras_classifier/export/1540846525
Model size: 57.453 KB
Variables size: 10691.978 KB
Total Size: 10749.431 KB
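
For reference, a size breakdown like this can be computed with the standard library alone. This is a sketch under the assumption of the standard SavedModel layout (a `saved_model.pb` file next to a `variables` folder); the function names are ours and are not the helper used in the sample code.

```python
import os


def dir_size_kb(path):
    """Total size in KB of all files under `path`."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for filename in files:
            total += os.path.getsize(os.path.join(root, filename))
    return total / 1024.0


def saved_model_sizes(saved_model_dir):
    """Return (graph KB, variables KB, total KB) for a SavedModel folder."""
    graph_kb = os.path.getsize(
        os.path.join(saved_model_dir, "saved_model.pb")) / 1024.0
    variables_kb = dir_size_kb(os.path.join(saved_model_dir, "variables"))
    return graph_kb, variables_kb, graph_kb + variables_kb
```

Calling `saved_model_sizes(saved_model_dir)` on an exported model returns the three numbers in the same order as the output shown above.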

A Baseline Prediction Benchmark

Now let’s benchmark the time it takes to perform predictions on our unoptimized graph. Before we do this we should say a few words about benchmarking methodology.

Benchmarking ML model performance is a deep topic by itself. We have to proceed carefully here. There are a lot of ways to do this incorrectly, resulting in an inaccurate evaluation of the impact of our model optimizations. There are several methods we might use to benchmark prediction performance of a TensorFlow model. Here are some likely candidates:

  1. Use Python code and the contrib.predictor module to execute predictions against the model on a local development system.
  2. Use the prediction service of Cloud Machine Learning Engine to test speed of predictions made using that API.
  3. Use the REST API of TensorFlow Serving to test the speed of predictions.

There are a few key attributes of any benchmark methodology, whether we are testing ML models or anything else:

  1. Repeatability: that is, the results of our benchmark exhibit a low variance across different runs.
  2. Controlled environment: we want to have a large degree of knowledge and control over the benchmark environment, so we can make definitive and accurate statements about the thing we are testing. If we lack control over the environment then we may be drawing conclusions about a different system than the one we purport to benchmark.
  3. Limited focus: we want to limit the focus of the benchmark as far as possible. In any system as complex as an ML prediction pipeline there will be many system factors that come into play to determine performance. In our case we want to only test, in so far as is possible, the performance differences due to the model graph itself. Ideally, other factors like network performance, the speed of making API calls and processing results, and the available compute power should not influence our benchmark results.

Poor choices with respect to criteria 2 and 3 will typically show up as poor results for criterion 1. In other words, we will see a lot of variability in the benchmark results if we don’t run our tests in a controlled environment with limited focus.
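
One simple way to quantify repeatability is the coefficient of variation (standard deviation divided by mean) of the latencies measured across runs. The sketch below is illustrative; the 5% threshold is an arbitrary assumption of ours, not a standard.

```python
import statistics


def summarize_runs(latencies):
    """Mean, sample standard deviation and coefficient of variation of
    the latencies (in seconds) measured over several benchmark runs."""
    mean = statistics.mean(latencies)
    stdev = statistics.stdev(latencies) if len(latencies) > 1 else 0.0
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean}


def is_repeatable(latencies, max_cv=0.05):
    """True if run-to-run variation is below `max_cv` (assumed threshold)."""
    return summarize_runs(latencies)["cv"] <= max_cv


print(is_repeatable([0.190, 0.189, 0.191]))  # True
print(is_repeatable([0.19, 0.25, 0.15]))     # False
```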

The example code shows how to run inference benchmarks using each of the three technical options above. However, according to criteria 1-3 above, the TensorFlow Serving option is the best candidate. We can easily run the benchmark against a local instance of TF Serving, running in a docker container. This gives us a high degree of control over the environment, and limits the focus of the test as far as possible to the performance of the graph itself. There is minimal impact of network performance since the requests are made on the local network interface. The processing of the prediction graph is solely performed by the TensorFlow runtime and the TF Serving engine, all of which are implemented in C++ code which is more or less deterministic in its execution. We will make the API calls using the REST API of TF Serving, which requires very little processing for setting up and receiving the results.

The Python code method is not the right choice because of lack of control over the environment. The environment in this case is the Python interpreter itself. Because Python does so much under the hood when it is running your code, including allocation of memory, management of data and garbage collection, there is a lot that we do not control. Indeed we observe that when running local tests on the model using the “pure Python”-based inference test in inference_test.py, there is a lot of variability in the benchmark results from run to run. This variability could cause us to incorrectly evaluate the performance impact of our graph optimizations. Of course we could run many tests and average the results to account for this variability. But this makes the benchmark methodology more complex, adding a layer of statistical validation which we’d rather not have to do.

We could also use the prediction service of Google Cloud ML Engine to perform the benchmark; the sample code also shows how to do this. This service is a great choice to easily deploy your production models for serving. And of course everything we discuss here is in the name of optimizing performance of serving in production! But because of the lack of control over the environment, and lack of focus, it does not make a great system for accurately and repeatably quantifying the impact of the graph optimizations. There are too many aspects of the environment for Cloud ML Engine prediction that we do not control as a benchmark environment. An auto-scaling service like CMLE Prediction will automatically deploy resources that scale according to your usage, and those resources may change during the course of the benchmark run. Resource availability may change over time and be different according to the zone you execute in. Since the service is a cloud API, both local and cloud network performance will necessarily come into play, thus expanding the focus of the test to include the network.

Nonetheless we can be confident that if we succeed in optimizing our TensorFlow graphs to produce a significant increase in serving speed, we will see lower latency and lower cost when we deploy our models on CMLE prediction. Note that factors like batch size and model complexity will have an impact on the relative improvement from optimizations in production. We will say a little more about that below.

To run the benchmark we first start a local docker container running TF Serving with our model loaded. The tfserving.sh script in the sample code shows how to do this. In order to use the script you must first install docker and the TF Serving container on your system; see this page for more information. You should run the tfserving.sh script in a separate shell window from the benchmark test, since TF Serving will produce console output as it runs.

We call the tfserving.sh script to start a local serving instance pointing to our model, then call the inference_tfserving method using the inference_test.py script in a separate shell, to execute the benchmark.

$ ./tfserving.sh
$ python inference_test.py tfserving serving_default

We get the following results:

Total elapsed time: 189.821555 seconds
Batch size 100 repeated 1000 times
Average latency per batch: 0.189821555 seconds

The absolute results may be different for you, depending on the system you run this on. Nonetheless they should exhibit a very small degree of variability from run to run, no matter what system you run them on. We ran our benchmark on an instance created using the Deep Learning images for GCP, providing an extra degree of control over the environment. The sample code includes instructions showing how to launch such an instance on GCE.
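
For readers who want a feel for what such a benchmark does, here is a minimal sketch of timing requests against the TF Serving REST predict endpoint using only the standard library. It is not the inference_test.py script from the sample code. The model name `mnist` is an assumption, as is port 8501, the conventional REST port of the TF Serving container; adjust both to your own setup.

```python
import json
import time
import urllib.request


def build_predict_request(instances, model_name="mnist",
                          signature_name="serving_default",
                          host="localhost", port=8501):
    """Build a POST request for the TF Serving REST predict endpoint.
    `model_name`, `host` and `port` are assumptions about the local setup."""
    url = "http://{}:{}/v1/models/{}:predict".format(host, port, model_name)
    body = json.dumps({"signature_name": signature_name,
                       "instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})


def average_latency(request, repeats=1000):
    """Send the same request `repeats` times and return the average
    latency per request in seconds. Requires a running server."""
    start = time.perf_counter()
    for _ in range(repeats):
        with urllib.request.urlopen(request) as response:
            response.read()
    return (time.perf_counter() - start) / repeats


# One all-zero 28x28 image, repeated to form a batch of 100.
batch = [[[0.0] * 28 for _ in range(28)]] * 100
request = build_predict_request(batch)
print(request.full_url)  # http://localhost:8501/v1/models/mnist:predict
```

With a serving container running, `average_latency(request)` returns a figure comparable to the "average latency per batch" reported above.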

Optimizing the model

It’s time to perform some optimization on our model. The first step we must perform is to “freeze the weights” of the model, merging the weights stored in the separate variable files into the GraphDef. Why do this first? Well, we will be performing many operations on the graph, merging nodes, pruning nodes, and generally altering the graph structure. We need to perform those same operations on the variables associated with the graph, namely the weights. If the weights remain separate as we modify the graph, then the structure of the weights will no longer map to the structure of the model, and we will be unable to restore the weights for the model, making it useless for serving.

Freezing the graph is also a form of optimization. In a SavedModel the weights are represented as “Const” ops in the graph definition, and also as Variables in the checkpoint files. Combining the two removes the redundant storage of weights as Variables, and also obviates the requirement to load and merge the weights separately, which should result in a faster loading process. This operation will reduce the total size of the model by a modest amount. The freeze_graph tool also performs some pruning of the graph, as we will see below.

Freezing is quite useful when deploying models on mobile devices, where the reduction in model size, ability to download the model in a single file, and reduced loading time are all important. The TensorFlow Lite framework automatically handles freezing the graph for mobile models.

When freezing a graph you also need to specify the output nodes of the final frozen graph. Our model contains only one output node, ‘softmax’, but sometimes when you create models using the high-level APIs of TensorFlow, there may be multiple output nodes created. When you freeze a graph, output nodes other than the ones specified will be removed. For serving predictions we typically only need one set of outputs. Removing extra output nodes is an optimization that reduces the computation performed by the graph, thus reducing its size and increasing prediction speed.

To freeze the graph we use the freeze_graph tool in TensorFlow, which is a binary command line tool. In addition to freezing the weights, the freeze graph tool prunes the graph to include only nodes that are used to evaluate the output nodes that we specify.
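
Conceptually, freezing does two things: it replaces variables with constants holding the trained values, and it keeps only the nodes needed to compute the requested outputs. The toy sketch below illustrates both steps on a graph stored as a plain dictionary. The representation and the `freeze_and_prune` helper are our own simplification, not the freeze_graph tool or its API.

```python
def freeze_and_prune(graph, weights, output_names):
    """Toy version of freezing: turn Variable nodes into Const nodes that
    hold their trained values, then keep only nodes reachable from the
    requested outputs."""
    frozen = {}
    for name, node in graph.items():
        if node["op"] == "Variable":
            frozen[name] = {"op": "Const", "value": weights[name], "inputs": []}
        else:
            frozen[name] = dict(node)

    # Walk backwards from the outputs to find the nodes that are needed.
    keep, stack = set(), list(output_names)
    while stack:
        name = stack.pop()
        if name in keep:
            continue
        keep.add(name)
        stack.extend(frozen[name]["inputs"])
    return {name: node for name, node in frozen.items() if name in keep}


# Hypothetical graph: one prediction path plus training-only nodes.
graph = {
    "image": {"op": "Placeholder", "inputs": []},
    "w": {"op": "Variable", "inputs": []},
    "logits": {"op": "MatMul", "inputs": ["image", "w"]},
    "softmax": {"op": "Softmax", "inputs": ["logits"]},
    "global_step": {"op": "Variable", "inputs": []},
    "loss_summary": {"op": "ScalarSummary", "inputs": ["logits"]},
}
weights = {"w": [[0.1, 0.2]], "global_step": 50}

frozen = freeze_and_prune(graph, weights, ["softmax"])
print(sorted(frozen))     # ['image', 'logits', 'softmax', 'w']
print(frozen["w"]["op"])  # Const
```

The `global_step` and `loss_summary` nodes disappear because nothing on the path to `softmax` depends on them.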

As we mentioned above, we will actually call the tool using Python APIs:

We freeze our graph by calling the method above:

freeze_graph(saved_model_dir, "head/predictions/class_ids")

Let’s take a look at the graph after freezing. Running the describe_graph method we defined above on the graph:

frozen_filepath = os.path.join(frozen_model_dir, 'frozen_model.pb')

We get the following output:

models/mnist/cnn_classifier/export/1536079934/freezed_model.pb
Input Feature Nodes: [u'serving_input_image']
Unused Nodes: []
Output Nodes: [u'softmax/Softmax']
Quantization Nodes: []
Constant Count: 34
Variable Count: 0
Identity Count: 27
Total nodes: 94

Notice that the number of Variable nodes has dropped to 0, compared to 97 before the freeze operation. All of the Variables in the previous graph have been replaced with Constants that store the value of the weights.

The total size of the model has decreased slightly:

get_size(saved_model_dir, 'frozen_model.pb')

models/mnist/cnn_classifier/export/1536079934/frozen_model.pb
Model Size: 10702.063 KB

Optimizing the Graph

Now that we have a frozen graph, we can perform the other optimizations we listed above. Most of them can be performed in a single step, using the Graph Transform Tool. This is a command-line tool that is included in pre-built binaries for TensorFlow, and can also be built from source. We will call the tool using Python APIs. The code below shows how:

We call the code on our model by passing a list of the desired optimizations, like so:

transforms = [
optimize_graph(saved_model_dir, "frozen_model.pb", transforms, 'head/predictions/class_ids')

The first three optimizations in the list fall into the “pruning” category above. They clean unused or duplicate nodes from the graph. The ‘strip_unused_nodes’ transform is basically identical to the one performed by the freeze graph tool, removing all nodes that are not connected to the output. So this operation is technically redundant in the current workflow. We include it here just to make the point that the Graph Transform Tool also has the capability to perform this optimization. The fourth and fifth optimizations are the ‘constant folding’ and ‘folding batch norms’ ones mentioned above.
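
A transform list consistent with this description might look like the following sketch. The specific entries and their order are our assumption, chosen from the transforms documented for the Graph Transform Tool to match the categories described here. The helper `transform_graph_def` is a hypothetical name; it assumes a TensorFlow 1.x installation and imports TensorFlow lazily, so the list itself can be inspected without it.

```python
# Assumed transform list: three pruning transforms, then constant
# folding, then batch norm folding.
TRANSFORMS = [
    "remove_nodes(op=Identity)",           # pruning: drop identity nodes
    "merge_duplicate_nodes",               # pruning: merge duplicates
    "strip_unused_nodes",                  # pruning: drop unreachable nodes
    "fold_constants(ignore_errors=true)",  # constant folding
    "fold_batch_norms",                    # folding batch norms
]


def transform_graph_def(graph_def, input_names, output_names,
                        transforms=TRANSFORMS):
    """Apply the transforms to a frozen GraphDef and return the result.
    Assumes TensorFlow 1.x, where the Python wrapper of the Graph
    Transform Tool lives in tensorflow.tools.graph_transforms."""
    from tensorflow.tools.graph_transforms import TransformGraph
    return TransformGraph(graph_def, input_names, output_names, transforms)


for position, transform in enumerate(TRANSFORMS, start=1):
    print(position, transform)
```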

The Graph Transform Tool includes several other interesting transforms, including Quantization, which we will discuss later. You can even write your own transforms and plug them into the tool. See the online documentation for the tool for more details.

If you are using batch normalization in your model, you can always gain performance by running the ‘fold_batch_norms’ transform. If you are not using batch normalization in your deep neural network or convolutional neural network model, you probably should be. It will nearly always increase training speed and accuracy. And as of TensorFlow 1.10, it is built into the “Canned Estimators” for DNNs.
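
To see why folding batch norms is safe at inference time, note that batch normalization with fixed statistics is just a scale and a shift, which can be absorbed into the weights and bias of the preceding layer. The sketch below checks this numerically for a single scalar weight; in a real layer the same arithmetic applies per output channel. The function names and parameter values are ours, for illustration.

```python
import math


def layer_then_bn(x, w, b, gamma, beta, mean, variance, eps=1e-3):
    """A linear layer followed by batch normalization at inference time."""
    z = w * x + b
    return gamma * (z - mean) / math.sqrt(variance + eps) + beta


def fold_batch_norm(w, b, gamma, beta, mean, variance, eps=1e-3):
    """Absorb the batch norm scale and shift into the layer's weight and bias."""
    scale = gamma / math.sqrt(variance + eps)
    return w * scale, (b - mean) * scale + beta


params = dict(gamma=1.5, beta=0.1, mean=0.4, variance=0.25)
w_folded, b_folded = fold_batch_norm(w=2.0, b=0.5, **params)

x = 3.0
unfolded = layer_then_bn(x, 2.0, 0.5, **params)
folded = w_folded * x + b_folded
print(math.isclose(unfolded, folded))  # True
```

The folded layer performs one multiplication and one addition per weight, where the original performed the layer computation plus a separate normalization step.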

Let’s inspect the graph after optimization. Run the same describe_graph method we used above:

optimized_filepath = os.path.join(saved_model_dir, 'optimized_model.pb')

We get the following output:

models/mnist/cnn_classifier/export/1536341328/optimized_model.pb
Input Feature Nodes: [u'input_image']
Unused Nodes: []
Output Nodes: [u'head/predictions/class_ids']
Quantization Nodes: []
Constant Count: 29
Variable Count: 0
Identity Count: 0
Total nodes: 62

The optimized graph is even smaller than the frozen graph, with all identity nodes removed. If you’re interested in more detail on the optimizations and the resulting graph, try passing the show_nodes=True parameter to the describe_graph method.

The size of the optimized graph is just slightly smaller than the frozen graph:

get_size(saved_model_dir, 'optimized_model.pb')

models/mnist/cnn_classifier/export/1537571371/optimized_model.pb
Model size: 10698.921 KB

Compared to the original graph, there has been a large reduction in nodes in the optimized graph. All the unused nodes have been removed, there is only one output node, and the number of variables and constants is significantly reduced. Yet the functionality of this graph should be exactly equivalent to the original SavedModel graph!

Of course this is a statement that we should “trust, but verify.” You should always test an optimized model on your original test set, to verify that the accuracy and other metrics are not degraded. None of the optimizations we have performed to this point should change the output of the model for any given prediction input.
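
A verification step of this kind can be as simple as comparing the two models' outputs on the same inputs. The sketch below assumes predictions are available as lists of probability vectors; the helper names, tolerance and sample values are ours, for illustration.

```python
import math


def outputs_match(original, optimized, abs_tol=1e-5):
    """True if every pair of prediction vectors agrees element-wise
    within `abs_tol`."""
    if len(original) != len(optimized):
        return False
    for a, b in zip(original, optimized):
        if len(a) != len(b):
            return False
        if not all(math.isclose(x, y, abs_tol=abs_tol) for x, y in zip(a, b)):
            return False
    return True


def predicted_classes(predictions):
    """Index of the largest probability in each prediction vector."""
    return [max(range(len(p)), key=p.__getitem__) for p in predictions]


# Hypothetical outputs of the original and optimized models.
original = [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]]
optimized = [[0.1000001, 0.6999999, 0.2], [0.8, 0.1, 0.1]]
print(outputs_match(original, optimized))                           # True
print(predicted_classes(original) == predicted_classes(optimized))  # True
```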


Quantization

Quantization is a technique that can both reduce the size of a TensorFlow model and increase the speed of inference. With quantization you can achieve dramatic speedups in inference, particularly if you can harness the specific capabilities of acceleration hardware designed for quantized models. See this post for an example. You can also reduce your model size by a factor of 2, 4, or potentially even 8, by reducing the precision of the weights used in serving from 64 or 32 bits to 16 or 8 bits. As mentioned earlier, this is too large a topic to treat in any depth here. It will suffice to point out that the Graph Transform Tool can be used to quantize your models. Here is how we could use the tool to quantize our current model, for example:

transforms = [
optimize_graph(saved_model_dir, None, transforms, 'head/predictions/class_ids')

The ‘quantize_weights’ transform compresses the existing weights in the model to 8 bit, followed by a decompression op which converts the single byte weights back to floats. This results in a large reduction in model size, but no corresponding speedup, since calculations are still being performed in floating point. The more complex ‘quantize_nodes’ optimization actually converts all the calculations performed with weights into 8-bits, with conversions from floating point before and after each computation. This can speed inference quite a lot.
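
The compress-then-decompress idea behind weight quantization can be illustrated with a toy linear scheme: map each float to one of 256 levels between the minimum and maximum weight, store the level as a single byte, and map it back to a float when needed. This min/max scheme is our assumption for illustration; it is not necessarily the exact scheme the Graph Transform Tool uses.

```python
def quantize(weights):
    """Map floats to single bytes using a linear min/max scheme
    (an illustrative assumption, not the tool's implementation)."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255.0 or 1.0  # avoid a zero scale if all weights are equal
    codes = bytes(round((w - lo) / scale) for w in weights)
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Convert single-byte codes back to approximate float weights."""
    return [lo + code * scale for code in codes]


weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
codes, lo, scale = quantize(weights)
restored = dequantize(codes, lo, scale)

max_error = max(abs(w, ) if False else abs(w - r) for w, r in zip(weights, restored))
print(len(codes))              # 5 (one byte per weight)
print(max_error <= scale / 2)  # True
```

Each weight now occupies one byte instead of four for a 32-bit float, at the cost of a rounding error of at most half a quantization step.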

Note that both transforms can have an effect on model accuracy due to the reduction in weight precision. Of course the best practice of testing model performance before and after optimization applies even more here.

Again, if you are working with models deployed to mobile, check out TensorFlow Lite, which is specifically tailored to perform quantization for mobile.

Converting the Optimized Graph back to SavedModel

As the next step, we will convert our frozen, optimized GraphDef back to a SavedModel. Basically the conversion process consists of adding back the MetaGraphDef information, to specify the inputs and outputs for the model. This turns out to be pretty easy, using the simple_save method of the saved_model module in TensorFlow. This method generates a default MetaGraphDef for the graph.

We do have to pass the inputs and outputs of the model to the simple_save method. The outputs are the same ones we specified when optimizing the graph, and the inputs can be found easily in a generic way by looking for Placeholder nodes in the graph. We call the method like this, passing a new directory to contain our SavedModel:
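
Finding the inputs generically amounts to filtering the graph's nodes by op type. The sketch below works on any sequence of objects with `name` and `op` attributes, which is how the nodes of a GraphDef present themselves; `SimpleNamespace` objects stand in for real nodes so the example runs without TensorFlow. The helper names are ours.

```python
from types import SimpleNamespace


def find_input_names(nodes):
    """Names of all Placeholder nodes, which act as the graph's inputs."""
    return [node.name for node in nodes if node.op == "Placeholder"]


def tensor_name(node_name, index=0):
    """Name of a node's output tensor, e.g. 'serving_input_image:0'."""
    return "{}:{}".format(node_name, index)


# Hypothetical nodes, standing in for the nodes of the optimized graph.
nodes = [
    SimpleNamespace(name="serving_input_image", op="Placeholder"),
    SimpleNamespace(name="conv2d/kernel", op="Const"),
    SimpleNamespace(name="softmax/Softmax", op="Softmax"),
]
inputs = {name: tensor_name(name) for name in find_input_names(nodes)}
print(inputs)  # {'serving_input_image': 'serving_input_image:0'}
```

Because the input key is derived from the Placeholder node name, the converted SavedModel exposes its input under that name rather than under the alias defined at export time.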

optimized_export_dir = os.path.join(export_dir, 'optimized')
optimized_filepath = os.path.join(saved_model_dir, 'optimized_model.pb')
convert_graph_def_to_saved_model(optimized_export_dir, optimized_filepath)

Now that we have a SavedModel again, we can output the MetaGraphDef using saved_model_cli:

$ saved_models_base=models/mnist/cnn_classifier/export
$ optimized_model_dir=${saved_models_base}/optimized
$ saved_model_cli show --dir=${optimized_model_dir} --all

The output is:

MetaGraphDef with tag-set: 'serve' contains the following SignatureDefs:
signature_def['serving_default']:
The given SavedModel SignatureDef contains the following input(s):
inputs['serving_input_image'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 28, 28)
name: serving_input_image:0
The given SavedModel SignatureDef contains the following output(s):
outputs['softmax'] tensor_info:
dtype: DT_FLOAT
shape: (-1, 10)
name: softmax/Softmax:0
Method name is: tensorflow/serving/predict

Benchmarking the Optimized Model

One final step. Let’s quantify the effect of all our optimizations on inference speed, using the TF Serving-based method we discussed above. We need to relaunch the TF Serving container to point to our newly optimized SavedModel. If you pass an argument to the tfserving.sh script, it will append that folder name to the model export path and launch a TF Serving instance that points to the corresponding model:

$ docker kill $(docker ps -q)
$ ./tfserving.sh optimized

Rerunning the benchmark on the optimized model,

$ python inference_test.py tfserving serving_default

we get the following results:

Total elapsed time: 162.434886 seconds
Batch size 100 repeated 1000 times
Average latency per batch: 0.162434886 seconds

This represents an increase in prediction speed of about 17% over the original model (equivalently, a reduction of about 14% in average latency per batch). Not a bad return for just running a few scripts. That is the kind of improvement in speed and reduction in cost that can make a meaningful difference in the ROI of your ML application.
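
The arithmetic behind these figures, using the two elapsed times reported above, is shown below. Note that a speed increase and a latency reduction are different ratios of the same two numbers.

```python
baseline_seconds = 189.821555   # total elapsed time, original model
optimized_seconds = 162.434886  # total elapsed time, optimized model

# Speed is work per unit time, so the speedup is the ratio of the times.
speed_increase = baseline_seconds / optimized_seconds - 1
latency_reduction = 1 - optimized_seconds / baseline_seconds

print("Speed increase: {:.1%}".format(speed_increase))        # Speed increase: 16.9%
print("Latency reduction: {:.1%}".format(latency_reduction))  # Latency reduction: 14.4%
```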

Factors Affecting the Benefits of Optimization

Note there are many variables that can influence the magnitude of the performance difference we see, including the batch size of predictions. Many online prediction applications use a batch size of 1, where single examples are uploaded to the API at a time. This tends to reduce the benefit of any model optimizations. Model complexity also has an impact. With larger, more complex models you may see even more increases in performance as a result of these optimizations. More detailed analysis of all these factors will have to wait for a future post.

Wrapping Up

We hope this has been a helpful introduction to the important topic of optimizing TensorFlow models for serving predictions. Along the way we’ve learned something about the underlying representation of TensorFlow graphs, the different model export formats, and benchmarking methodology. Armed with the techniques we’ve explored, you can increase the efficiency and utility of your ML production pipelines. Thanks for reading, and happy serving!


