How Zendesk Serves TensorFlow Models in Production

How We Started With TensorFlow

At Zendesk we are developing a series of machine learning products, the most recent of which is Answer Bot. It uses machine learning to interpret user questions and responds with relevant knowledge base articles. When a customer has a question, complaint or enquiry, they may submit their request online. Once the request is received, Answer Bot analyses it and emails the customer suggestions of the articles most likely to help.

Answer Bot uses a class of state-of-the-art machine learning algorithms known as deep learning to identify relevant articles. We use Google’s open-sourced deep learning library, TensorFlow, to build these models, leveraging Graphics Processing Units (GPUs) to accelerate the process. Answer Bot is our first data product at Zendesk to use TensorFlow. After a heavy investment of blood, sweat, and tears by our Data Scientists, we have TensorFlow models that work pretty well for Answer Bot. Hooray!

But creating the models is only part of the problem. Our next challenge is finding a way to serve the models in production. The model serving system will be subjected to a large volume of traffic, so it is important for us to ensure that the software and hardware infrastructure serving these models is scalable, reliable and fault tolerant. Let’s talk about our experience in productionizing the serving of TensorFlow models.

Oh, and by the way, this is our team: Zendesk’s machine learning data team. Our team consists of a bunch of Data Scientists, Data Engineers, a Product Manager, a UX/Product Designer and a Test Engineer.

Data Team At Zendesk Melbourne

Serving TensorFlow Models

After a series of conversations between Data Scientists and Data Engineers, we arrived at these key requirements:

  • Low latency at prediction time
  • Horizontally scalable
  • Fits into our micro-services architecture
  • Ability to A/B test different versions of the model
  • Compatible with newer releases of TensorFlow
  • Supports other TensorFlow models to allow for future data products

TensorFlow Serving

After some digging around on the Internet, Google’s TensorFlow Serving emerged as our top choice for model serving. TensorFlow Serving is a serving system for machine learning models, written in C++. An out-of-the-box TensorFlow Serving installation supports:

  • Serving of TensorFlow models
  • Scanning and loading of TensorFlow models from the local file system

TensorFlow Serving treats each model as a servable object. It periodically scans the local file system, loading and unloading models based on the state of the file system and the model versioning policy. This allows trained models to be hot-deployed easily by copying the exported models to the specified file path while TensorFlow Serving keeps running.
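To make the directory convention concrete, here is a minimal Python sketch of how a poller might discover versions under a model base path. It is only an illustration: TensorFlow Serving itself is written in C++, and the function names below are hypothetical. It assumes each version lives in a sub-directory whose name is a number, as in the export layout described later in this post.

```python
from pathlib import Path
from typing import List, Optional


def list_versions(model_base_path: str) -> List[int]:
    """Return the numeric version sub-directories found under the base path."""
    base = Path(model_base_path)
    return sorted(
        int(child.name)
        for child in base.iterdir()
        # Only directories with purely numeric names count as model versions.
        if child.is_dir() and child.name.isdigit()
    )


def latest_version(model_base_path: str) -> Optional[int]:
    """Return the highest version number, or None if nothing is deployed."""
    versions = list_versions(model_base_path)
    return versions[-1] if versions else None
```

Calling `latest_version("/work/awesome_model_directory")` on a directory containing `0000001` and `0000002` would return `2`, which mirrors the default "serve the latest version" behaviour.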

TensorFlow Serving Architecture

According to the benchmark reported in this Google blog post, TensorFlow Serving handled approximately 100,000 queries per second, excluding TensorFlow prediction processing time and network request time.

More information on the TensorFlow Serving architecture is available in the TensorFlow Serving documentation.

Communication Protocol (gRPC)

TensorFlow Serving exposes a gRPC interface for invoking predictions from the models. gRPC is an open-source, high-performance remote procedure call (RPC) framework that runs on HTTP/2. A couple of interesting enhancements in HTTP/2 compared to HTTP/1.1 include its support for request multiplexing, bidirectional streaming and a binary rather than textual transport.

By default, gRPC uses Protocol Buffers (Protobuf) as its message interchange format. Protocol Buffers are Google’s open-source mechanism for serializing structured data in an efficient binary format. The format is strongly typed, which makes it less error prone. The data structure is specified in a .proto file that can then be compiled into gRPC request classes in a variety of languages including Python, Java and C++. This was my first time working with gRPC, and I was curious to see how it performs compared to other API architectures such as REST.

Model Training and Serving Architecture

We decided to separate the training and serving of the deep learning models as two pipelines. The diagram below provides an overview of our model training and serving architecture:

Model Training and Serving Architecture

Model Training Pipeline

Steps in model training are:

  • our training features are generated from data already provided in Hadoop.
  • the generated training features are stored in AWS S3.
  • TensorFlow models are then trained on GPU instances in AWS using batches of training samples from S3.

Once the model is built and validated, it is published to the model repository in S3.

Model Serving Pipeline

The validated models are served in production by shipping the models from the model repository to the TensorFlow Serving instances.


We run TensorFlow Serving on AWS EC2 instances. Consul is set up in front of the instances for service discovery and for distributing the traffic. Clients connect to the first available IP address returned by the DNS lookup. Alternatively, Elastic Load Balancing could be used for more advanced load balancing. As prediction with TensorFlow models is inherently a stateless operation, we can achieve horizontal scalability by spinning up more EC2 instances.
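The "connect to the first available IP" behaviour can be sketched in Python with the standard library alone. This is an illustration under stated assumptions rather than our client code: the function names are hypothetical, and it simply tries each address that DNS returns, in order, until one accepts a TCP connection.

```python
import socket
from typing import List, Tuple


def resolve_addresses(hostname: str, port: int) -> List[str]:
    """Return the IP addresses DNS reports for hostname, in the order returned."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    addresses: List[str] = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]
        if ip not in addresses:  # drop duplicates, keep DNS order
            addresses.append(ip)
    return addresses


def first_reachable(hostname: str, port: int, timeout: float = 1.0) -> Tuple[str, int]:
    """Return (ip, port) of the first address that accepts a TCP connection."""
    for ip in resolve_addresses(hostname, port):
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                return ip, port
        except OSError:
            continue  # this instance is unreachable, try the next address
    raise ConnectionError(f"no reachable instance for {hostname}:{port}")
```

With Consul's DNS interface, the hostname would be the service's DNS name, for example something like `tensorflow-serving.service.consul` (a made-up service name used here for illustration).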

Another option is to use Cloud ML, offered by Google Cloud Platform, which provides serving of TensorFlow models as a fully managed service. However, the Cloud ML service was in its alpha phase when we rolled out TensorFlow Serving around September 2016 and lacked functionality required for production usage. Therefore we opted to host on our own AWS EC2 instances, which gives us more fine-grained control and predictable resource capacity.

Implementation of Model Serving

When we first deployed TensorFlow Serving in 2016, there wasn’t an option to download a prebuilt binary directly and we had to compile from source. With the recent release of TensorFlow Serving v1.0, you can install TF Serving directly via apt-get install if you’re using Ubuntu. See the following Google Developers blogpost for more information.

Here are the steps we took to get TensorFlow Serving deployed and running:

1. Compile TensorFlow Serving from source

First, we need to compile the source to produce the executable binary. The binary can then be executed from the command line to start up the serving system.

Assuming that you have Docker set up, a good starting point for compiling the binary is to use the provided Dockerfile. Follow the steps below:

  • Run the code in this gist to build a Docker container suitable for compiling TensorFlow Serving.
  • Run the code in this gist inside the running Docker container to build the executable binary.
  • Once the compilation finishes, the executable binary will be available in the following path in your docker image: /work/serving/bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server

2. Running Up Model Serving System

The executable binary (tensorflow_model_server) can be deployed to your production instances. You can also run TensorFlow Serving in a Docker container if you are using a Docker orchestration framework such as Kubernetes or Elastic Container Service.

Let’s assume that the TensorFlow models are stored on the production hosts in the directory /work/awesome_model_directory. You can run up TensorFlow Serving on port 8999 with your TensorFlow models with the following command:

<path_to_the_binary>/tensorflow_model_server --port=8999 --model_base_path=/work/awesome_model_directory

By default, TensorFlow Serving scans the model base path every second; this interval is customisable. The optional configurations that are available as command line arguments are listed here.
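If you start the server from a deployment script, the command can be assembled programmatically. The following Python sketch is one way to do that, not part of TensorFlow Serving: the helper names are hypothetical, and `--file_system_poll_wait_seconds` is assumed to be the flag that controls the scan interval in your build (check the flag list of the binary you compiled).

```python
import subprocess
from typing import List, Optional


def build_command(binary: str, port: int, model_base_path: str,
                  poll_wait_seconds: Optional[int] = None) -> List[str]:
    """Assemble the tensorflow_model_server command line as an argument list."""
    command = [binary, f"--port={port}", f"--model_base_path={model_base_path}"]
    if poll_wait_seconds is not None:
        # Assumed flag name for the file system scan interval.
        command.append(f"--file_system_poll_wait_seconds={poll_wait_seconds}")
    return command


def start_server(command: List[str]) -> subprocess.Popen:
    """Start the model server as a child process and return its handle."""
    return subprocess.Popen(command)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when the model path contains spaces.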

3. Generate Python gRPC Stubs from Service Definitions

The next step is to create a gRPC client that can invoke predictions on the model server. The client needs Python stubs generated from the service definitions. You can install pre-generated stubs from pip using the following command:

pip install tensorflow-serving-api

4. Calling the service from remote host

A Python client can be created to invoke gRPC calls on the server using the compiled definitions. See this example, which shows a Python client that calls TensorFlow Serving synchronously.
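As a rough illustration, a synchronous call might look like the sketch below. It assumes that the `grpcio`, `tensorflow` and `tensorflow-serving-api` packages are installed and that your version of the stubs exposes `prediction_service_pb2_grpc` (older releases used a different stub factory). The function name, the input name and the example values are hypothetical and depend on your model's signature.

```python
def predict(host, port, model_name, input_name, values, version=None, timeout_s=5.0):
    """Send one synchronous Predict request and return the PredictResponse."""
    # Imported lazily so this module loads even where the serving packages
    # are not installed; they are only needed when a request is sent.
    import grpc
    import tensorflow as tf
    from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

    with grpc.insecure_channel(f"{host}:{port}") as channel:
        stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

        request = predict_pb2.PredictRequest()
        request.model_spec.name = model_name
        if version is not None:
            # Optionally pin the request to one specific model version.
            request.model_spec.version.value = version
        # Convert the Python values into the TensorProto the server expects.
        request.inputs[input_name].CopyFrom(tf.make_tensor_proto(values))

        # Blocks until the server answers or the timeout expires.
        return stub.Predict(request, timeout=timeout_s)
```

A call could then look like `predict("localhost", 8999, "default", "inputs", [[1.0, 2.0]])`, where `"default"` is assumed to be the model name the server uses when none is given on the command line and `"inputs"` stands in for your model's real input name.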

TensorFlow Serving also supports batching of predictions for performance optimisation purposes. To enable this, run up the tensorflow_model_server with the --enable_batching flag turned on. See this example of an asynchronous client.

Loading Models from Other Storage

What if your models are not stored in the local file system, and you would like TensorFlow Serving to read directly from an external storage system such as AWS S3 or Google Storage?

If that’s the case, you will need to extend TensorFlow Serving to read from those sources via a Custom Source. Out of the box, TensorFlow Serving only supports loading models from the file system.

Lessons learned

We have been using TensorFlow Serving in production for approximately half a year, and our experience with it has been quite smooth. It has good prediction latency. Below is a graph of the 95th percentile of prediction time in seconds for our production TensorFlow Serving instances across a week (approximately 20 milliseconds):

Nevertheless, we learnt a few lessons along the journey of productionising TensorFlow Serving.

1. Joy and Pain of Model Versioning

We have had a few different versions of the TensorFlow models in production thus far, each with varying characteristics such as network architecture, training data etc. Gracefully handling the different versions of the model has been a non-trivial task. This is because the input request passed to TensorFlow Serving often involves a number of pre-processing steps, and these pre-processing steps can vary between the TensorFlow model versions. A mismatch between the pre-processing steps and the model version could potentially result in erroneous predictions.

1a. Be explicit about the version you’re after

A simple but useful way that we found to prevent erroneous predictions is to set the version attribute specified in the model.proto definition, which is optional (which compiled to ...). This guarantees that your request payload is always matched with the expected model version.

When you request a given version from the client (for example, version 5) and the TensorFlow Serving server is not serving that particular version, it returns an error message indicating that the model is not found.
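One way to keep pre-processing and model version in step on the client side is to look both up from the same key. The sketch below is a hypothetical illustration in plain Python: the version numbers, the two pre-processing functions and the dictionary-shaped "request" are all made up, and a real client would copy the version into the request's model specification instead.

```python
from typing import Callable, Dict, List

Preprocessor = Callable[[str], List[str]]


def preprocess_v4(text: str) -> List[str]:
    """Hypothetical pre-processing for model version 4: lowercase and split."""
    return text.lower().split()


def preprocess_v5(text: str) -> List[str]:
    """Hypothetical pre-processing for model version 5: also strip punctuation."""
    return [word.strip(".,!?") for word in text.lower().split()]


# Each model version is registered together with the steps it was trained with.
PREPROCESSORS: Dict[int, Preprocessor] = {4: preprocess_v4, 5: preprocess_v5}


def prepare_request(text: str, version: int) -> dict:
    """Return the explicit version plus features prepared for exactly that version."""
    if version not in PREPROCESSORS:
        raise KeyError(f"no pre-processing registered for model version {version}")
    return {"version": version, "features": PREPROCESSORS[version](text)}
```

Because the version travels with the features, a server that no longer serves that version answers with a "model not found" error rather than silently running a different model on mismatched input.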

1b. Serving up multiple model versions

The default behavior of TensorFlow Serving is to load and serve the latest version of the model.

When we first implemented TensorFlow Serving in September 2016, it did not support serving multiple versions of the model simultaneously. This meant that only one version of the model was served at a given time. This was not sufficient for our use case, as we wanted to serve multiple versions of the model to support A/B testing of different neural network architectures.

One of the options would be to run up multiple TensorFlow Serving processes on different hosts or ports such that each process serves up a different model version. This setup requires either:

  • the consumer applications (gRPC client) to contain switching logic and knowledge of which instance of TensorFlow Serving to call for a given version. This adds complexity to the clients and was not preferred.
  • a registry which maps the version to different instances of TensorFlow Serving.

A better solution is for TensorFlow Serving itself to serve multiple versions of the model.

I decided to use one of my lab days to extend TensorFlow Serving to serve multiple versions of a model. At Zendesk, we have the concept of a “lab day”: one day in every two weeks that we can spend working on something we are interested in, be it tools that could improve our day-to-day productivity or a new technology that we are keen to learn. It had been more than eight years since I last worked on C++ code. However, I was impressed by how readable and clean the TensorFlow Serving codebase is, which makes it easy to extend. The enhancements to support multiple versions were submitted and have since been merged into the main codebase. The TensorFlow Serving maintainers are quite prompt in providing feedback on patches and enhancements. From the latest master branch, you can start TensorFlow Serving with the extra model_version_policy flag to serve multiple model versions:

/work/serving/bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server --port=8999 --model_base_path=/work/awesome_model_directory --model_version_policy=ALL_VERSIONS

An important point to note is that serving multiple model versions comes with a trade-off: higher memory usage. Therefore, when running with the above flag, remember to remove obsolete model versions from the model base path.
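Cleaning up obsolete versions can be automated. The following Python sketch keeps only the newest few version directories. It is an assumption about how such housekeeping could be done, not a feature of TensorFlow Serving, and the function name and the default of keeping two versions are arbitrary choices.

```python
import shutil
from pathlib import Path
from typing import List


def prune_old_versions(model_base_path: str, keep: int = 2) -> List[int]:
    """Delete all but the `keep` highest-numbered version directories.

    Returns the version numbers that were removed.
    """
    if keep < 1:
        raise ValueError("keep must be at least 1")
    base = Path(model_base_path)
    versions = sorted(
        (child for child in base.iterdir() if child.is_dir() and child.name.isdigit()),
        key=lambda child: int(child.name),
    )
    removed = []
    # Everything except the last `keep` entries is considered obsolete.
    for child in versions[:-keep]:
        shutil.rmtree(child)
        removed.append(int(child.name))
    return removed
```

Sorting by the integer value rather than the directory name keeps the order correct even if version directories are not zero-padded to the same width.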

2. Compression is Your Friend

When you are deploying a new model version, we recommend compressing the exported TensorFlow model files into a single compressed file before copying it into the model_base_path. The TensorFlow Serving tutorial contains steps to export a trained TensorFlow model. The exported TensorFlow model directory generally has the following structure:

A parent directory named after the version number (for example, 0000001) that contains the following:

  • saved_model.pb: the serialized model, which includes the graph definition(s) of the model as well as model metadata such as signatures.
  • variables: files that hold the serialized variables of the graphs.

To compress the exported model:

tar -cvzf modelv1.tar.gz 0000001

Why compress it?

  1. It’s faster to transfer or copy around
  2. If you copy the exported model folder directly into the model_base_path, the copy may take a while, and you could end up with the export files copied but the corresponding meta file not yet copied. If TensorFlow Serving starts loading your model and cannot find the meta file, the server will fail to load the model and will not try to load that particular version again.
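A related way to avoid half-copied versions is to unpack the archive in a staging directory and only then move the finished version directory into the model base path. The Python sketch below shows this idea. It is an assumption on our part rather than something TensorFlow Serving prescribes: the function names are hypothetical, the archive is assumed to be trusted and to contain exactly one top-level version directory, and the staging directory must be on the same file system as the model base path for the rename to happen in one step.

```python
import os
import tarfile
import tempfile


def pack_version(version_dir: str, archive_path: str) -> None:
    """Write version_dir (e.g. .../0000001) into a gzip-compressed tar archive."""
    version_name = os.path.basename(os.path.normpath(version_dir))
    with tarfile.open(archive_path, "w:gz") as archive:
        archive.add(version_dir, arcname=version_name)


def deploy_archive(archive_path: str, model_base_path: str, staging_root: str) -> str:
    """Unpack into a staging directory, then move the finished version into place."""
    staging = tempfile.mkdtemp(dir=staging_root)
    with tarfile.open(archive_path, "r:gz") as archive:
        archive.extractall(staging)  # only do this with archives you trust
    # The archive is assumed to hold exactly one top-level version directory.
    (version_name,) = os.listdir(staging)
    target = os.path.join(model_base_path, version_name)
    # The version directory appears in model_base_path fully populated.
    os.rename(os.path.join(staging, version_name), target)
    os.rmdir(staging)
    return target
```

With this approach the server never sees a version directory that is still being written, because the directory only shows up under the model base path once all of its files exist.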

3. Model size matters

The TensorFlow models that we have are fairly large, between 300 MB and 1.2 GB. We noticed that when the model size exceeded 64 MB, we got an error while trying to serve the model. This is due to a hardcoded 64 MB limit on the protobuf message size, as described in the following TensorFlow Serving GitHub issue.

As a result, we applied the patch described in the GitHub issue to change the hardcoded constant value. Yuck... :( This is still a mystery to us. Let us know if you manage to find an alternative way of serving models larger than 64 MB without changing the hardcoded source.

4. Avoid the Source Moving Underneath You

We built the TensorFlow Serving source from the master branch because, at the time of implementing, the latest release branch (v0.4) lagged behind master in terms of functionality and bug fixes. However, if you build by simply checking out master, the source may change beneath you whenever new changes are merged. To ensure repeatable builds of the artefact, we find it important to check out specific commit revisions for:

  • TensorFlow Serving and
  • TensorFlow (Git submodule within TensorFlow Serving)

Wish List for Future Enhancements

Here are some of the features that we would be very interested to see made available in TensorFlow Serving:

  • health check service methods
  • ability to support multiple model types in a TensorFlow Serving instance
  • out-of-the-box loading of models from distributed storage such as AWS S3 and Google Storage
  • out-of-the-box support for models greater than 64 MB in size
  • sample Python clients that do not depend on TensorFlow

Thanks to the following friends and colleagues for taking the time to review this blogpost: Soon Ee Cheah, Jeffrey Theobald, Bob Raman, Chris Hausler, Chris Holman, Ryan Seddon and Adel Smee.

Useful References