TensorFlow Serving 101 pt. 1

Part 1: Saving and serving your model

Stian Lind Petlund


My goal with this tutorial is to explain, as straightforwardly as possible, how you can:

  • Save a TensorFlow model so it can be loaded with TensorFlow Serving ModelServer and used in production.
  • Serve your model with the TensorFlow Serving ModelServer.
  • Send requests to your model (and get responses).

To get the most out of this tutorial you should be familiar with Python and have written a TensorFlow model before. This is not an introduction to TensorFlow itself, but to the serving system TensorFlow Serving.

I found it convenient to split the tutorial into two parts. In Part 1 you will learn how to save a simple model and serve it with TensorFlow Serving ModelServer. In Part 2 we will send requests to the model using a Python client.


Many guides and blog posts have been written about how to save models for production and serve them using TensorFlow Serving. So, why would I write a tutorial when there are numerous guides and examples already written?

Because I have (and I believe others have too) spent countless hours reading tutorials, documentation, examples and blog posts on how to deploy TensorFlow models for production. Although it’s fairly easy to get your models up and running with TensorFlow Serving, there are many ways to do it and a lot of concepts to digest. These two (many ways to do stuff + new concepts) usually lead down the following paths:

  • You spend way too much time exploring all the options and learning new concepts instead of just building what you were supposed to build in the first place.
  • You copy some stuff you don’t really understand, but it works. When it eventually breaks or you want to build a new feature, you spend all your time figuring out how it works (again).

I have been doing a combination of these two for almost a year now. So here’s a straightforward way to save, serve and send requests to your model. This is how we do it at Epigram. We found a way that works for us, but keep in mind that TensorFlow Serving is a flexible serving system, and the way I see it, there’s no right or wrong in this space.

Alright, let’s get started. All code for part 1 of the tutorial can be found here: https://github.com/epigramai/tfserving-simple-example.

1. Write a simple model

In the snippet below I have written a simple model. Most of you have probably never written a TensorFlow model as primitive as this one (and why would you, it’s LSTMs and convolutional neural networks that rock!). But all we need to save our model for production is a simple graph like the one below, so I am not going to make it more complicated than necessary.

Open export_simple_model.py if you want to get the full picture.

import tensorflow as tf  # TensorFlow 1.x API (placeholders and sessions)

placeholder_name = 'a'
operation_name = 'add'

a = tf.placeholder(tf.int32, name=placeholder_name)
b = tf.constant(10)

# This is our model
add = tf.add(a, b, name=operation_name)

with tf.Session() as sess:

    # Run a few operations to make sure our model works
    ten_plus_two = sess.run(add, feed_dict={a: 2})
    print('10 + 2 = {}'.format(ten_plus_two))

    ten_plus_ten = sess.run(add, feed_dict={a: 10})
    print('10 + 10 = {}'.format(ten_plus_ten))

If you’re not completely new to TensorFlow, the code above should look more or less like the most primitive model you’ve ever written. You feed the model with a number, the model adds that number to 10 and returns the sum.

2. Save the model

In the example above we created a placeholder, a constant and an operation on the default graph. Then we started a session and ran the add operation. We could wrap this code in an API endpoint written in a Python framework like Flask, Falcon or similar, and voilà, we have an API. But there are some really good reasons you don’t want to do it that way:

  • If your model(s) are complex and run slowly on CPU, you would want to run your models on more accelerated hardware (like GPUs). Your API-microservice(s), on the other hand, usually run fine on CPU and they’re often running in “everything agnostic” Docker containers. In that case you may want to keep those two kinds of services on different hardware.
  • If you start messing up your neat Docker images with heavy TensorFlow models, they grow in every possible direction (CPU usage, memory usage, container image size, and so on). You don’t want that.
  • Let’s say your service uses multiple models written in different versions of TensorFlow. Using all those TensorFlow versions in your Python API at the same time is going to be a total mess.
  • You could of course wrap one model into one API. Then you would have one service per model and you can run different services on different hardware. Perfect! Except, this is what TensorFlow Serving ModelServer is doing for you. So don’t go wrap an API around your Python code (where you’ve probably imported the entire tf library, tf.contrib, opencv, pandas, numpy, …). TensorFlow Serving ModelServer does that for you.
  • Most importantly, the TensorFlow team wrote TensorFlow Serving and the ModelServer for a reason. They are probably better than you at writing a high-performance serving system. Use it!

Great! Now that we’ve agreed on using TensorFlow Serving ModelServer to serve our model, it’s time to save the model. Here’s an outline for the rest of this section:

  • First we have to grab the input and output tensors.
  • Create a signature definition from the input and output tensors. The signature definition is what the model builder uses in order to save something a model server can load.
  • Save the model at a specified path where a server can load it from.

First we have to figure out which nodes are input and output nodes. Our simple math model is a + b = add. If we replace the constant, we get a + 10 = add. From placeholder a and operation add, we can grab input tensor a:0 and output tensor add:0.
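The :0 suffix follows a simple convention: a tensor name is the name of the operation that produces it, then a colon, then the index of that operation's output. Here is a minimal sketch of the convention in plain Python. The helpers tensor_name and split_tensor_name are hypothetical names I made up for illustration; they are not TensorFlow functions.

```python
def tensor_name(op_name, output_index=0):
    """Build a tensor name such as 'a:0' from an operation name."""
    return '{}:{}'.format(op_name, output_index)


def split_tensor_name(name):
    """Split a tensor name such as 'add:0' into (operation name, output index)."""
    op_name, _, index = name.rpartition(':')
    return op_name, int(index)


# The placeholder 'a' and the operation 'add' each have a single output,
# so both tensors use output index 0.
input_tensor_name = tensor_name('a')
output_tensor_name = tensor_name('add')
print(input_tensor_name, output_tensor_name)
```

This is the same string concatenation as placeholder_name + ':0', just spelled out.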

In the snippet below we grab our input and output tensors, and build a signature definition that we will use to save the model.

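# Imports these snippets rely on (TensorFlow 1.x)
from tensorflow.python.saved_model import builder as saved_model_builder
from tensorflow.python.saved_model import signature_constants
from tensorflow.python.saved_model import signature_def_utils
from tensorflow.python.saved_model import tag_constants
from tensorflow.python.saved_model.utils import build_tensor_info

# Pick out the model input and output
a_tensor = sess.graph.get_tensor_by_name(placeholder_name + ':0')
sum_tensor = sess.graph.get_tensor_by_name(operation_name + ':0')

model_input = build_tensor_info(a_tensor)
model_output = build_tensor_info(sum_tensor)

# Create a signature definition for tfserving
signature_definition = signature_def_utils.build_signature_def(
    inputs={placeholder_name: model_input},
    outputs={operation_name: model_output},
    method_name=signature_constants.PREDICT_METHOD_NAME)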

We’re going to pause here for a few seconds. There is an important thing about input and output naming I want to point out.

The names placeholder_name and operation_name are the strings ‘a’ and ‘add’. You can use whatever strings you like to name the input and output of your models; ‘inputs’ and ‘outputs’ are also fine names. In fact TensorFlow has defined some constants for us that we can use! These constants are defined in signature_constants.py, and there are three sets of constants: predictions, classification and regression. If you peek into signature_constants.py you’ll see that the input and output constants are ‘inputs’ and ‘outputs’.

I am going to use ‘a’ and ‘add’ to show you that you don’t have to use TensorFlow’s constants here. If you ever write a model with multiple inputs and outputs you have to come up with names yourself. In most cases using the constants makes sense. I am just showing you that it’s not required for your models to work with TensorFlow Serving ModelServer.

A place where you actually have to use a string TensorFlow has defined is the third keyword parameter, method_name. It must be one of tensorflow/serving/predict, tensorflow/serving/classify or tensorflow/serving/regress. They are also defined in signature_constants.py as PREDICT_METHOD_NAME, CLASSIFY_METHOD_NAME and REGRESS_METHOD_NAME. I am not sure exactly why we need this to save the model, and the documentation is quite poor here. The model server will give you an error if you don’t use one of these constants.
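If you build signature definitions in several places, it can be handy to catch a bad method name when you export rather than when the model server loads the model. The sketch below is plain Python and does not import TensorFlow. The three strings are the ones listed above, and check_method_name is a hypothetical helper of my own, not part of TensorFlow.

```python
# The three method names listed above, written out as plain strings.
PREDICT_METHOD_NAME = 'tensorflow/serving/predict'
CLASSIFY_METHOD_NAME = 'tensorflow/serving/classify'
REGRESS_METHOD_NAME = 'tensorflow/serving/regress'

ALLOWED_METHOD_NAMES = frozenset([
    PREDICT_METHOD_NAME,
    CLASSIFY_METHOD_NAME,
    REGRESS_METHOD_NAME,
])


def check_method_name(method_name):
    """Return method_name unchanged, or raise ValueError if it is not allowed."""
    if method_name not in ALLOWED_METHOD_NAMES:
        raise ValueError(
            'Unknown method name {!r}, expected one of {}'.format(
                method_name, sorted(ALLOWED_METHOD_NAMES)))
    return method_name


# A valid name passes straight through.
print(check_method_name(PREDICT_METHOD_NAME))

# A shortened name is rejected before anything is saved.
try:
    check_method_name('predict')
except ValueError as error:
    print(error)
```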

Alright, let’s finish saving the model. We pass the path where we want the model stored to the builder. The last part of the path is the model version. For a model that is actually trained on real data, increase this number each time you retrain it.
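If you retrain often, you may not want to type the version number by hand. Here is a minimal sketch, assuming versions are plain integer directory names under the model folder (as with models/simple_model/1). The helper next_version_dir is a hypothetical name, and the temporary directory only stands in for a real model folder.

```python
import os
import tempfile


def next_version_dir(model_base_path):
    """Return the path of the next integer version under model_base_path."""
    versions = []
    if os.path.isdir(model_base_path):
        for entry in os.listdir(model_base_path):
            is_dir = os.path.isdir(os.path.join(model_base_path, entry))
            # Only directories whose names are whole numbers count as versions.
            if entry.isdigit() and is_dir:
                versions.append(int(entry))
    next_version = max(versions) + 1 if versions else 1
    return os.path.join(model_base_path, str(next_version))


# Stand-in for ./models/simple_model, with version 1 already exported.
base_path = tempfile.mkdtemp()
os.mkdir(os.path.join(base_path, '1'))

# The returned path ends in '2' and does not exist yet.
print(next_version_dir(base_path))
```

The helper only computes the path and does not create the directory, so the result can be handed to the builder as the export path.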

Again, we will pass some constants that TensorFlow has defined for us to the builder. Pass the signature definition we defined in the previous snippet and save the model.

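builder = saved_model_builder.SavedModelBuilder('./models/simple_model/1')

builder.add_meta_graph_and_variables(
    sess, [tag_constants.SERVING],
    signature_def_map={
        signature_constants.DEFAULT_SERVING_SIGNATURE_DEF_KEY: signature_definition})

# Save the model so we can serve it with a model server :)
builder.save()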

Run the code and you should have your model ready for serving. If the code runs without errors you can find the model in models/simple_model/1.
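A quick way to confirm the export worked is to look for the saved_model.pb file (or saved_model.pbtxt if you saved as text) inside the version folder. The sketch below does that with the standard library only. find_saved_model_file is a hypothetical helper, and the temporary folder with an empty file is just a stand-in so the example runs without TensorFlow.

```python
import os
import tempfile


def find_saved_model_file(export_dir):
    """Return the path of the SavedModel file in export_dir, or None."""
    for filename in ('saved_model.pb', 'saved_model.pbtxt'):
        candidate = os.path.join(export_dir, filename)
        if os.path.isfile(candidate):
            return candidate
    return None


# Stand-in for models/simple_model/1 containing an (empty) saved_model.pb.
export_dir = tempfile.mkdtemp()
open(os.path.join(export_dir, 'saved_model.pb'), 'wb').close()
print(find_saved_model_file(export_dir))

# A folder without the file gives None.
print(find_saved_model_file(tempfile.mkdtemp()))
```

On your own machine you would call find_saved_model_file('./models/simple_model/1') after running the export script.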

3. Serving the model

If you finished the two previous steps without trouble, we are ready to serve the model now.

First, there are many ways to serve your models with TensorFlow Serving. Feel free to dive into the TensorFlow Serving source code and build your own model server if you like. I’ll stick to a simple CPU-compiled server for the rest of the tutorial.

You can find the model server we’re going to use here. It can be installed with apt-get, but I already built a Docker image with the server so I’ll use that. If you are not familiar with Docker, or you use a Linux system with apt, you can install the server directly on your system instead.

To serve the model all you need to do is run the file run_model.sh. I’ll take you through what’s going on when running that file.

docker run -it -p 9000:9000 --name simple -v $(pwd)/models/:/models/ epigramai/model-server:light --port=9000 --model_name=simple --model_base_path=/models/simple_model

The Docker image is publicly available and Docker will download it for you when you run the script. If the script runs fine, you now have a running container. A container is basically an instance of an image. We pass a lot of options and flags here, so I’ll explain what they all do.

  • When you do docker run, you run the image epigramai/model-server:light. The default entrypoint for this image is tensorflow_model_server. This means that when you run the container, you also start the model server.
  • Because the model is not built into the image (remember, the image is just the model server) we make sure the container can find the model by mounting (-v) the models/ folder to the container.
  • The -it option basically tells docker to show you the logs right in the terminal and not run in the background. The --name option is just the name of the container; this has nothing to do with TensorFlow or the model.
  • Then there’s the -p option, and this one is important. This option tells docker to map its internal port 9000 out to port 9000 of the outside world. The outside world in this case is your computer also known as localhost. If we omitted this option, the model server would serve your model on port 9000 inside the container, but you would not be able to send requests to it from your computer.
  • The three last flags are sent all the way to the model server. The port is 9000 (yep, the port we are mapping out to your machine). With the model_name flag we give our model a name. And with the last flag we tell the model server where the model is located. Again, the model is not in the image, but because we used the -v option and mounted the folder to the container, the model server can find the model inside the running container.

❕ If it looks like I am mixing flags and options here, I am not. The options before the image name (-it, -p, --name, -v) are for Docker, and the flags after the image name (--port, --model_name, --model_base_path) are parameters to the model server.
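To make the split between Docker options and model server flags concrete, here is a small sketch that assembles the same command as a Python list. Everything before the image name goes to Docker, everything after it goes to the model server. The function build_run_command is a hypothetical helper of mine; the values are copied from the command above.

```python
import os
import shlex


def build_run_command(models_dir, image='epigramai/model-server:light'):
    """Assemble the docker run command from the tutorial as a list of arguments."""
    # Options consumed by Docker itself.
    docker_options = [
        '-it',
        '-p', '9000:9000',
        '--name', 'simple',
        '-v', '{}:/models/'.format(models_dir),
    ]
    # Flags passed through to the model server inside the container.
    server_flags = [
        '--port=9000',
        '--model_name=simple',
        '--model_base_path=/models/simple_model',
    ]
    return ['docker', 'run'] + docker_options + [image] + server_flags


# Same mount as $(pwd)/models/ in run_model.sh.
command = build_run_command(os.path.join(os.getcwd(), 'models') + '/')
print(' '.join(shlex.quote(part) for part in command))
```

The sketch only prints the command. If you want to launch the container from Python, passing the list to subprocess.call should work, but I have only run it through the shell script.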

If you see an output like this, you are now serving the model with TensorFlow Serving!

❕ The model server installed in epigramai/model-server:light is compiled for newer CPUs. Most computers these days should be able to use it. If you get an error when you run the container, you could try this image instead: epigramai/model-server:light-universal. If you read the install instructions on the TensorFlow Serving pages, you can also find tensorflow-model-server-universal there. That’s the same one I’ve built into epigramai/model-server:light-universal.

Congrats! You are now serving a model with TensorFlow Serving. I wish I could tell you that the model server is a normal HTTP server and that you can POST your data and get a response. Unfortunately, it’s not that simple. That’s why I chose to split this tutorial in two parts. The model server runs something called a gRPC service, and I’ll tell you what you need to know about it (and how to send requests) in part 2.
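Until then, you can at least check that the port mapping works without speaking gRPC, by testing whether port 9000 on localhost accepts a TCP connection. This is a minimal sketch using only the standard library, and port_is_open is a hypothetical helper. Keep in mind that it only tells you something is listening on the port, not that it is the model server or that the model loaded correctly.

```python
import socket


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host not reachable.
        return False


# True while the container from run_model.sh is running, False otherwise.
print(port_is_open('localhost', 9000))
```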

You probably want to clean up a bit. If you are familiar with Docker, you know that hitting ctrl + c should stop a running container. Unfortunately, it doesn’t work in this case (at least for me, and I am not really sure why). So open a new terminal tab and run:

docker stop simple && docker rm simple

If you cannot open a new terminal tab, it’s possible to hit ctrl + p followed by ctrl + q to detach from the container and get your terminal back.

I like to point things out, so I am going to point out that simple in the two commands above is related to Docker and is the container name. We gave the model and container the same name.

Is this really production ready? What happened to GPUs?

The short answer to this question is: Yes, the model you saved can be hosted on any TensorFlow Serving system. Yay! If your model is complex and should be served on accelerated hardware like GPUs, you cannot use the Docker image / the apt-get version of the model server we use in this tutorial. I am not going to include GPU serving in the scope of this tutorial, but I may write up a new blog post on that topic if I get any requests from readers.

Thank you for reading, I really hope you learned something! Leave a comment if you have questions or comments. Looking forward to seeing you in part 2 😃



Stian Lind Petlund

CTO @Dagens, dagens.farm, building the new food system. Former Co-founder of Epigram.