TensorFlow Serving 101 pt. 2

Part 2: Sending requests to your model

Stian Lind Petlund
epigramAI
8 min read · Jan 28, 2018


Welcome back! In this part I’ll take you through the process of sending requests to your model. If you haven’t finished part 1 yet, I strongly recommend you do that first.

As I mentioned in part 1, the model server is not a regular HTTP endpoint that you can just send a POST request with your data to. It’s a gRPC service, and even though it may seem complicated at first, it’s actually a pretty cool concept.

What you need to know about gRPC

From gRPC’s own pages:

In gRPC a client application can directly call methods on a server application on a different machine as if it was a local object, making it easier for you to create distributed applications and services. As in many RPC systems, gRPC is based around the idea of defining a service, specifying the methods that can be called remotely with their parameters and return types.

This basically means that if you have an app where a user wants to see all his or her friends, then when the user clicks “My Friends” in the UI, the client simply calls the function getFriends(). No need to send a GET request to the endpoint /getFriends. That sounds like a great concept! The server-side code is automatically available remotely in the client. No more GET, POST, PUT; just call the methods! But wait, how can the client know that there’s a method called getFriends on the server? Is the client running the server-side code too?

Actually, the answer is (almost) yes. The gRPC client side has a stub that provides all the methods the server has. Of course, when the data travels over the wire, it’s no more magic than HTTP going on. But the way you as a developer write the communication is different!
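To make the idea concrete, here is a minimal sketch of what such a client-side call could look like in Python with grpcio. Everything here is hypothetical: FriendServiceStub and FriendRequest stand in for classes that the protocol buffer compiler would generate from a .proto service definition.

import grpc

# Hypothetical generated classes; in a real project they would come from
# code generated by the protocol buffer compiler.
from friends_pb2_grpc import FriendServiceStub
from friends_pb2 import FriendRequest

channel = grpc.insecure_channel('api.example.com:50051')
stub = FriendServiceStub(channel)

# Looks like a local method call, but the request travels over the wire.
friends = stub.GetFriends(FriendRequest(user_id=42))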

We’re not aiming to become gRPC experts in this tutorial (and I am not an expert on gRPC), but if you want to learn more I can recommend this 30 minute video. Don’t watch it now, do it when you have finished the tutorial.

A good introduction to gRPC.

For now, let’s focus on using a client I’ve already built. You don’t need to understand or know anything about gRPC to complete the next section.

Send requests to the model server

The client works without TensorFlow installed. So if you are using conda or virtualenv (or a similar environment tool), create a new environment (without TensorFlow, of course…) if you don’t believe me.

The reason we don’t want TensorFlow client side is because anyone who wants to integrate with your model shouldn’t have to deal with TensorFlow to use it.

Imagine you wrote a kick-ass image classifier that can separate hot dogs from not hot dogs. Now your friends who developed an awesome food app want to use your model, but they know nothing about TensorFlow. So they say: “Hey, what’s the endpoint? Should we just POST an image to your API or …”.

Because you are serving the model with TensorFlow Serving, and you know that none of your app-making friends have ever heard of gRPC, you hand them the client and show them how to use it. Give them a hostname, a model name and a model version number, and they’re all set!

Now, go ahead and install the client. I should probably submit the client to PyPI, but for now:

pip install git+ssh://git@github.com/epigramai/tfserving-python-predict-client.git

❗️Some readers have had trouble installing the client with the ssh link. If that’s the case for you, try pip install git+https://github.com/epigramai/tfserving-python-predict-client.git instead.

Make sure the model you saved in part 1 is running. Start it again if you stopped the container when we cleaned up at the end of part 1. Then start Python and test the client:

>>> from predict_client.prod_client import ProdClient
>>> client = ProdClient('localhost:9000', 'simple', 1)
>>> req_data = [{'in_tensor_name': 'a', 'in_tensor_dtype': 'DT_INT32', 'data': 2}]
>>> client.predict(req_data)

I’ll go through what’s happening here in detail:

  • The first line imports the client. ProdClient is the client you would use to send requests to an actual model server. I’ve written two other clients as well; they serve different purposes. More on those clients later.
  • Pass the host, model name and version to the client’s constructor. Now the client knows which server it’s supposed to communicate with.
  • To do a prediction we need to send some request data. As far as the client knows, the server on localhost:9000 could host any model, so we need to tell the client what our input tensor(s) are called and what their data types are, and send data of the correct type and shape (exactly what you do when you pass a feed_dict to sess.run). The in_tensor_name ‘a’ is the same ‘a’ we used in the signature definition in part 1, and the input tensor’s data type must match that of the placeholder a from part 1. Note that we use the string ‘DT_INT32’ (and not tf.int32); I will show you where to find this mapping in the next section.
  • Finally call client.predict and you should get this response:
{'add': 12}

Perfect! a + b = add, or in this case with actual values: 2 + 10 = 12.
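If your model’s signature exposed more than one input tensor, the request data would simply contain one dict per input. A hedged example, assuming a hypothetical model with two integer inputs named ‘x’ and ‘y’:

>>> req_data = [{'in_tensor_name': 'x', 'in_tensor_dtype': 'DT_INT32', 'data': 2},
...             {'in_tensor_name': 'y', 'in_tensor_dtype': 'DT_INT32', 'data': 5}]
>>> client.predict(req_data)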

That’s pretty much all you need to know to send requests. The next section is for those of you who want to dig deeper and understand how the client uses gRPC.

More about gRPC and the client

I am not going to make this section very long, but I’ll try to take you through the process of sending a request and getting a response in more detail. I hope that will answer some of your questions (the same questions I had) about how to use TensorFlow Serving and gRPC.

All client code can be found here: https://github.com/epigramai/tfserving-python-predict-client. If you choose to clone the repo and play around with the code, make sure to uninstall the client we installed with pip earlier on.

You don’t need to understand the stuff we go through in this section to use the client. It’s just a more detailed explanation for those who want to learn more.

Protocol Buffers

Look at the files in the /proto folder. These files define a gRPC service for sending and receiving requests. They don’t look much like Python code, and that’s because they aren’t. These files can be used with any language to create a client, you just need the protocol buffer compiler protoc to compile them.

The Python gRPC tools include the protoc compiler, and they can be installed with pip:

python -m pip install grpcio-tools

To generate Python code, run:

python -m grpc_tools.protoc -I protos/ --python_out=predict_client/pbs --grpc_python_out=predict_client/pbs protos/*

❗️The generated files already exist in the repo, so you don’t have to do this! If you run this command you may have to fix some import statements to make the client work again :)

The files this command generates end up in predict_client/pbs/. I know some of these generated files are almost empty, and you are probably wondering what all this generated code does. The files in this folder are used by the production client; that’s all you need to know. Think of the content in this folder as compiled code, and don’t touch it.

Sending a request

Open predict_client/prod_client.py, and take a look at these two lines:

stub = PredictionServiceStub(channel)
...
request = PredictRequest()

What we’re doing here is creating the PredictionService which is defined in prediction_service.proto. The service object on the client is also known as a stub in the gRPC world. Then we create a PredictRequest message defined in predict.proto.
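Put together, the setup looks roughly like this. It’s a sketch: the import paths of the generated modules are assumptions on my part, so check predict_client/prod_client.py and predict_client/pbs/ for the real ones.

import grpc

# The module names below are assumptions based on the .proto file names.
from predict_client.pbs.prediction_service_pb2_grpc import PredictionServiceStub
from predict_client.pbs.predict_pb2 import PredictRequest

# Open a gRPC channel to the model server, create the stub and an empty request.
channel = grpc.insecure_channel('localhost:9000')
stub = PredictionServiceStub(channel)
request = PredictRequest()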

We assign the model spec in PredictRequest:

request.model_spec.name = self.model_name
if self.model_version > 0:
    request.model_spec.version.value = self.model_version

Now, if you peek into predict.proto you can see that PredictRequest has the field model_spec, and in model.proto you find the ModelSpec with fields name and version.

Create tensor proto that is compatible with the model we are serving:

tensor_proto = make_tensor_proto(d['data'], d['in_tensor_dtype'])

In the function predict_client.util.make_tensor_proto, you can see how I also assign DataType and TensorShapeProto to the TensorProto message. This is where the upper case DT_INT32 data type comes in. Look in types.proto to see all data types.
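As a rough sketch, building a tensor proto for the scalar int 2 boils down to something like the snippet below. The field names (dtype, tensor_shape, int_val) come from tensor.proto, types.proto and tensor_shape.proto; the generated module paths are assumptions, so look at predict_client/util.py for the real code.

from predict_client.pbs.tensor_pb2 import TensorProto
from predict_client.pbs.tensor_shape_pb2 import TensorShapeProto
from predict_client.pbs import types_pb2

# A scalar int32: the dtype enum value comes from types.proto, the shape
# (no dimensions for a scalar) from tensor_shape.proto, and the actual
# value goes into the repeated int_val field.
tensor_proto = TensorProto(dtype=types_pb2.DT_INT32,
                           tensor_shape=TensorShapeProto(),
                           int_val=[2])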

Add the tensor proto to the inputs map of the PredictRequest:

request.inputs[d['in_tensor_name']].CopyFrom(tensor_proto)

The line map<string, TensorProto> inputs = 2; in predict.proto corresponds to the parameter inputs that we send to the signature definition back in part 1. We chose ‘a’ as our input tensor, that’s why the client must send a tensor proto with ‘a’ as the key in the inputs dictionary.

Finally we call stub.Predict() in the Python code and the request is sent.

Receiving the response

I’ll be a little less verbose as we go from the server back to the client:

predict_response = stub.Predict(request, timeout=request_timeout)

At the bottom of predict.proto, PredictResponse is defined. This message has one field: map<string, TensorProto> outputs = 1;. As you probably guessed already, the key in this map is our ‘add’ from part 1.

The rest of the process is simply converting this response to a dictionary.
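As a hedged sketch of that last step, assuming a scalar DT_INT32 output like our ‘add’ tensor (the real code in the repo is more general and handles other dtypes and shapes):

# predict_response.outputs is a map<string, TensorProto>; for our model it
# has a single entry with key 'add'. For a scalar DT_INT32 output the value
# lives in the repeated int_val field of the TensorProto.
result = {name: tensor_proto.int_val[0]
          for name, tensor_proto in predict_response.outputs.items()}
# result == {'add': 12}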

InMemoryClient and MockClient

There are two other clients you can use. They are for testing and mocking, and they don’t send requests to a server. Documentation on how to use them can be found in the readme.

The mock client is, as the name says, just a mock. Initialize it with a response in the same format you would expect from a real model. Use this client when you are developing locally, don’t have the model served, and correct predictions don’t matter. For instance, when you are developing an API on your laptop and the API expects a response from a heavy model, the mock client lets you pretend to get the response from a real model server.
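For example, setting up a mock client could look roughly like this. The import path and constructor signature are assumptions on my part, so check the readme for the actual usage.

# Hypothetical setup: give the mock client the response you want it to
# return, in the same format a real model server response would have.
from predict_client.mock_client import MockClient

client = MockClient({'add': 12})
client.predict(req_data)  # returns {'add': 12} without contacting a server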

The in memory client is another way to pretend you have the model served. Instead of using fake data, we can “serve” the model instead. The in memory client loads the model into memory and acts as both client and server. This is convenient if you don’t have a model server at hand, but want to simulate how the client would work in a production setting.

❕ NB: TensorFlow must be installed to use the in memory client.

Wrapping it up

There are other prediction clients in open source GitHub repos out there, but there are a few reasons I wanted to build my own:

  • I wanted a client that is generic and supports all types of models where you send a dict and get a dict in response. And it should be installable and ready to use out of the box.
  • Until now I’ve only found Python client code that uses tf.contrib.util.make_tensor_proto to make the tensor proto. In my opinion the client should be TensorFlow agnostic: you shouldn’t have to install and import TensorFlow to send requests to an already trained model.
  • Once I realized I had to understand how this actually works anyway, I wanted to build my own and share that knowledge with the community.

Alright, that’s it for now. I hope you learned something about TensorFlow Serving and how you can use this client in your own projects even outside TensorFlow environments :)

Thank you for reading. I really hope you learned something! Leave a comment if you have questions or comments 😃

In the end I want to thank a few people:
