Tensorflow serving: REST vs gRPC

Avidan Eran
4 min read · Aug 5, 2018

A few months ago TensorFlow released a RESTful API for TensorFlow Serving. As someone who has experienced the pain of working with the gRPC client, laboriously preparing a request for even a simple inference, I was very excited. Finally, a simple-to-use API. I decided to give it a try and test its performance a bit. Following is my personal experience, along with some code that may help people working with TensorFlow Serving.

I went with the simplest model, MNIST, and followed the instructions carefully. After a few tries I got a REST request sent and received the desired results. Following is a piece of working code for sending a REST request to the MNIST model.
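A minimal sketch of such a request, assuming the server from the next section (TensorFlow Serving listening locally on port 8501 with --model_name=model) and a signature that accepts flattened 28x28 float images, looks roughly like this:

import json
import numpy as np
import requests

# REST predict endpoint exposed by TensorFlow Serving for model name "model"
SERVER_URL = 'http://localhost:8501/v1/models/model:predict'

# a batch of 100 flattened 28x28 "images"; real MNIST pixels would go here
batch = np.random.rand(100, 784).astype(np.float32)

payload = json.dumps({'instances': batch.tolist()})
response = requests.post(SERVER_URL, data=payload)
response.raise_for_status()
predictions = response.json()['predictions']
print(len(predictions), 'predictions received')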

Since one of the claims for using gRPC is speed, I wanted to run some tests and see how this API affects the inference rate. I decided to run some local tests and compare rates between the gRPC and REST APIs.

First, I needed to spin up a couple of containers with MNIST serving services. I decided to use the prebuilt TensorFlow Serving binary, and wrote a nice minimal Dockerfile in the process (235 MB image).

I copied the Dockerfile to my system, and built the image:

$ cd [where you copied the Dockerfile]
$ docker build --build-arg http_proxy=${http_proxy} --build-arg https_proxy=${https_proxy} -t tensorflow-model-server .

Then I ran two containers, one for the gRPC API and the other for the REST API:

$ docker run --rm -p 8500:8500 --name tf_serving_mnist_grpc -v TF-Serving/model/mnist_model:/model tensorflow-model-server tensorflow_model_server --model_name=model --model_base_path=/model
$ docker run --rm -p 8501:8501 --name tf_serving_mnist_rest -v TF-Serving/model/mnist_model:/model tensorflow-model-server tensorflow_model_server --rest_api_port=8501 --model_name=model --model_base_path=/model

Running a simple benchmark on my machine (a MacBook Pro) with 1,000 synchronous requests and a batch size of 100 images per request had some surprising results. The inference rate was in favor of the REST API, though, as expected, the request payload was twice the size when using REST. I ran this test several times and got the same results.

REST

Inference rate: 1,729 img/sec

Network: 620 MB

gRPC

Inference rate: 1,239 img/sec

Network: 320 MB

The code I used to run this benchmark can be found here.
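The core of such a benchmark is just a timed loop. A rough sketch of the REST side (the gRPC side is analogous, using a PredictionServiceStub against port 8500; the flattened-image layout is again an assumption about the exported model):

import json
import time
import numpy as np
import requests

SERVER_URL = 'http://localhost:8501/v1/models/model:predict'
NUM_REQUESTS = 1000
BATCH_SIZE = 100

# one batch of 100 flattened 28x28 images, reused for every request
batch = np.random.rand(BATCH_SIZE, 784).astype(np.float32)

start = time.time()
for _ in range(NUM_REQUESTS):
    payload = json.dumps({'instances': batch.tolist()})  # serialized on every call
    requests.post(SERVER_URL, data=payload).raise_for_status()
elapsed = time.time() - start
print('REST: %.0f img/sec' % (NUM_REQUESTS * BATCH_SIZE / elapsed))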

This got me thinking that maybe the issue is with the serialization part of the request and not with the request protocol itself. As we all know, gRPC is supposed to be much faster than REST in terms of the network.

Removing the serialization step from the gRPC client, and sending the same prepared request over and over again, indeed increased the inference rate dramatically, to 25,961 img/sec. Doing the same for REST, and sending the same already-serialized JSON request, increased the inference rate as well, but not as much: 7,680 img/sec, giving gRPC the advantage by a factor of ~3.5. This suggests that a lot of the overhead is in the transformation of the Numpy array into a tensor protobuf or JSON. That actually makes sense when working locally, as network bandwidth is less of an issue.

REST (serialized once)

Inference rate: 7,680 img/sec

Network: 620 MB

gRPC (serialized once)

Inference rate: 25,961 img/sec

Network: 320 MB
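The "serialized once" variant simply builds the request outside the timed loop. A sketch of the gRPC side (the input key 'images' and the use of tf.make_tensor_proto are assumptions about how the MNIST model was exported):

import time
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel('localhost:8500')
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

batch = np.random.rand(100, 784).astype(np.float32)

# build the request a single time, outside the timed loop
request = predict_pb2.PredictRequest()
request.model_spec.name = 'model'
request.inputs['images'].CopyFrom(tf.make_tensor_proto(batch))

start = time.time()
for _ in range(1000):
    stub.Predict(request, 10.0)  # 10 second timeout
elapsed = time.time() - start
print('gRPC (serialized once): %.0f img/sec' % (1000 * 100 / elapsed))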

Now, the question was whether the type and size of the object being sent impact the serialization time, and how. Indeed, checking only the preparation of the requests (both gRPC and REST) showed that when using Numpy arrays as input, gRPC is a little slower than REST. Using a raw PNG image (basically a string) as input, REST turns out to be much slower (~6x) than gRPC.

REST (preparation only)

Image: 2,148 img/sec

Numpy array: 1,090 img/sec

gRPC (preparation only)

Image: 14,490 img/sec

Numpy array: 1,249 img/sec

These results suggest that conversion to protobuf is impacted by the object structure, while the conversion to JSON in the REST case is probably more straightforward, and is affected much less by the object's complexity and more by its size.

I decided to investigate further and compare the serialization of strings and Numpy arrays of different sizes, as they are the most commonly used inputs in TensorFlow. The experiment was simple: prepare 10,000 REST or gRPC requests using either a Numpy array or a string as input, with four different input sizes (between 1,000 and 1M characters or array entries).

The results showed that preparing a gRPC request with a string is much faster than with a Numpy array. In the REST case, the difference in preparation rate between strings and Numpy arrays is much less noticeable, but still in favor of the string.

The code used for the requests preparation benchmark can be found here.
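The gist of that benchmark is to time only the request construction, without sending anything. A simplified sketch (the input key names are assumptions, and the repetition count is scaled down from the original 10,000 so the largest inputs finish in reasonable time):

import json
import time
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2

REPS = 1000  # scaled down from the 10,000 preparations used in the experiment

def prep_grpc(data):
    # build and serialize a PredictRequest: the gRPC "preparation" cost
    request = predict_pb2.PredictRequest()
    request.model_spec.name = 'model'
    request.inputs['inputs'].CopyFrom(tf.make_tensor_proto(data))
    return request.SerializeToString()

def prep_rest(data):
    # build the JSON body: the REST "preparation" cost
    instance = data.tolist() if isinstance(data, np.ndarray) else data
    return json.dumps({'instances': [instance]})

for size in (1000, 10000, 100000, 1000000):
    cases = {'string': 'x' * size,
             'numpy': np.random.rand(size).astype(np.float32)}
    for kind, data in cases.items():
        for name, prep in (('gRPC', prep_grpc), ('REST', prep_rest)):
            start = time.time()
            for _ in range(REPS):
                prep(data)
            print('%s, %s input of size %d: %.0f requests/sec'
                  % (name, kind, size, REPS / (time.time() - start)))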

To conclude, it seems that serialization to the TensorFlow protobuf is less "consistent", time-wise, than serialization to plain JSON, though it is more efficient size-wise. I would like to test this on more complex objects, but for now it seems that if you have simple, big inputs, gRPC will be much faster. With more complex objects as inputs (such as arrays and matrices), up to a certain size, REST with JSON should be faster (as we have seen in the MNIST example tested locally). However, the requests themselves (and probably their processing on the server side) are much faster using gRPC, so bandwidth should be put into the equation as the input size grows.

