TensorFlow Serving of Multiple ML Models Simultaneously to a REST API Python Client

Dr Stephen Odaibo
The Blog of RETINA-AI Health, Inc.
Oct 25, 2019 · 7 min read

The final mile in any machine learning project is deployment of the solution so that it can do what it was created to do: improve the lives of people. Once our models have been trained and we are satisfied with the model accuracy, the next thing is to deploy. And if our intention is to deploy into production at scale, then TensorFlow Serving on GPUs is currently the way to go. Additionally, depending on the complexity of the data science solution, multiple models are often used in ensemble or in linear cascade. Here I discuss how to:

  1. Set up a TensorFlow model server on a GPU-enabled machine,
  2. Host multiple models on the server simultaneously, and
  3. Send image classification requests to the server from a RESTful API Python client.

A) BASIC INSTALLATIONS:

To use GPUs you cannot use garden-variety Docker; you will instead need to install a version of nvidia-docker (1 or 2). I strongly recommend nvidia-docker 2. Of note, nvidia-docker 1 has been deprecated and is already having compatibility issues with recent TensorFlow versions. If you have nvidia-docker 1 on your machine, you will need to remove it with the commands below (if you don't have it installed, skip to the next block). What follows is the nvidia-docker 2 installation process for the Ubuntu 18.04 operating system, which is what I use:

First, remove nvidia-docker (if you have it)

$ docker volume ls -q -f driver=nvidia-docker | xargs -r -I{} -n1 \
  docker ps -q -a -f volume={} | xargs -r docker rm -f
$ sudo apt-get purge nvidia-docker

Next, add the NVIDIA package repositories with curl:

$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | \
sudo apt-key add -

$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)

$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list

$ sudo apt-get update
$ sudo apt-get install -y nvidia-docker2
$ sudo pkill -SIGHUP dockerd

Next, run nvidia-smi to verify that everything works and to confirm compatibility between your CUDA version and NVIDIA driver version. On my machine I got the following:

(Screenshot: output of $ nvidia-smi)

On my system, I specifically had to install a driver (430.26) that was compatible with my CUDA version 10.2. The table below shows which NVIDIA driver versions are compatible with which CUDA versions:

(CUDA version / NVIDIA driver compatibility table, from docs.nvidia.com)
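Although the original walkthrough does not include this step, a quick sanity check at this point is to confirm that containers can now see the GPU. The nvidia/cuda:10.1-base image tag below is an assumption; substitute a tag that matches your CUDA version:

$ docker run --runtime=nvidia --rm nvidia/cuda:10.1-base nvidia-smi

If everything is wired correctly, this prints the same nvidia-smi table from inside the container.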

B) SETUP TENSORFLOW SERVING

Obtain the latest tensorflow/serving Docker image for GPU with the following command:

$ docker pull tensorflow/serving:latest-gpu

A successful pull process will look as follows:

Next, clone the TensorFlow Serving repository as follows:

$ mkdir -p /tmp/tfserving
$ cd /tmp/tfserving
$ git clone https://github.com/tensorflow/serving

C) MODEL SERVER FOR SINGLE MODEL

To setup the server to serve a single model, enter the following command:


$ docker run --runtime=nvidia -p 8501:8501 \
  --mount type=bind,source=/tmp/tfserving/serving/tensorflow_serving/servables/tensorflow/testdata/saved_model_half_plus_two_gpu,target=/models/half_plus_two \
  -e MODEL_NAME=half_plus_two -t tensorflow/serving:latest-gpu &

The above command can be understood piece by piece. --runtime=nvidia instructs Docker to use the NVIDIA runtime so the container can access the GPU. -p 8501:8501 publishes the container's port 8501, which TensorFlow Serving designates for RESTful messaging, on port 8501 of the host machine. The --mount type=bind,source=<model location on host>,target=<model location in container> option binds the location of the ML model on the host machine to the location where it will appear inside the container. -e MODEL_NAME=<model name> sets the environment variable holding the model name to whatever name we've chosen for our model. Finally, tensorflow/serving:latest-gpu names the TensorFlow Serving image we are running (with -t allocating a terminal). Notably, each of these pieces has a default setting we should be aware of; for instance, the target location of the model in the container defaults to /models/model, i.e. model_base_path = /models/model by default.

A successful run of the above command will yield a screen that looks like this:

Server successfully launched and listening on port 8501 for REST messages
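A quick way to confirm that the server is serving (not shown in the original screenshots) is to hit the model-status and predict endpoints with curl; the half_plus_two test model simply returns x/2 + 2 for each input:

# Check model status; the reported state should be AVAILABLE
$ curl http://localhost:8501/v1/models/half_plus_two

# Send a test prediction; expect { "predictions": [2.5, 3.0, 4.5] }
$ curl -d '{"instances": [1.0, 2.0, 5.0]}' \
  -X POST http://localhost:8501/v1/models/half_plus_two:predict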

D) PYTHON CLIENT FOR SINGLE MODEL

For the Python client we'll use the requests package, as shown below.

The base64 package is used to convert our image into a b64-encoded string, which is arranged into JSON as shown. The JSON predict request is passed as the data parameter of a requests.post message sent to the server. The server is reached via its URL, in this case http://localhost:8501/v1/models/resnet:predict. Dissecting the URL, we see that the server is running on the local host and listening for RESTful messages on port 8501; v1 is the version prefix of the TensorFlow Serving REST API; and resnet is the name of the model we are querying, located under /models/resnet in the Docker container. The request is for a 'prediction.' The prediction returns in resp and can be exposed by examining a utf-8 decoding of its contents.
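The client code itself appeared as a screenshot in the original post; the following is a minimal sketch along the same lines. The JSON layout follows the standard TensorFlow Serving REST predict format for a model (here named resnet) that accepts base64-encoded image bytes; the image file name is a placeholder, so substitute your own:

import base64
import json
import requests

# REST predict endpoint for the model named "resnet"
SERVER_URL = 'http://localhost:8501/v1/models/resnet:predict'

# Read the image and encode it as a base64 string (file name is hypothetical)
with open('cat.jpg', 'rb') as f:
    image_b64 = base64.b64encode(f.read()).decode('utf-8')

# Arrange the request in the TF Serving JSON predict format
predict_request = json.dumps({'instances': [{'b64': image_b64}]})

# Send the request and examine a utf-8 decoding of the response contents
resp = requests.post(SERVER_URL, data=predict_request)
print(resp.content.decode('utf-8'))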

E) MODEL SERVER FOR MULTIPLE MODELS

The model config file is the key ingredient needed to set up a TensorFlow model server that can hold multiple models in the same Docker container and serve them through a common port.

There are a number of ways to implement this. My preferred approach is to launch the server de novo so that all ports and resources are bound and wired at launch time. This lets us use essentially the same Python client we used for the single-model case, so the approach is both easier to set up and easier to use. The alternative approach requires a gRPC client and involves several more steps at set-up. To proceed with our preferred approach, use a command along the lines shown below.
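The exact launch command appeared as a screenshot in the original post; the sketch below conveys the same idea, assuming two hypothetical models, model_one and model_two, stored under /tmp/models on the host alongside a model_config.config file:

$ docker run --runtime=nvidia -p 8501:8501 \
  --mount type=bind,source=/tmp/models/model_one,target=/models/model_one \
  --mount type=bind,source=/tmp/models/model_two,target=/models/model_two \
  --mount type=bind,source=/tmp/models/model_config.config,target=/models/model_config.config \
  -t tensorflow/serving:latest-gpu \
  --model_config_file=/models/model_config.config &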

The server launch command is similar to the single-model case, with the exception of model_config.config. This model config file is bound from its host location to the desired location in the container and, as shown, it is also passed to the server as the --model_config_file argument.
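The contents of the config file were likewise shown as a screenshot; a minimal sketch, using the same hypothetical model names and container paths as above, looks like this:

model_config_list {
  config {
    name: "model_one"
    base_path: "/models/model_one"
    model_platform: "tensorflow"
  }
  config {
    name: "model_two"
    base_path: "/models/model_two"
    model_platform: "tensorflow"
  }
}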

Successful launching of this server should look something like this:

Successful launching of TensorFlow Server with multiple models
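As an optional check (not in the original post), each model's status can be queried through the same REST port; the model names here match the hypothetical config above:

$ curl http://localhost:8501/v1/models/model_one
$ curl http://localhost:8501/v1/models/model_two

Each call should report the loaded version and a state of AVAILABLE.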

F) PYTHON CLIENT FOR MULTIPLE MODELS

The Python client for the multiple-model server is essentially identical to that for the single model. The only difference is that you select the model you want by naming it in the URL of the request, as shown below.
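The original client code was shown as a screenshot; a minimal sketch, reusing the single-model client above with the hypothetical model names from the config file, only changes the URL:

import base64
import json
import requests

# Select the model by naming it in the URL; names come from model_config.config
URL_MODEL_ONE = 'http://localhost:8501/v1/models/model_one:predict'
URL_MODEL_TWO = 'http://localhost:8501/v1/models/model_two:predict'

with open('cat.jpg', 'rb') as f:  # image file name is hypothetical
    image_b64 = base64.b64encode(f.read()).decode('utf-8')

payload = json.dumps({'instances': [{'b64': image_b64}]})

# Query each model through the same REST port, 8501
for url in (URL_MODEL_ONE, URL_MODEL_TWO):
    resp = requests.post(url, data=payload)
    print(resp.content.decode('utf-8'))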

SUCCESS! Together, we have set up a TensorFlow server on a GPU-enabled host machine; we've hosted two different machine learning models on the server and exposed a single RESTful port, 8501; we have queried each of the ML models on the server about the class of an image; and finally, we have received prediction responses which we successfully examined.

G) CONCLUSION

The above is a demonstration of how to serve multiple ML models with TensorFlow Serving to a Python RESTful API client. This approach to ML model serving combines the scalability and production-readiness of TensorFlow Serving with the ease of use of a RESTful API on the client side. I hope you found this article helpful.

REFERENCES

nvidia-docker2 installation:
https://devblogs.nvidia.com/gpu-containers-runtime/

nvidia-driver / CUDA compatibility:
https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html

nvidia-docker2 installation (tutorial):
https://medium.com/@sh.tsang/docker-tutorial-5-nvidia-docker-2-0-installation-in-ubuntu-18-04-cb80f17cac65

On getting the server running:
https://www.tensorflow.org/tfx/serving/docker

BIO

Dr. Stephen G. Odaibo is CEO & Founder of RETINA-AI Health, Inc., and is on the Faculty of the MD Anderson Cancer Center. He is a Physician, Retina Specialist, Mathematician, Computer Scientist, and Full Stack AI Engineer. In 2017 he received the UAB College of Arts & Sciences' highest honor, the Distinguished Alumni Achievement Award. And in 2005 he won the Barrie Hurwitz Award for Excellence in Neurology at Duke University School of Medicine, where he topped the class in Neurology and in Pediatrics. He is author of the books "Quantum Mechanics & The MRI Machine" and "The Form of Finite Groups: A Course on Finite Group Theory." Dr. Odaibo chaired the "Artificial Intelligence & Tech in Medicine Symposium" at the 2019 National Medical Association Meeting. Through RETINA-AI, he and his team are building AI solutions to address the world's most pressing healthcare problems. He resides in Houston, Texas with his family.

www.retina-ai.com
