Serving machine learning models using TensorFlow Serving

Guru Prasad Natarajan
Mindboard
Jun 5, 2019

This post is a continuation of the previous post, where we deployed a machine learning application using Docker, Flask, nginx, and Gunicorn. In this post, we will explore how TensorFlow models can be served using TensorFlow Serving.

TensorFlow Serving: An open-source software library for serving machine learning models. It focuses on the inference aspect of machine learning: taking trained models and managing their lifetimes in production. It has out-of-the-box support for TensorFlow models.

Source: TensorFlow documentation

Serving TensorFlow models with Docker: One of the easiest ways to serve machine learning models is by using TensorFlow Serving with Docker. Docker is a tool that packages software into units called containers that include everything needed to run the software.

The first step is to install Docker CE. This will provide you with all the tools you need to run and manage Docker containers.

TensorFlow Serving uses the SavedModel format for its ML models. A SavedModel is a language-neutral, recoverable, hermetic serialization format that enables higher-level systems and tools to produce, consume, and transform TensorFlow models.

For simplicity, let's say we trained a model to classify handwritten digits on the MNIST dataset using Keras. The trained model may initially be saved as an ".h5" file or in some other format, but for TensorFlow Serving it has to be exported as a SavedModel into a version-numbered directory (for example /tmp/mnist/1).
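As a minimal sketch of that step (assuming TensorFlow 2.x; the architecture, hyperparameters, and paths are illustrative, not taken from the original post), the following trains a small Keras classifier on MNIST and exports it to /tmp/mnist/1, the version-numbered layout TensorFlow Serving expects:

# Minimal sketch: train a small Keras model on MNIST and export it as a
# SavedModel under /tmp/mnist/1 (base path /tmp/mnist, model version 1).
import tensorflow as tf

(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
x_train = x_train.astype("float32") / 255.0

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1)

# Export in SavedModel format; "1" is the version directory TF Serving scans for.
tf.keras.models.save_model(model, "/tmp/mnist/1")

TensorFlow Serving serves the highest version number it finds under the base directory, so newer versions can later be dropped in alongside version 1.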

Now that we have our model, serving it with Docker is as easy as pulling the latest TensorFlow Serving image and pointing it at the model:

$ docker pull tensorflow/serving
$ docker run -p 8501:8501 --name mnist_classifier \
--mount type=bind,source=/tmp/mnist,target=/models/mnist_v1 \
-e MODEL_NAME=mnist_v1 -t tensorflow/serving &

… main.cc:327] Running ModelServer at 0.0.0.0:8500 …
… main.cc:337] Exporting HTTP/REST API at: localhost:8501 …

Breaking down the command line arguments, we have:

· -p 8501:8501 : Publishing the container’s port 8501 (where TF Serving responds to REST API requests) to the host’s port 8501

· --name mnist_classifier: Giving the container we are creating the name “mnist_classifier” so we can refer to it later

· --mount type=bind,source=/tmp/mnist,target=/models/mnist_v1 : Mounting the host’s local directory (/tmp/mnist) into the container (/models/mnist_v1) so TF Serving can read the model from inside the container

· -e MODEL_NAME=mnist_v1 : Telling TensorFlow Serving to load the model named “mnist_v1”, which it looks for under /models/mnist_v1 inside the container

· -t tensorflow/serving : Running a Docker container based on the serving image “tensorflow/serving”
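Once the container is running, a quick way to confirm that the model loaded is to query TensorFlow Serving’s model status REST endpoint (an optional check, not part of the original walkthrough; the port and model name match the docker run command above):

# Optional sanity check: ask TensorFlow Serving for the status of "mnist_v1".
import requests

status = requests.get("http://localhost:8501/v1/models/mnist_v1")
print(status.json())  # expect a model_version_status entry with state "AVAILABLE"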

The next step is to write a client that sends a prediction request and gets back predictions.

import requests

# The server URL specifies the endpoint of your server running the MNIST
# model with the name "mnist_v1" and using the predict interface.
SERVER_URL = 'http://localhost:8501/v1/models/mnist_v1:predict'

response = requests.post(SERVER_URL, data=predict_request)
prediction = response.json()['predictions'][0]

where data can take the format:

{
  "instances": [
    { "b64": image }
  ]
}
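Putting the pieces together, here is an illustrative end-to-end client for the simple Keras model exported earlier. It sends raw pixel values in the "instances" list; the "b64" form above is used instead when the exported signature accepts base64-encoded image bytes. The URL, port, and model name are assumed to match the docker run command above.

# Illustrative client: send one 28x28 image to the mnist_v1 model and read
# back the softmax probabilities from the "predictions" field.
import json

import numpy as np
import requests

SERVER_URL = 'http://localhost:8501/v1/models/mnist_v1:predict'

# A dummy 28x28 grayscale image; in practice use a real digit scaled to [0, 1]
# with the same preprocessing applied at training time.
image = np.random.rand(28, 28).astype("float32")

predict_request = json.dumps({"instances": [image.tolist()]})
response = requests.post(SERVER_URL, data=predict_request)
response.raise_for_status()

probabilities = response.json()['predictions'][0]
print('Predicted digit:', int(np.argmax(probabilities)))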

The process of deploying TensorFlow models using TensorFlow Serving and Docker is pretty straightforward. You can even create your own custom Docker image with your model embedded, for even easier deployment.

In the next article, we will see how to improve performance by building an optimized serving binary and by serving with Docker using GPUs.

Masala.AI
The Mindboard Data Science Team explores cutting-edge technologies in innovative ways to provide original solutions, including the Masala.AI product line. Masala provides media content rating services such as vRate, a browser extension that detects and blocks mature content with custom sensitivity settings. The vRate browser extension is available for download via the Chrome Web Store. Check out www.masala.ai for more info.
