Kubeflow Serving: Serve your TensorFlow ML models with CPU and GPU using Kubeflow on Kubernetes

Ferdous Shourove
intelligentmachines
8 min read · Jan 1, 2020


Kubeflow Serving gives you a very easy and straightforward way of serving your TensorFlow model on Kubernetes using both CPU and GPU. The Kubeflow documentation has the necessary configuration files to get you started, but it's not very beginner-friendly. There are a few things you need to take care of before you can just CTRL C + CTRL V the yaml configurations and expect them to work. In this article I will explain the necessary steps and tweaks to make sure what's written in the docs actually gives you results.

In this article we will be using one of the pre-trained models from the TensorFlow Object Detection API model zoo so that readers can follow along and get each step working. We will be using the ssd_resnet_50_fpn_coco model for serving on your Kubeflow cluster.

If you want to serve your custom trained model, you can read along as well. You will just need to execute two extra steps:

  • Export the target checkpoint.
  • Rename the exported saved_model directory to a numeric value.

Don’t worry about these. I will explain them in more detail later in the article.

If you are not interested in serving your model using Kubeflow Serving and are just looking for a way to serve your recently trained TensorFlow model locally and on Kubernetes, then dive right into this article. It’s a great article in which the author, François Paupier, explains the basics of serving a TensorFlow model on your local machine and also on Kubernetes (without Kubeflow Serving).

This article is for those who are specifically interested in serving with Kubeflow TensorFlow Serving.

TL;DR (Too Long; Didn’t Read)

Step 1:

Download the ssd_resnet_50_fpn_coco model from the TensorFlow Object Detection API model zoo.

Step 2:

Inside the model directory, rename the saved_model directory to a numeric value. We will rename the saved_model directory to 00001.

Step 3:

Now upload the directory named 00001 to a Google Cloud Storage (GCS) bucket and get the GCS link.

Step 4:

Update the model_base_path property in the yaml files given below with the GCS path.

Step 5:

Deploy the yaml configuration files on the Kubeflow cluster on Kubernetes.

Step 6:

Test out the serving by sending a prediction request to the REST endpoint of your serving.

Prerequisites

Kubeflow Cluster Setup

Before you can follow along, you need to set up a Kubeflow cluster on GCP using this link. It guides you through setting up Kubeflow on Google Cloud Platform within a few minutes.

When you set up Kubeflow using the GCP UI, the service account required to access Google Cloud Storage is created during Kubeflow setup and stored as a Kubernetes secret with the name user-gcp-sa. If you have set up Kubeflow using the CLI, then you have to create the service account and store the json key file as a Kubernetes secret yourself. To check if you have a secret named user-gcp-sa in your Kubernetes cluster, use the following command

kubectl get secrets -n kubeflow

Or using Google Cloud Console UI

Google’s Cloud Console > Kubernetes Engine > Configuration

and search for a secret named user-gcp-sa.
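If the secret is missing (for example, after a CLI-based setup), you can create it yourself from a downloaded service account key. A minimal sketch, assuming the key file is called user-gcp-sa.json and the serving resources live in the kubeflow namespace; the service account and project names are placeholders:

# Create a JSON key for the service account (names are placeholders)
gcloud iam service-accounts keys create user-gcp-sa.json \
  --iam-account <SERVICE_ACCOUNT_NAME>@<PROJECT_ID>.iam.gserviceaccount.com

# Store it as the Kubernetes secret the serving Deployment expects
kubectl create secret generic user-gcp-sa \
  --from-file=user-gcp-sa.json=user-gcp-sa.json \
  -n kubeflow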

Environment Setup

If you have set up your Kubeflow cluster using the GCP UI, then you are ready to try out Kubeflow CPU serving immediately.

To try GPU serving, you will need to set up a node pool with GPUs. For that, you need to follow two steps.

Step 1: Add GPU Node Pool

You can set up a GPU node pool using the command below.

create-gpu-node-pool.sh
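The script above isn't reproduced here, but a minimal sketch of such a command looks like the following. The pool name, machine type, GPU type, and GPU count are assumptions; pick whatever matches your quota, zone, and workload:

# Create a node pool with one NVIDIA GPU per node (values are placeholders)
gcloud container node-pools create gpu-pool \
  --cluster <CLUSTER_NAME> \
  --zone us-central1-a \
  --num-nodes 1 \
  --machine-type n1-standard-4 \
  --accelerator type=nvidia-tesla-k80,count=1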

Or you can add a GPU node pool from Google Cloud Platform(GCP) console UI.

Add GPU node pool from GCP console UI

You can check the available node pools in your Kubernetes cluster using the command

gcloud container node-pools list --zone us-central1-a --cluster <CLUSTER_NAME>
List the available node pools in your cluster.

Step 2: Installing NVIDIA GPU Device Drivers

Adding a GPU node pool alone isn’t enough. According to Google Cloud’s documentation:

After adding GPU nodes to your cluster, you need to install NVIDIA’s device drivers to the nodes. Google provides a DaemonSet that automatically installs the drivers for you.

To deploy the installation DaemonSet, run the following command:

kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
The DaemonSet that automatically installs the NVIDIA GPU drivers for your cluster.

After you run the command, a DaemonSet will be installed in your Kubeflow cluster. This DaemonSet installs the necessary drivers so that your Pods or Deployments can use the GPUs.
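Before moving on, you can check that the driver installer is actually running. The DaemonSet created by that manifest lives in the kube-system namespace, so listing the DaemonSets there and looking for the NVIDIA driver installer entry is enough:

kubectl get daemonsets -n kube-system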

And now we are finally ready for…

Serving The Model with Kubeflow

Now we have everything we need to serve a TensorFlow model. Just follow the steps below.

Step 1: Download a Pre-Trained Model or Get Your Custom Model

You can download any one of the models from the TensorFlow Object Detection API model zoo. In this article we will be using the ssd_resnet_50_fpn_coco model for serving. Click this link to download the model, or use the following command

wget http://download.tensorflow.org/models/object_detection/ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03.tar.gz

The downloaded model has a folder inside named saved_model that contains the exported model, which you can use directly for serving.

If you are planning to use your own trained model, then you need to export your model before you can serve it. You can follow the documentation in the TensorFlow Object Detection API’s GitHub repo for the steps needed to export a model.

Step 2: Rename the saved_model Directory

When you download the model and extract the archive, you will find it contains the following files:

Files and folders inside the downloaded pre-trained model.
  • checkpoint: a file specifying how to restore the included checkpoint files.
  • frozen_inference_graph.pb: the frozen graph format of the exported model.
  • model.ckpt.*: the model checkpoints used for exporting.
  • pipeline.config: pipeline config file for the exported model.
  • saved_model/: a directory containing the saved model format of the exported model.
  • saved_model/saved_model.pb: the pre-trained model’s inference graph.

The saved_model folder contains the exported model, ready to serve with the tensorflow/serving or tensorflow/serving:latest-gpu Docker images. But both of these Docker images expect the saved_model.pb file to be stored in a folder with a numeric name. So we will rename the saved_model folder to 00001.

Rename the saved_model folder to 00001
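Concretely, the extract-and-rename step looks something like this (the archive and directory names match the wget command from Step 1):

# Extract the downloaded archive and enter the model directory
tar -xzf ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03.tar.gz
cd ssd_resnet50_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03

# TensorFlow Serving expects a numeric version directory
mv saved_model 00001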

Step 3: Upload the 00001 Directory to GCS

We need to upload the newly renamed 00001 directory to a Google Cloud Storage Bucket so that Kubeflow Serving can access it. Use the following command to upload the directory to GCS

gsutil -m cp -r 00001 gs://<BUCKET_NAME>
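If you don't have a bucket yet, you can create one first and then verify the upload. A small sketch; the bucket name and region are placeholders:

# Create the bucket (skip if it already exists)
gsutil mb -l us-central1 gs://<BUCKET_NAME>

# Confirm the model version directory landed in the bucket
gsutil ls gs://<BUCKET_NAME>/00001/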

Step 4: Update the Config Files

The Kubeflow documentation provides all the necessary yaml configuration files for you to serve your saved model, but I had to do a little bit of tweaking to get it running. I made the following changes to the yaml configuration files given in the Kubeflow documentation:

  1. Changed the service type from ClusterIP to LoadBalancer.
  2. Updated the model_base_path property with gs://<BUCKET_NAME>.
  3. Added replicas and selector in the Deployment spec section, as shown in the snippet below. At the time of writing this article I made a pull request on the Kubeflow website to add selector to the Deployment spec. The PR has been approved, so now you can find the selector tag in the Kubeflow docs as well.
spec:
  selector:
    matchLabels:
      app: mnist
  template:
    ...

4. In the Deployment configuration, I have also commented out the volume and volumeMount named config-volume. The initial setup of the Kubeflow cluster doesn’t contain any volume named config-volume, so when I tried to deploy the serving, it always got stuck in the ContainerCreating state without giving any error. Commenting out the config-volume volume and volumeMount got it working properly.

CPU Serving

After making these four changes, the final yaml file for CPU serving turns out like this:

kubeflow-tensorflow-cpu-serving.yaml
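The gist itself isn't embedded in this text, so here is a sketch of what that file looks like, reconstructed from the Kubeflow docs' mnist serving example plus the four tweaks above. Resource names, labels, and the credential mount path are assumptions and may differ slightly from the author's gist:

apiVersion: v1
kind: Service
metadata:
  name: mnist-service
  namespace: kubeflow
  labels:
    app: mnist
spec:
  type: LoadBalancer          # change 1: was ClusterIP
  ports:
  - name: grpc-tf-serving
    port: 9000
    targetPort: 9000
  - name: http-tf-serving
    port: 8500
    targetPort: 8500
  selector:
    app: mnist
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mnist
  namespace: kubeflow
  labels:
    app: mnist
spec:
  replicas: 1                 # change 3: replicas and selector added
  selector:
    matchLabels:
      app: mnist
  template:
    metadata:
      labels:
        app: mnist
    spec:
      containers:
      - name: mnist
        image: tensorflow/serving
        args:
        - --port=9000
        - --rest_api_port=8500
        - --model_name=mnist
        - --model_base_path=gs://<BUCKET_NAME>   # change 2: your GCS bucket
        ports:
        - containerPort: 9000
        - containerPort: 8500
        env:
        - name: GOOGLE_APPLICATION_CREDENTIALS
          value: /secret/gcp-credentials/user-gcp-sa.json
        volumeMounts:
        - name: gcp-credentials
          mountPath: /secret/gcp-credentials
          readOnly: true
        # change 4: the config-volume volume/volumeMount from the docs is omitted here
      volumes:
      - name: gcp-credentials
        secret:
          secretName: user-gcp-sa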

GPU Serving

For GPU serving we need to make one additional change to the Deployment configuration. We need to update the image name to the following.

image: tensorflow/serving:latest-gpu

After making this change, the final yaml file for GPU serving turns out like this:

kubeflow-tensorflow-gpu-serving.yaml
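Again, only a sketch of the relevant container section is shown here. Besides the image swap, GPU workloads on GKE normally also need to request the GPU resource from Kubernetes so the Pod lands on the GPU node pool; the resources block below reflects that and is an assumption on my part, not something taken from the author's gist:

      containers:
      - name: mnist
        image: tensorflow/serving:latest-gpu
        resources:
          limits:
            nvidia.com/gpu: 1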

Step 5: Deploy these yaml Files on Kubeflow Cluster

Now you can simply deploy these resources using the following command

# CPU serving
kubectl apply -f kubeflow-tensorflow-cpu-serving.yaml
# GPU Serving
kubectl apply -f kubeflow-tensorflow-gpu-serving.yaml

It will take a few minutes for all these components to get deployed on the Kubeflow cluster, so wait a bit before you start sending prediction requests.
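You can keep an eye on the rollout while you wait, for example with the command below. The label app=mnist matches the Deployment sketch above; adjust it if your labels differ:

kubectl get pods -n kubeflow -l app=mnist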

Step 6: Sending Prediction Request

Let’s test out our model serving by sending a prediction request. But before we can send one, we need to know where to send it. In the yaml configuration files above, we used the service type LoadBalancer, which gives us an external IP address to send our prediction requests to. You can get the external IP using the following command:

kubectl get svc mnist-service --namespace kubeflow
Getting the external IP address of the serving service.

So the URL to send prediction requests to will be

http://EXTERNAL_IP:8500/v1/models/mnist:predict

You need to explicitly pass the namespace flag if you are not in the same namespace where these resources were deployed. If you don’t specify the namespace, the command above uses the default namespace and will give you an error message like

couldn’t find the service mnist-service.

Now that we have the IP address, we can send out our requests for prediction. The service exposes two different ports.

  • PORT 8500: This port is TensorFlow Serving’s REST API port.
  • PORT 9000: This port is TensorFlow Serving’s gRPC API port.

We will be using the REST API PORT 8500 for sending prediction requests.
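For reference, a raw REST call to that endpoint has roughly the following shape. Here payload.json is a hypothetical file holding the encoded image under TensorFlow Serving's "instances" key, similar to the JSON body that the client script described below sends for you:

curl -X POST -d @payload.json \
  http://EXTERNAL_IP:8500/v1/models/mnist:predict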

We are going to use a Python script from a public GitHub repository created by François Paupier to send prediction requests. The repository is simple, self-explanatory, and well documented. If you follow the steps as-is, you will be sending prediction requests in no time.

But to make sending prediction requests even simpler, I have dockerized the Python script. You can use the dockerized version to send prediction requests without the hassle of setting up loads of packages and environment variables.

You can build the Docker image from the Dockerfile yourself, but I have already built it and made it available in a public Google Container Registry, which you can use directly.

asia.gcr.io/im-mlpipeline/tensorflow-serving-sidecar-client:latest

You can use the following command to send prediction requests.

tensorflow-serving-prediction-request-client.sh

Update the export variables and execute the command. Within a few seconds you will see logs similar to the following.

Logs after successfully sending a prediction request.

If you see output like this, give me a knock and I will pat myself on the back for guiding you through all of this successfully.

Just Kidding…! :P
