Taming Tensorflow Serving with Kubernetes for Dynamic Model Deployment
This blog post will focus on Spoke’s Tensorflow Serving (henceforth referred to as TFServing) setup to automatically serve customized models for new clients. Our previous blog post provided details on Spoke’s architecture and philosophy of deploying customized ML models for clients. To support Spoke’s use case, we combine TFServing and Kubernetes to dynamically serve new client models automatically.
The key takeaway of this blog post is how you can use Kubernetes to serve a dynamic list of models in TFServing. We will show how you can support event-driven (e.g. a new client using your product) automatic serving of new models in production within 24 hours with no downtime.
We will not go into the details of TFServing's architecture, which can be found here. Here is a pretty good blog post that covers loading a single Tensorflow model in TFServing and querying it for predictions.
Why do we use Tensorflow Serving?
Our goal is to serve multiple machine learning models in production; in fact, the number of models we serve scales with the number of customers. Every time a customer joins us, we need to create new ML models to serve them.
Consequently, we need a serving infrastructure that scales to a large number of models while being highly available. Apart from TFServing, we considered a few choices for serving:
- Create a simple python server (using Flask, Django, etc.) or node server that can wrap a model and serve its predictions.
Pros: easy to serve once models to deploy are given; cheap
Cons: we would have to implement our own model version management; there is no easy way to ensure zero downtime while new models are being loaded
- Use a managed service like h2o.ai, Amazon Sagemaker, or Google Cloud ML.
Pros: easy to use; model management is taken care of
Cons: unlikely to allow serving a large number of dynamically changing ML models (e.g. Google Cloud ML blocked us at 100 models per project, so a given project could only serve 100 customers)
Finally, we picked TFServing for a variety of reasons:
- Model version management: TFServing takes care of model version management. You only have to point it to the base directory for any given model and it picks up the latest version automatically. TFServing also allows serving multiple versions of the same model or going back to older versions.
- Highly Available: For any model, TFServing ensures that the old version keeps serving predictions while a new version is being loaded. This ensures that clients making inference calls face zero downtime.
- Performance: The serving code is highly efficient and written in C++. It’s blazing fast and can handle thousands of requests per minute (RPM).
- Size: It has very little overhead. E.g. for a model with 100k parameters, TFServing takes roughly 400 KB of memory, which is about the size of the parameters themselves (100k 4-byte floats ~ 400 KB).
- Well maintained: The library has the backing of one of the best engineering organizations in the world. Given Google is putting its weight behind Tensorflow, it is likely that this library will be continually maintained.
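To make the version management point concrete: TFServing expects each model's base directory to contain one numbered subdirectory per version, and by default it serves the highest number. The sketch below only mimics that selection rule in Python for illustration. It is not TFServing's own code, and the function name `latest_version_dir` is hypothetical.

```python
import os


def latest_version_dir(model_base_path):
    """Return the path of the highest-numbered version subdirectory, or None.

    Illustrative only: mirrors the "serve the latest version" rule by
    treating every all-digit subdirectory name as a version number.
    """
    version_names = [
        name
        for name in os.listdir(model_base_path)
        if name.isdigit() and os.path.isdir(os.path.join(model_base_path, name))
    ]
    if not version_names:
        return None
    # Compare numerically so that "10" ranks above "9".
    newest = max(version_names, key=int)
    return os.path.join(model_base_path, newest)
```

With a base directory containing `1/`, `2/` and `10/`, the function picks `10/`, which is why exporting a retrained model into a new, higher-numbered directory is enough for it to be picked up.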
Loading new models dynamically
We will now describe our setup and the techniques for combining Kubernetes and TFServing to automatically serve new models in production.
Adding New Models
When a new client signs up for Spoke, we create a custom model for this client. Our model trainer is constantly scheduling training jobs for each of our clients. At the end of every run, it creates a new model config file listing all the models eligible to be served. The diagram below gives a high-level overview of our model training and serving architecture.
Given this, if we had a TF Server that started serving the models in the latest config every n hours, our problem would be solved. This is where the out-of-the-box version of TFServing does not work for us: it can only pick up new versions of the models in the static list it was initialized with, and thus cannot start serving new models dynamically. We solved the problem of adding models dynamically to TFServing in a simple yet elegant way using Kubernetes.
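For readers who have not seen one, a TFServing model config file is a `model_config_list` in protobuf text format with one `config` entry per model. The sketch below shows how a trainer could render such a file from a mapping of model names to base paths. The function name, the client names and the paths are all made up for illustration; this is not Spoke's actual trainer code.

```python
def render_model_config(models):
    """Render a TFServing model_config_list in protobuf text format.

    `models` maps a model name to the base path holding its version
    directories. Names and paths are supplied by the caller.
    """
    entries = []
    for name, base_path in sorted(models.items()):
        entries.append(
            "  config {\n"
            f'    name: "{name}"\n'
            f'    base_path: "{base_path}"\n'
            '    model_platform: "tensorflow"\n'
            "  }\n"
        )
    return "model_config_list {\n" + "".join(entries) + "}\n"


# Hypothetical clients and storage locations, for illustration only.
example_config = render_model_config(
    {
        "client_a": "/models/client_a",
        "client_b": "/models/client_b",
    }
)
```

Regenerating the whole file on every training run keeps the trainer stateless: the file always reflects the full list of models that should be served, and a new client shows up simply as one more `config` entry.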
Kubernetes to Reload TF Server
We built a setup where Kubernetes restarts our TF Server pod every 24 hours. Kubernetes has no built-in way to restart a pod every n hours. Jobs that run to completion come close to what we need, but they do not work for us because they do not satisfy the “no downtime” constraint. We found a simple way to solve this: we attach a delayed-restart container to each TF Server, which restarts the TF Server pods on a schedule. The delayed-restart container runs a simple script that restarts all TF Server pods.
Every Kubernetes pod has a service account which has access to the cluster. This makes things very easy. We issue a patch command which changes the value of an environment variable in the tf-server deployment. This forces new tf-server pod(s) to be brought up.
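The original script is not reproduced here, so the following is only a sketch of what a delayed-restart script along these lines could look like. It assumes `kubectl` is available in the container, that the pod's service account is allowed to patch deployments, and that the deployment uses a rolling update so old pods keep serving until new ones are ready. The environment variable name `LAST_RESTART` and the function names are hypothetical.

```python
import json
import subprocess
import time


def build_restart_patch(container_name, env_name, value):
    """Build a strategic merge patch that sets one env var in the pod template.

    Changing the pod template makes the deployment roll out new pods.
    """
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {
                            "name": container_name,
                            "env": [{"name": env_name, "value": value}],
                        }
                    ]
                }
            }
        }
    }


def kubectl_patch_argv(deployment, patch):
    """Return the kubectl command line that applies the patch."""
    return ["kubectl", "patch", "deployment", deployment, "-p", json.dumps(patch)]


def restart_after(deployment, container_name, delay_seconds):
    """Sleep for the delay, then patch the deployment to trigger a rollout."""
    time.sleep(delay_seconds)
    # Use the current timestamp so the value differs on every run.
    patch = build_restart_patch(container_name, "LAST_RESTART", str(int(time.time())))
    subprocess.run(kubectl_patch_argv(deployment, patch), check=True)
```

A container running `restart_after("tf-server", "tf-server", 24 * 60 * 60)` would trigger one rollout after 24 hours. Because the replacement pods carry their own delayed-restart container, the timer starts again without any extra scheduling logic.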
When a tf-server pod starts, its first step is to pull the latest config from our model store; it then starts the TFServing code. This results in tf-server serving the latest config with a maximum delay of 24 hours. If you are on GCP, Google recently launched a scheduler which might be able to do the same thing.
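The startup sequence described above can be sketched as a small entrypoint: fetch the config, then hand control to the TFServing binary. In this sketch the model store is stood in for by a plain file copy, since the post does not say how the store is accessed, and the function names and port are assumptions. The `--port` and `--model_config_file` flags are standard `tensorflow_model_server` flags.

```python
import os
import shutil


def fetch_latest_config(model_store_path, local_path):
    """Copy the latest model config to a local path.

    Stand-in for a real model store client (assumption: the store is
    reachable as a file path; swap in your own download logic).
    """
    shutil.copyfile(model_store_path, local_path)
    return local_path


def model_server_argv(config_path, grpc_port=8500):
    """Return the command line that starts TFServing with the given config."""
    return [
        "tensorflow_model_server",
        f"--port={grpc_port}",
        f"--model_config_file={config_path}",
    ]


def start_server(model_store_path, local_path):
    """Pull the config, then replace this process with the model server."""
    config_path = fetch_latest_config(model_store_path, local_path)
    argv = model_server_argv(config_path)
    os.execvp(argv[0], argv)
```

Replacing the process with `os.execvp` rather than spawning a child keeps the model server as the container's main process, so Kubernetes sees its exit status and signals directly.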
As detailed in this paper and the figure below, actual model training is only a small part of machine learning in production.
DevOps technologies for ML systems will become more important as the use of ML becomes more pervasive. Our technique of using Kubernetes for supporting TF Serving models is a great example of this.