Automated Model Retraining with Kubeflow Pipelines

How to implement a reproducible ML workflow that adapts to new data

Karl Weinmeister
kubeflow
6 min read · Jul 9, 2019

As the saying goes, “the only constant is change.” When we build ML models, we are predicting values based on data collected at a certain point in time. Over time, the pattern that the model sees may become fuzzier as conditions change.

One way to handle this situation is to retrain the model when the data changes significantly. That helps, but manual retraining adds overhead, happens at arbitrary times, and is easy to forget. Automated model retraining ensures that models are retrained with the latest available data, on a given schedule or when other conditions are met.

We’ll explore how to implement automated model retraining using the Pipelines component of Kubeflow. Kubeflow is an ML platform that runs on the popular Kubernetes container orchestration system.

About the Scenario

We will use a sample that demonstrates how to create a simple Kubeflow pipeline using standard AI Platform components.

The model predicts the number of crime reports expected today, given the daily totals from the previous week. It uses Chicago Crime Data that is publicly available in BigQuery and includes crime reports in Chicago from 2001 to the present.

Exploring the Model

To get started, you will need a Kubeflow cluster. You can set one up on Google Cloud Platform (GCP) using a deployment app, and the Kubeflow website has instructions for many other environments, including a local one. You will also need to clone the GitHub repository that contains the sample.

Once you have an environment set up, open the Kubeflow homepage on your cluster and go to the Notebooks page. From there, create a notebook server and upload both of the sample's notebooks to it.

The research notebook illustrates how the model works. The model uses a type of Recurrent Neural Network (RNN) called a Long Short-Term Memory (LSTM) network. Recurrent neural networks make predictions from sequences of input data, and the LSTM architecture adds a “memory” component that can improve model performance. The model is implemented in TensorFlow using the Keras LSTM class.
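To make that concrete, here is a minimal sketch of the two moving parts, not the notebook's exact code: a helper that turns daily totals into seven-day input windows, and a small Keras LSTM model. The window length, layer size, and loss are illustrative assumptions.

```python
import numpy as np
import tensorflow as tf

def make_windows(daily_counts, window=7):
    """Turn a 1-D array of daily report counts into (X, y) pairs:
    each X row is the previous `window` days, y is the following day."""
    values = np.asarray(daily_counts, dtype="float32")
    X = np.stack([values[i:i + window] for i in range(len(values) - window)])
    y = values[window:]
    # Keras LSTM layers expect 3-D input: (samples, timesteps, features)
    return X.reshape(-1, window, 1), y

def build_model(window=7):
    # One LSTM layer provides the sequence "memory"; a dense head predicts
    # the next day's count.
    model = tf.keras.Sequential([
        tf.keras.layers.LSTM(32, input_shape=(window, 1)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```

With the daily totals in hand, training is the usual Keras loop: X, y = make_windows(daily_totals), then build_model().fit(X, y, epochs=20).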

The notebook walks through the steps of creating the model, training it, and visualizing the output. In the graph below, the orange line shows the model predictions overlaid against the actual values in blue.

Visualization of historical crime reports along with predicted values

About the Pipeline

Now that we’ve looked at the model, let’s create a pipeline that will download the input data, train the model, and deploy it programmatically.

More specifically, the pipeline will consist of three tasks:

  1. Query historical crime reports from BigQuery and save as a CSV into a Google Cloud Storage bucket.
  2. Train the model using the AI Platform Training service.
  3. Deploy the model to the AI Platform Prediction service.

You can find more information about how Kubeflow Pipelines works in the documentation, but here’s a quick introduction. A pipeline is defined in a Python-based Domain Specific Language (DSL) that describes its steps. The DSL code then needs to be compiled into an intermediate format with the Pipelines SDK, so it can be used by the Kubeflow Pipelines workflow engine.
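As a bare-bones illustration of that workflow (the step and file names here are made up, not the sample's): define a pipeline function in the DSL, then compile it into an archive that can be uploaded or submitted.

```python
from kfp import compiler, dsl

def echo_op(text):
    # A trivial single-container step, just to show the DSL shape.
    return dsl.ContainerOp(
        name="echo",
        image="alpine:3.9",
        command=["echo", text],
    )

@dsl.pipeline(name="Example pipeline", description="Minimal DSL example")
def example_pipeline(text="hello"):
    echo_op(text)

# Compile the Python definition into the archive the workflow engine accepts.
compiler.Compiler().compile(example_pipeline, "example.pipeline.tar.gz")
```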

Each task in the pipeline will be executed in its own Docker container. You can implement each task with custom code or use standard components. In this example, we’ll be using GCP AI Platform components for each task in the pipeline.
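Loading and chaining those components typically looks like the sketch below. The URLs, pipeline inputs, and component arguments are placeholders; check each component's component.yaml in the kubeflow/pipelines repository for its real inputs.

```python
from kfp import components, dsl

# Placeholder URLs: point these at the component.yaml files for the BigQuery
# query, AI Platform Training, and AI Platform Prediction components.
BIGQUERY_COMPONENT_URL = "<component.yaml URL for the BigQuery query component>"
TRAIN_COMPONENT_URL = "<component.yaml URL for the AI Platform Training component>"
DEPLOY_COMPONENT_URL = "<component.yaml URL for the AI Platform Prediction component>"

bigquery_query_op = components.load_component_from_url(BIGQUERY_COMPONENT_URL)
caip_train_op = components.load_component_from_url(TRAIN_COMPONENT_URL)
caip_deploy_op = components.load_component_from_url(DEPLOY_COMPONENT_URL)

@dsl.pipeline(name="Chicago Crime Pipeline",
              description="Query crime data, train the model, deploy it")
def crime_pipeline(project_id="my-project", bucket_name="my-bucket"):
    # Each call creates a task that runs in its own container; the keyword
    # arguments each op accepts come from its component.yaml (hypothetical here).
    query_task = bigquery_query_op(project_id=project_id)
    train_task = caip_train_op(project_id=project_id).after(query_task)
    caip_deploy_op(project_id=project_id).after(train_task)
```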

Creating the Pipeline

To get started, you will need to create a Google Cloud Platform project if you don’t already have one. To store the model assets, you will need to create a Google Cloud Storage bucket in your project.

Let’s now build and execute the pipeline. In the pipeline notebook, start with updating the required parameters in the Constants section to match your environment. You can now run each step in the notebook. The final step will give you a run link to view the pipeline in the Kubeflow Pipelines UI.

Pipeline definition code is compiled into a visual workflow
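Under the hood, the notebook's final step submits the compiled package through the Pipelines SDK client. A minimal sketch, assuming the in-cluster kfp.Client() and illustrative experiment, file, and parameter names:

```python
import kfp

# From a notebook server inside the cluster, kfp.Client() can usually reach
# the Pipelines API without extra configuration.
client = kfp.Client()
experiment = client.create_experiment("chicago-crime")
run = client.run_pipeline(
    experiment_id=experiment.id,
    job_name="chicago-crime-run",
    pipeline_package_path="chicago.pipeline.tar.gz",
    params={"project_id": "my-project", "bucket_name": "my-bucket"},
)
```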

Next, we will add this pipeline to the library of pipelines. The first step is to download chicago.pipeline.tar.gz from the Jupyter notebook server; this file was created when the SDK compiled the pipeline.

From the Pipelines tab in the Kubeflow Pipelines UI, upload this pipeline file and give it a name like “Chicago Crime Pipeline”.

Upload a pipeline to begin the process

Retraining the Model

We will now set up a recurring run with the pipeline we just created. It is also possible to trigger retraining from an external eventing service, for example when the data has significantly changed or grown. The schedule for a recurring run should reflect how quickly the data changes: frequent enough to keep the model current, but not so frequent that retraining is wasted.

Let’s go to the Experiments tab and locate the experiment you created from your pipeline notebook. Click on the experiment, and create a new “Recurring Run”. From this page, you can select the “Chicago Crime Pipeline” and specify the retraining frequency. The periodic option is for interval-based scheduling of runs (e.g. every 2 weeks), while the cron option provides similar functionality but accepts cron semantics for defining the schedule.

The job accepts the general parameters defined in the pipeline. These parameters let you run different permutations of your pipeline, such as different hyperparameters or a different version of TensorFlow, in a way that is consistent and tracked. The example pipeline provides default values for each parameter, so jobs can be started easily.

Set parameters on the pipeline, including how often it runs
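Depending on your SDK version, the same recurring run can also be created programmatically. A hedged sketch, with an illustrative cron expression (the cron format Kubeflow Pipelines accepts includes a leading seconds field) and made-up parameter names:

```python
import kfp

client = kfp.Client()
experiment = client.create_experiment("chicago-crime")

# Retrain every Monday at 09:00. Both the schedule and the parameters
# below are placeholders for your own settings.
client.create_recurring_run(
    experiment_id=experiment.id,
    job_name="chicago-crime-weekly",
    cron_expression="0 0 9 * * 1",
    pipeline_package_path="chicago.pipeline.tar.gz",
    params={"project_id": "my-project", "bucket_name": "my-bucket"},
)
```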

You’ve now created a recurring run! Each time the pipeline is executed, it will deploy a new version of the model to the same location built using a consistent environment.

You can now see details for each run in the Pipelines dashboard. Depending on how you’ve configured your project, you can access model accuracy statistics in the user interface or in TensorBoard for each run.

Each pipeline run can be viewed in the Kubeflow Pipelines user interface
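How those statistics surface depends on how your components report them. As one hedged illustration for KFP v1, a custom step can export metrics to the run UI by writing a small JSON file and exposing it as the mlpipeline-metrics output; the RMSE value here is a placeholder:

```python
from kfp import dsl

def report_metrics_op():
    # Writes a metrics file that the Kubeflow Pipelines UI picks up and
    # displays alongside the run.
    return dsl.ContainerOp(
        name="report-metrics",
        image="python:3.7",
        command=["sh", "-c"],
        arguments=[
            'echo \'{"metrics": [{"name": "rmse", "numberValue": 42.0, '
            '"format": "RAW"}]}\' > /mlpipeline-metrics.json'
        ],
        file_outputs={"mlpipeline-metrics": "/mlpipeline-metrics.json"},
    )
```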

Summary

Our Chicago crime prediction model illustrates a typical scenario that would benefit from model retraining. We showed how to create a three-step pipeline that downloads the latest data, retrains the model, and deploys the model automatically. Finally, we showed how to configure this pipeline to be run on a recurring interval with Kubeflow Pipelines.

I hope this article shows that building a repeatable ML workflow is practical and accessible. While we demonstrated model retraining on GCP, Kubeflow Pipelines can orchestrate any containerized process on any Kubernetes cluster. We hope that Kubeflow can help provide reproducibility and tracking for your ML experiments!

Karl Weinmeister, Head of Product Developer Relations, Google Cloud