Kubeflow v0.5 simplifies model development with enhanced UI and Fairing library

The Kubeflow Product Management Working Group is excited to announce the release of Kubeflow v0.5, which brings significant improvements to users’ model development experience!

New features in Kubeflow 0.5 include:

  • A Go binary, kfctl, to simplify configuring and deploying Kubeflow
  • An improved UI for managing notebooks that makes it easy to:
      • Run multiple notebooks simultaneously
      • Attach volumes to notebooks
  • The Fairing library to build, train, and deploy models from notebooks or your favorite Python IDE

Demo Kubeflow 0.5 by building, training, and deploying an XGBoost model

We thought the best way to illustrate Kubeflow 0.5’s improvements was with a walkthrough demonstrating how to leverage the new notebook enhancements for interactive development of your XGBoost models.

Deploy Kubeflow

First off, we’ll kick off a fresh deployment of Kubeflow v0.5 on Google Kubernetes Engine using the web deployment application available at https://deploy.kubeflow.cloud/. For instructions on deploying onto other platforms, please see Getting Started with Kubeflow.

The screenshot below shows an example deployment form. Note that the “Project” field takes a GCP project ID, and the deployment name is of our choosing. In this example, we opted to use Login with Username/Password and picked a username and password for the deployment to use. Also note that we left the Kubeflow version at the default, v0.5.0. Then we clicked “Create Deployment,” kicking off the deployment of Kubeflow to the project. The deployment takes roughly 10 minutes to be ready after you kick it off. Click on Show Logs to view the progress messages. If you run into errors, please see the detailed instructions for deployment.

Once the deployment is ready, the deployment web app page automatically redirects to the login page of the newly deployed Kubeflow cluster, as shown below.

Create a notebook server in Kubeflow

After logging in with the username and password we chose at deployment, we arrive at the updated Kubeflow Dashboard in v0.5:

Notice the build version displayed at the bottom left of the dashboard. This gives a quick confirmation of the version of Kubeflow deployed in your cluster.

In this demo we’ll focus on notebooks. Clicking on Notebooks in the left nav takes us to the new Notebooks management interface:

This is a new Kubernetes-native web app developed by the Kubeflow community to improve the experience of creating and managing notebook servers in a Kubeflow deployment.

We’ll create a new TensorFlow 1.13 notebook server using one of the pre-configured images in Kubeflow by clicking “New Server” at the top-right.

Now we’ll provide a name for the notebook server (myserver in this example), pick the default kubeflow namespace, and pick one of the standard TensorFlow notebook server images. We chose 1.0 CPU and 5.0Gi of memory. The new UI makes it really easy to create and attach new volumes, as well as existing volumes, to the notebook server. If you have a pre-configured NFS server volume (your admin team might have set one up), you can easily discover and attach the existing volume(s).
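Under the hood, the UI creates a Kubernetes custom resource for the server. A rough sketch of the equivalent manifest follows; the apiVersion and image tag are assumptions based on the 0.5-era notebook-controller, so check your own deployment for the actual values:

```yaml
# Sketch of the Notebook custom resource the UI creates (values assumed).
apiVersion: kubeflow.org/v1alpha1   # assumed 0.5-era API version
kind: Notebook
metadata:
  name: myserver        # the server name entered in the form
  namespace: kubeflow   # the namespace picked in the form
spec:
  template:
    spec:
      containers:
      - name: myserver
        # One of the standard TensorFlow notebook images (tag assumed).
        image: gcr.io/kubeflow-images-public/tensorflow-1.13.1-notebook-cpu:v0.5.0
        resources:
          requests:
            cpu: "1.0"
            memory: 5.0Gi
```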

Once configured, we click “Spawn” and wait for the notebook server to get ready.

At this point, the pod is getting ready and pulling the specified container image. Once ready, the “Connect” button is highlighted on the notebook server, as shown below.

Clicking on “Connect” takes us to the Jupyter notebook interface:

Note that, initially, there are no notebooks or terminals running.

Run an example notebook with Kubeflow Fairing

Fairing is a Kubeflow library that makes it easy to build, train, and deploy your ML training jobs on Kubeflow or Kubernetes, directly from Python code or a Jupyter notebook.

For this example, we’ll run through one of the new Fairing example notebooks. To do that easily, here are the steps we follow:

  1. Create a new terminal.

  2. Clone the fairing repo in the terminal:

     $ bash
     $ git clone https://github.com/kubeflow/fairing

  3. In the terminal, run the following commands:

     $ cd fairing/examples/prediction
     $ pip3 install -r requirements.txt

  4. Switch back to the notebooks view. Notice the fairing directory that now shows up.

  5. Browse to the fairing/examples/prediction directory and click on xgboost-high-level-apis.ipynb.

  6. This opens the notebook in your notebook server.

  7. Study the notebook and run through its cells.

Explore the notebook

The notebook is self-explanatory, and walks us through the development of an XGBoost-based model for a housing price prediction example. It illustrates how Fairing makes it extremely straightforward to develop your model.
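To make the pattern concrete before opening the notebook, here is a deliberately tiny stand-in: a model class with train and predict methods, mirroring the shape of the notebook’s XGBoost example. Everything here (the class name, the synthetic prices, the mean-predictor “model”) is invented for illustration and runs anywhere Python does:

```python
class HousingModel:
    """Toy stand-in for the notebook's XGBoost model class.

    The real example trains an XGBoost regressor on housing data; here we
    "train" a mean predictor on synthetic prices so the sketch runs anywhere.
    """

    def __init__(self):
        self.mean_price = None

    def train(self, prices):
        # Fit the simplest possible model: predict the average price.
        self.mean_price = sum(prices) / len(prices)

    def predict(self, features):
        # Ignore the features entirely; a real model would use them.
        return self.mean_price


model = HousingModel()
model.train([200_000, 250_000, 300_000])
print(model.predict([1710, 2003, 2]))  # -> 250000.0
```

The point of the class shape is that the same object can first be exercised locally in the notebook, then handed to Fairing for remote training and deployment.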

Here are some of the core features to note:

  1. You can iterate on and develop your model within the notebook container, training it with a subset of the data on the notebook server running in Kubeflow.
  2. When you’re ready to train a full-scale model, Fairing enables you to easily switch to a Kubeflow backend configuration to kick off a training job as a separate set of containers within the Kubeflow deployment. You can use this to train a single-node XGBoost model or a distributed TensorFlow job. With a single Python call to train, Fairing takes care of the following, transparently to the user:
    • The Fairing library automatically extracts your training code.
    • It builds a Docker container image automatically without needing to write a Docker configuration file.
    • Once the updated docker image is ready, it kicks off a training job on the Kubeflow cluster.
  3. The notebook also illustrates how a trained model can be easily deployed as a service in the Kubeflow cluster with a single Python call. This leverages Seldon to wrap the Python model into a container image for a Flask application that exposes the prediction endpoint.
  4. Finally, you can easily make predictions from within the notebook against the model just deployed in the previous step.
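Putting these features together, the high-level flow in the notebook looks roughly like the following pseudocode. The class names and arguments are paraphrased from the example and should be treated as a sketch rather than the exact Fairing API; the notebook itself is the source of truth:

```
# 1. Iterate locally: instantiate your model class and train on a data subset.
model = HousingServe()
model.train()

# 2. Full-scale training: hand the same class to a Fairing training job
#    targeting a Kubeflow backend; Fairing packages the code into a Docker
#    image and submits the job to the cluster.
train_job = TrainJob(HousingServe, docker_registry=DOCKER_REGISTRY,
                     backend=KubeflowGKEBackend())
train_job.submit()

# 3. Deploy: expose the trained model as a Seldon-backed prediction service.
endpoint = PredictionEndpoint(HousingServe, docker_registry=DOCKER_REGISTRY,
                              backend=KubeflowGKEBackend())
endpoint.create()

# 4. Predict against the deployed endpoint from within the notebook.
endpoint.predict(test_data)
```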

We hope it’s now clear how Fairing allows users to work through the entire build/train/deploy lifecycle of a model from within a Jupyter notebook. If you have any feedback on this tutorial, or if something didn’t work as expected, please let us know by filing an issue in the Fairing repo.
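For reference, prediction requests to a Seldon-wrapped model are plain JSON over HTTP. A minimal helper for building the request body, assuming Seldon’s default ndarray payload convention (the feature values below are made up for illustration):

```python
import json


def seldon_request(features):
    """Build a Seldon-style JSON request body for one prediction.

    Seldon's default protocol wraps inputs as {"data": {"ndarray": [...]}}.
    """
    return json.dumps({"data": {"ndarray": [features]}})


# Hypothetical housing features; a real client would POST this body
# to the deployed model's prediction endpoint.
body = seldon_request([1710, 2003, 2, 548])
print(body)
```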

More Details on Kubeflow v0.5: TFJob, PyTorch, Katib

In v0.5, UI improvements include a new Central Dashboard and new sidebar navigation, which simplify user workflows and make it easier to access important functions. v0.5 also brings API and integration improvements to TFJob and PyTorch, which deliver several low-level operational benefits:

  • Support Status subresource in CRD (#927, #924)
  • Add ActiveDeadlineSeconds and BackoffLimit (#550)
  • Use pod group instead of PDB for gang scheduling (#916)
  • Supporting multiple versions of CRD (#932)

v0.5 also includes valuable operational updates and improvements to hyper-parameter tuning in Katib:

  • Katib status should return optimal parameter values (#356)
  • An end-to-end test (#1946)
  • Make Katib generic for operator support (#341)
  • Removing Operator specific handling during a StudyJob run (#387)
  • Katib v1alpha2 API for CRDs (#381)
  • Katib job status should contain all conditions (#344)

You can see everything included in this release in the Kubeflow CHANGELOG.

What’s next

With this release under our belt, the community is starting to plan for the 0.6 release. Kubeflow v0.5 lays the groundwork for multi-user isolation by leveraging Istio and Kubernetes namespaces: the multi-user functionality provides a new “Profiles” K8s Custom Resource, enabling dynamic per-user creation of namespaces so that each user runs isolated by default. We anticipate that the next release will provide a friendly application to enable self-service (or administered) creation of user profiles. Two other initiatives important for the next release are replacing ksonnet and preparing for Kubeflow 1.0.

For more information on what we’re working on, take a look at our Multi-User Critical User Journey (CUJ) and our roadmap for a stable and enterprise-ready Kubeflow 1.0.

Community-driven development

We put a lot of work into improving Kubeflow’s stability, fit, and finish across 150+ closed issues and 250+ merged PRs. For this release, the community gathered extensive end-user input to inform our roadmap priorities and project board. We aggregated feedback from our recent Contributor Summit, the Kubeflow User Survey, and several reviews of the Critical User Journeys (CUJs) we use to define the core experiences to build for each release. We’re happy with how this process has enabled community-driven development in Kubeflow and helped us prioritize work that brings users value. We look forward to building on this process in the 0.6 release!

Finally, thanks to all who contributed to v0.5! Kubeflow is home to 100+ contributors from 20+ organizations working together to build a Kubernetes-native, portable and scalable ML stack, and we need even more help. Here’s how to get involved:

Thanks to Josh Bottum (Arrikto), Abhishek Gupta (Google), and Karthik Ramasamy (Google) for contributing to this post.