Kubeflow 0.7: Beta functionality in advance of v1.0
We’re excited to announce Kubeflow 0.7, which is enabling industry leaders to build and scale their machine learning workflows. This release delivers beta functionality for many components as we move towards Kubeflow 1.0. With 0.7, we are also delivering an initial release of KFServing to handle inference and complete the build, train, deploy user journey.
We’re happy to see users already benefiting from Kubeflow 0.7:
“Kubeflow v0.7 provides valuable enhancements that simplify our ML operations, especially on hyperparameter tuning with the new suggestion CR.” — Jeremie Vallee, AI Infrastructure Engineer, BabylonHealth.
“We appreciate the Kubeflow Community’s on-going efforts to mature the project, and v0.7 enables us to deliver a wide range of ML models to our diverse customers and their analytical use cases.” — Tim Kelton, Co-Founder, Descartes Labs, Inc.
Read on to find out what’s new, what’s changed, and what’s coming.
What’s new in v0.7
Kubeflow 0.7 adds significant enhancements in KFServing 0.2, Kubeflow’s model serving component, now officially in alpha.
Model serving and management via KFServing
KFServing enables serverless inferencing on Kubernetes and provides performant, high-abstraction interfaces for common ML frameworks like TensorFlow, XGBoost, scikit-learn, PyTorch, and ONNX to solve production model serving use cases.
KFServing:
- Provides a Kubernetes Custom Resource Definition (CRD) for serving ML models on arbitrary frameworks.
- Encapsulates the complexity of autoscaling, networking, health checking, and server configuration to bring cutting-edge serving features like GPU autoscaling, scale to zero, and canary rollouts to your ML deployments.
- Enables a simple, pluggable, and complete story for your production ML inference server by providing prediction, pre-processing, post-processing and explainability out of the box.
- Is evolving with strong community contributions, and is led by a diverse maintainer group represented by Google, IBM, Microsoft, Seldon, and Bloomberg.
Please browse through the KFServing GitHub repo and give us feedback!
Get Started
Once you have Kubeflow 0.7 deployed, you can easily create a KFServing resource to deploy a model. First, create a YAML file model.yaml with the following spec:
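The spec below is a minimal sketch modeled on the scikit-learn iris sample from the KFServing repo; the v1alpha2 API version and the sample storageUri are taken from that sample, and you would substitute your own model URI:
apiVersion: "serving.kubeflow.org/v1alpha2"
kind: "InferenceService"
metadata:
  name: "sklearn-iris"
spec:
  default:
    predictor:
      sklearn:
        # Model artifacts are loaded from this bucket at startup; replace with your model
        storageUri: "gs://kfserving-samples/models/sklearn/iris"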
And then deploy it:
kubectl apply -f model.yaml
KFServing also offers a breadth of examples showing usage for different frameworks and use cases; check them out in the samples directory of the KFServing GitHub repo.
What’s changed in v0.7
Kubeflow 0.7 features several improvements to major components, including kfctl, monitoring, notebooks, operators, metadata, Katib, pipelines, role-based access control (RBAC), Kubeflow SDKs, and documentation.
Kfctl
v0.7 includes several improvements to kfctl, Kubeflow’s command line interface for building and deploying Kubeflow clusters:
- A simplified syntax to build and deploy Kubeflow configurations, for example, to install on an existing Kubernetes cluster:
mkdir ${KFNAME}
cd ${KFNAME}
CONFIG=https://raw.githubusercontent.com/kubeflow/manifests/v0.7-branch/kfdef/kfctl_existing_arrikto.0.7.0.yaml
kfctl apply -V -f ${CONFIG}
- A v1beta1 KFDef API in preparation for Kubeflow 1.0. This API allows users to fully describe their Kubeflow deployments declaratively using a Kubernetes-style resource.
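For illustration, a heavily trimmed KfDef might look like the sketch below; the repo URI and application entry mirror the v0.7 manifests layout, but treat the exact fields as an approximation and consult kubeflow/manifests for real configs:
apiVersion: kfdef.apps.kubeflow.org/v1beta1
kind: KfDef
metadata:
  name: my-kubeflow
  namespace: kubeflow
spec:
  repos:
  - name: manifests
    uri: https://github.com/kubeflow/manifests/archive/v0.7-branch.tar.gz
  applications:
  # Each application points at a kustomize package in the manifests repo
  - name: jupyter-web-app
    kustomizeConfig:
      repoRef:
        name: manifests
        path: jupyter/jupyter-web-app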
Kubeflow Applications and Monitoring
All Kubeflow applications now have application resources defined, making it easy to see which version of which applications are installed:
kubectl -n kubeflow get applications
NAME AGE
api-service 26h
argo 26h
bootstrap 26h
centraldashboard 26h
cloud-endpoints 26h
gpu-driver 26h
jupyter-web-app 26h
katib 26h
kfserving 26h
kfserving-crds 26h
kubeflow 26h
minio 26h
mysql 26h
notebook-controller 26h
persistent-agent 26h
pipeline-visualization-service 26h
pipelines-runner 26h
pipelines-ui 26h
pipelines-viewer 26h
profiles 26h
pytorch-job-crds 26h
pytorch-operator 26h
scheduledworkflow 26h
seldon-core-operator 26h
spartakus 26h
tf-job-crds 26h
tf-job-operator 26h
webhook 26h
See application details by describing an application resource.
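For example, to inspect the notebook controller (this uses plain kubectl describe against the Application CRD installed by Kubeflow):
kubectl -n kubeflow describe application notebook-controller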
Name:         notebook-controller
Namespace:    kubeflow
Labels:       app.kubernetes.io/component=notebook-controller
              app.kubernetes.io/instance=notebook-controller-v0.7.0
              app.kubernetes.io/managed-by=kfctl
              app.kubernetes.io/name=notebook-controller
              app.kubernetes.io/part-of=kubeflow
              app.kubernetes.io/version=v0.7.0
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"app.k8s.io/v1beta1","kind":"Application","metadata":{"annotations":{},"labels":{"app.kubernetes.io/component":"notebook-con...
API Version:  app.k8s.io/v1beta1
Kind:         Application
Metadata:
  Creation Timestamp:  2019-10-30T23:36:45Z
  Generation:          12456
  Resource Version:    7954811
  Self Link:           /apis/app.k8s.io/v1beta1/namespaces/kubeflow/applications/notebook-controller
  UID:                 223c7bd6-fb6e-11e9-8802-00505685d108
Spec:
  Add Owner Ref:  true
  Component Kinds:
    Group:  core
    Kind:   Service
    Group:  apps
    Kind:   Deployment
    Group:  core
    Kind:   ServiceAccount
  Descriptor:
    Description:  Notebooks controller allows users to create a custom resource "Notebook" (jupyter notebook).
    Keywords:
      jupyter
      notebook
      notebook-controller
      jupyterhub
    Links:
      Description:  About
      URL:          https://github.com/kubeflow/kubeflow/tree/master/components/notebook-controller
    Maintainers:
      Email:  lunkai@google.com
      Name:   Lun-kai Hsu
    Owners:
      Email:  lunkai@google.com
      Name:   Lun-kai Hsu
    Type:     notebook-controller
    Version:  v1beta1
Notebooks
Kubeflow 0.7 adds the foundations for improved operations, scalability, and isolation for Kubeflow notebooks, one of our most popular features. The notebook custom resource has graduated to v1beta1 in preparation for Kubeflow 1.0. We’ve also added and improved notebook images, including TensorFlow 2.0 images with GPU support.
ML Operators
Kubeflow is built on custom resources, which are handled in Kubernetes by automation software known as operators. These operators simplify cluster configuration, performance tuning, and the lifecycles of distributed training jobs. In v0.7, several Kubeflow operators are graduating toward v1.0 status and have been updated to support kustomize manifests.
Hyperparameter Tuning and Metadata
Kubeflow improves MLOps workflows and reduces development costs by simplifying analysis and iteration, especially for parameter configuration. Comparing metadata and hyperparameters in Katib (our native hyperparameter tuning component) and Pipelines speeds up analysis of model parameters and their dependencies, as demonstrated in a recent Kubeflow Summit presentation (video to be released).
The Katib v1alpha3 API includes a new Suggestion CR, which tracks the suggestions generated for each trial of an experiment by the corresponding algorithm service. The system architecture has been redesigned to be more Kubernetes-native, with better support for events. The v1alpha3 API also comes with a new metric collector implementation: users can specify the metric collector type in the experiment config. Currently supported types are StdOutCollector (the default), FileCollector, and TFEventCollector. The internal database implementation has also been refactored to support multiple database backends. This release also supports Prometheus metrics, which include controller runtime metrics and creation/deletion/success/failure counters for experiments and trials.
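As a trimmed sketch of where the collector type is set (field names follow our reading of the v1alpha3 Experiment spec; a real experiment also defines parameters and a trial template, so check the Katib repo for complete examples):
apiVersion: "kubeflow.org/v1alpha3"
kind: Experiment
metadata:
  name: random-example
  namespace: kubeflow
spec:
  # Metric collector type: StdOut (default), File, or TfEvent
  metricsCollectorSpec:
    collector:
      kind: StdOut
  objective:
    type: maximize
    objectiveMetricName: accuracy
  algorithm:
    algorithmName: random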
Kubeflow 0.7 also delivers these improvements to Metadata:
- A logger for automatic logging of resources to metadata.
- An ML Metadata (MLMD) gRPC server in addition to the current backend server for Kubeflow 0.7. (In the long run, we want to use the MLMD server directly instead of maintaining our own backend server.)
- A watcher for Kubernetes resources as a way to extract and log metadata. (This is an initial proof of concept (POC) and won’t be installed with Kubeflow v0.7 by default. We want to further integrate with other Kubeflow CRDs in the future.)
Kubeflow Pipelines
Per the Fall 2019 Kubeflow User Survey, Kubeflow Pipelines is the most popular Kubeflow component, and the community has been busy adding new scale, performance, and functional improvements. Kubeflow 0.7 adds several metadata enhancements, samples and integrations with other ML components to Pipelines:
- A set of performance/scale improvements: more efficient API calls, Cloud SQL support, garbage collection, etc.
- A “Retry” button which enables a user to re-run a failed pipeline from the point of failure.
- File I/O simplification for platform independent data exchange.
- Metadata-driven orchestration for TFX on KFP, which enables logging of pipeline artifacts and executions in the metadata store and enables cache-based execution of pipelines, so that results from previously successful pipeline steps can be reused.
- WithItems support, allowing end users to define a loop over a static set of parameter values in the DSL.
- WithParams support, allowing end users to define a loop over a dynamic set of values, either assigned at runtime or created as the result of a container operation (see the sketch after this list).
- Free-form artifact visualization with Jupyter kernel.
- An Airflow component for running Airflow operations in KFP.
- Built-in MLMD support through a deployed MLMD gRPC server, as well as support in KFP’s UI for displaying pipeline metadata (Artifacts & Executions).
- A standalone KFP deployment so that it’s easier to manage the deployment.
- An enriched sample inventory: AutoML pipeline, Dataflow pipeline, CMLE TPU, and XGBoost samples with the new GCP components, and more.
- A new TFX sample pre-loaded in the KFP UI.
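As a sketch of the static loop support, the KFP Python SDK of this era exposes dsl.ParallelFor, which compiles down to an Argo withItems loop; the image and command below are placeholders:
import kfp
from kfp import dsl

@dsl.pipeline(name="loop-example", description="Fan out over a static list of items")
def loop_pipeline():
    # Each iteration runs one ContainerOp with a single item bound to `item`
    with dsl.ParallelFor([1, 2, 3]) as item:
        dsl.ContainerOp(
            name="echo",
            image="alpine:3.10",        # placeholder image
            command=["echo", item],     # the loop item is injected at runtime
        )

if __name__ == "__main__":
    # Compile to an Argo workflow that can be uploaded to the Pipelines UI
    kfp.compiler.Compiler().compile(loop_pipeline, "loop_pipeline.yaml")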
Multiuser Support
The profile resource in Kubeflow allows users to easily manage the namespaces in which they consume Kubeflow. In v0.7, the profile resource is in beta, and we have made the following improvements.
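For reference, a profile (and its backing namespace) is created from a small custom resource; this is a sketch with an illustrative owner, and the API version shown is the one we believe shipped at the time:
apiVersion: kubeflow.org/v1beta1
kind: Profile
metadata:
  name: alice                  # also the name of the namespace that gets created
spec:
  owner:
    kind: User
    name: alice@example.com    # illustrative owner identity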
Aggregated RBAC
Kubeflow 0.7 now defines three aggregated cluster roles that can be used to grant admin, edit, and view permissions across all Kubeflow applications:
kubectl get clusterrole
kubeflow-admin 26h
kubeflow-edit 26h
kubeflow-view 26h
The kubeflow-view role, for example, aggregates view permissions across resources from different applications, like notebooks and tfjobs.
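Aggregation relies on standard Kubernetes RBAC: the kubeflow-* roles carry an aggregationRule that selects labeled per-application roles. A hypothetical application role could opt in as below; the exact label key follows the convention we see in kubeflow/manifests, so verify it against your deployment:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: notebooks-view         # hypothetical per-application role
  labels:
    # This label causes the rules below to roll up into the kubeflow-view role
    rbac.authorization.kubeflow.org/aggregate-to-kubeflow-view: "true"
rules:
- apiGroups: ["kubeflow.org"]
  resources: ["notebooks"]
  verbs: ["get", "list", "watch"]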
Granting Permission to Use Kubeflow
The new aggregated roles make it easy to grant permissions to consume all the different applications in Kubeflow.
Using kubectl:
kubectl apply -n $PROFILE_NAMESPACE -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  annotations:
    role: view
    user: $CONTRIBUTOR_EMAIL
  name: user-gmail-com-clusterrole-edit
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kubeflow-view
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: $CONTRIBUTOR_EMAIL
EOF
Kubeflow also includes a UI which makes it easy to share access to an existing namespace.
The UI will create role bindings to the kubeflow-edit role for all contributors, granting those users the ability to consume all Kubeflow resources in that namespace.
SDKs
To provide a more consistent experience, the fairing and metadata Python SDKs are now subpackages in the kubeflow Python namespace. This means they can now be imported as follows:
from kubeflow import fairing
from kubeflow import metadata
The fairing and metadata packages still need to be installed separately.
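These are published as separate packages on PyPI; assuming the package names kubeflow-fairing and kubeflow-metadata (check PyPI for the current names):
pip install kubeflow-fairing
pip install kubeflow-metadata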
Workload Identity
On Google Cloud Platform (GCP), we’ve switched to using workload identity to make GCP credentials available to pods. Workload identity allows Google service accounts (GSAs) to be associated with Kubernetes service accounts (KSAs). Pods running with the KSA can then authenticate to GCP APIs using that GSA. This eliminates the need to manually download and store GSA credentials as Kubernetes secrets. The Kubeflow profile controller takes care of automatically binding the Kubeflow service accounts to suitable Google service accounts. More information is available in the docs.
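The profile controller automates this, but for reference the underlying association looks roughly like the following standard GKE workload identity setup (all names are placeholders):
# Allow the Kubernetes service account to impersonate the Google service account
gcloud iam service-accounts add-iam-policy-binding ${GSA_NAME}@${PROJECT}.iam.gserviceaccount.com \
  --role roles/iam.workloadIdentityUser \
  --member "serviceAccount:${PROJECT}.svc.id.goog[${NAMESPACE}/${KSA_NAME}]"

# Annotate the Kubernetes service account to point at the Google service account
kubectl annotate serviceaccount ${KSA_NAME} -n ${NAMESPACE} \
  iam.gke.io/gcp-service-account=${GSA_NAME}@${PROJECT}.iam.gserviceaccount.com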
Documentation
The Kubeflow 0.7 docs include:
- New Python-based visualizations in Kubeflow Pipelines.
- An initial guide to KFServing.
- Metadata API reference.
- Improved guides to deploying Kubeflow on existing Kubernetes clusters (on-prem), GCP, AWS, Azure, and IBM Cloud Private.
Doc contributions are welcome!
What’s coming
The Kubeflow community’s primary focus will now be defining and delivering Kubeflow 1.0, especially for kfctl, notebooks, pipelines, operators, and metadata. We expect further maturing of the other components, and the release of upgrade functionality, which is described below.
Installation and Configuration Improvements
We’ve worked to simplify Kubeflow installation and configuration for both users and maintainers. For a near-term release, i.e. 0.7.x, we’ve identified the need for a config along the lines of the existing kfctl_existing_arrikto config, and decided to refactor it into a community-supported config. This configuration is “batteries included,” which simplifies installation.
The proposed new name is kfctl_istio_dex.yaml.
What is included:
- Kubeflow applications: the intention is for this config to be a batteries-included config, with all the Kubeflow applications (Notebooks, Pipelines, TFJobs, etc.).
- Exposing Kubeflow: Kubeflow is exposed over HTTPS via a LoadBalancer, with a self-signed cert by default.
- Restrictions: Must not depend on any cloud-specific feature. This is targeted at existing, on-prem clusters with no access to cloud-specific services.
- Auth: Istio with Dex as the OIDC provider and oidc-authservice as the OIDC client. Basic auth uses Dex by default, and kfctl will read the environment variables KUBEFLOW_USERNAME and KUBEFLOW_PASSWORD to configure it (see the sketch below).
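Assuming the proposed config name above, an install with basic auth might look roughly like this:
export KUBEFLOW_USERNAME=admin           # illustrative credentials
export KUBEFLOW_PASSWORD=changeme
kfctl apply -V -f kfctl_istio_dex.yaml   # proposed config name from above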
Spark operator
Future versions of Kubeflow will include the Apache Spark operator. This will make it easy for users to run Spark jobs on Kubernetes for preprocessing, batch inference, and other ML-related tasks.
Kubeflow upgrades
The community has worked to design and lay the foundation for Kubeflow upgrades. The requirements and functionality are defined in this Upgrade Design document. We plan to implement an initial version of the upgrade functionality in Kubeflow v1.0, though a preview of the upgrade functionality may be available in an upcoming v0.7.x release as well.
Community
Finally, thanks to all who contributed to Kubeflow 0.7! Community is at the heart of our project, and we are very thankful to 150+ contributors from 25+ organizations working together to build a Kubernetes-native, portable and scalable ML stack.
In the lead-up to Kubeflow 1.0, we will need even more help, especially around testing beta functionality, reporting bugs, and updating documentation.
Here’s how to get involved:
- Join the Kubeflow Slack channel
- Join the kubeflow-discuss mailing list
- Attend a weekly community meeting
- Download and run Kubeflow, and submit bugs!
Thanks to Animesh Singh (IBM), Johnu George & Krishna Durai (Cisco), Constantinos Venetsanopoulos (Arrikto), Jeremy Lewi, Richard Lui, Lun-Kai Hsu, Ellis Bigelow, Kunming Qu, Pavel Dournov, Sarah Maddox, Abhishek Gupta, Holden Karau, Jessie Zhu, and Thea Lamkin (all of Google) for contributing to this post.