Kubeflow 1.0 — Quick Overview

Subhash Burramsetty
May 28, 2020

Introduction:

Kubeflow is a free, open-source, Kubernetes-native machine learning platform for developing, orchestrating, deploying and running scalable and portable machine learning workloads. It started as an internal Google project and has since been open sourced. It made its debut at the annual KubeCon conference in 2017, and almost three years later version 1.0 was released in March 2020. It can be installed on the cloud, on-premises, and on local machines.

It is built mainly around 3 principles:

  1. Composability
  2. Portability
  3. Scalability

Here’s a reference architecture of Kubeflow on AWS:

Image source: https://www.kubeflow.org/docs/aws/aws-e2e/

Kubeflow Components:

Central Dashboard:

Kubeflow Central Dashboard

The Central Dashboard is the user interface of Kubeflow that provides a quick overview of, and fast access to, the components deployed in the cluster. It shows recent notebooks and pipeline runs, offers shortcuts to all the features, and lets you share user access across namespaces in the Kubeflow deployment.

Notebooks:

Notebooks give you the ability to configure and create Jupyter notebooks in the Kubeflow cluster on the fly.

Notebook Servers page view
Notebook creation page view

Some of the features of Notebooks are:

  • Ability to spin up a notebook using default or custom Docker images
  • Ability to configure CPU, RAM, GPU and storage with a simple UI during notebook creation
  • Ability to configure secrets and environment variables
  • Ability to create new storage volumes or mount existing ones to the notebooks
  • Complete integration with the multi-user isolation feature, which keeps your notebooks isolated from other users of the cluster

Pipelines:

A pipeline is a description of an ML workflow, including all of the components that make up the steps in the workflow and how the components interact with each other. Pipelines is a core component of Kubeflow that makes it easy to build and deploy machine learning pipelines based on Docker containers without worrying about the low-level details of managing a Kubernetes cluster. Kubeflow Pipelines aims to provide end-to-end orchestration, easy experimentation and easy reuse.

Sample graph of a Kubeflow pipeline (Image Source: https://cloud.google.com/solutions/machine-learning/architecture-for-mlops-using-tfx-kubeflow-pipelines-and-cloud-build)

As you can see in the above sample pipeline, every box is a pipeline component and runs in its own separate container when the pipeline runs. While a pipeline is running, you can click on each component and look at its input and output parameters, input and output artifacts, container logs, confusion matrices, TensorBoards, visualisations and so on.

As cool as it looks and sounds, constructing these pipelines is also a bit complex. There are multiple ways of creating and running pipelines in Kubeflow. Let's have a look at them:

Method 1 — DevOps Way: Segregate the code into multiple files, create a new version of each component by packaging its code into a Docker image, configure the input and output parameters and the sequence in which the components have to run, and integrate them all together into a Kubeflow pipeline.

Steps to create KFPipeline (Image Source: https://www.kubeflow.org/docs/pipelines/sdk/sdk-overview/)
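
To give a feel for Method 1, here is a minimal sketch of a two-step pipeline built with the kfp Python SDK (v1-style dsl.ContainerOp). The container images, paths and parameters are hypothetical placeholders, not part of the original article:

    # Minimal two-step pipeline: preprocess, then train.
    import kfp
    from kfp import dsl, compiler

    def preprocess_op(data_path: str):
        return dsl.ContainerOp(
            name="preprocess",
            image="registry.example.com/preprocess:latest",  # hypothetical image
            arguments=["--data-path", data_path],
            file_outputs={"features": "/tmp/features.csv"},
        )

    def train_op(features: str):
        return dsl.ContainerOp(
            name="train",
            image="registry.example.com/train:latest",  # hypothetical image
            arguments=["--features", features],
        )

    @dsl.pipeline(name="demo-pipeline", description="Preprocess, then train")
    def demo_pipeline(data_path: str = "s3://my-bucket/raw.csv"):
        preprocess = preprocess_op(data_path)
        train_op(preprocess.outputs["features"])  # runs after preprocess finishes

    if __name__ == "__main__":
        # Compile to an Argo workflow YAML that can be uploaded via the Pipelines UI
        compiler.Compiler().compile(demo_pipeline, "demo_pipeline.yaml")

The compiled YAML can then be uploaded through the Pipelines UI or submitted with the kfp client.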

Method 2 — Data Scientist friendly way using Papermill: Place all the machine learning code in your notebook, group all your input parameters into a single notebook cell, tag it as “parameters” and use the Papermill tool to run the whole notebook as a single bundle. This makes it a bit harder to debug or to glance at intermediate artifacts, TensorBoards, etc.

Image source: https://towardsdatascience.com/how-to-deploy-jupyter-notebooks-as-components-of-a-kubeflow-ml-pipeline-part-2-b1df77f4e5b3
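
For Method 2, the Papermill invocation itself is short. A minimal sketch, with hypothetical notebook paths and parameter names:

    # Run a parameterized notebook as a single bundle with Papermill.
    # "train.ipynb" must contain a cell tagged "parameters"; the values passed
    # below override the defaults defined in that cell.
    import papermill as pm

    pm.execute_notebook(
        "train.ipynb",           # input notebook with the tagged parameters cell
        "train_output.ipynb",    # executed copy, with all cell outputs saved
        parameters={"learning_rate": 0.01, "epochs": 10},
    )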

Method 3 — Using Kale (Kubeflow Automated pipeLines Engine): In this method, we tag the notebook cells with names and compile the notebook with Kale, which generates the pipeline code automatically.

KF pipeline creation using Kale (Image source: https://medium.com/kubeflow/automating-jupyter-notebook-deployments-to-kubeflow-pipelines-with-kale-a4ede38bea1f)

Pipelines also provides options to configure triggers or schedule runs with cron expressions so that they run as needed. Needless to say, Pipelines is an amazing component in the Kubeflow ecosystem.
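
For instance, a recurring run can be scheduled from the kfp client roughly as follows; the experiment name, pipeline package and cron expression are hypothetical:

    # Schedule a compiled pipeline with a cron trigger via the kfp client.
    import kfp

    client = kfp.Client()  # assumes an in-cluster or otherwise configured connection
    experiment = client.create_experiment(name="nightly-training")

    client.create_recurring_run(
        experiment_id=experiment.id,
        job_name="nightly-demo-pipeline",
        cron_expression="0 0 2 * * *",            # every day at 02:00 (6-field cron)
        pipeline_package_path="demo_pipeline.yaml",
    )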

Frameworks for Training:

Kubeflow provides Kubernetes custom resources (CRDs) that can be used to run various non-distributed and distributed training jobs on top of Kubernetes (a minimal sketch follows the list below). Some of them are:

  • TensorFlow Training (TFJob)
  • PyTorch Training
  • MXNet Training
  • MPI Training
  • Chainer Training
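
As a rough illustration of how these CRDs are used, a TFJob can be submitted with the official Kubernetes Python client. The image, namespace and replica count are hypothetical, and the manifest is trimmed to the essentials:

    # Submit a TFJob custom resource through the Kubernetes API.
    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() inside the cluster

    tfjob = {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": "mnist-demo", "namespace": "my-namespace"},
        "spec": {
            "tfReplicaSpecs": {
                "Worker": {
                    "replicas": 2,
                    "restartPolicy": "OnFailure",
                    "template": {
                        "spec": {
                            "containers": [{
                                "name": "tensorflow",
                                "image": "registry.example.com/mnist-train:latest",
                            }]
                        }
                    },
                }
            }
        },
    }

    client.CustomObjectsApi().create_namespaced_custom_object(
        group="kubeflow.org", version="v1", namespace="my-namespace",
        plural="tfjobs", body=tfjob,
    )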

Hyperparameter Tuning (Katib):

Image source: https://raw.githubusercontent.com/kubeflow/katib/master/docs/images/katib-ui.png

Katib is the component for hyperparameter tuning of machine learning models in Kubeflow. It is based on Google Vizier. It has both Web UI and API support, and it supports many ML frameworks including TensorFlow, Apache MXNet, PyTorch, XGBoost and so on.

Hyperparameter tuning Algorithms:

  • Random Search
  • Grid Search
  • Hyperband
  • Bayesian Optimization
  • CMA Evolution Strategy
  • Tree of Parzen Estimators (TPE)

Neural Architecture Search:

  • Efficient Neural Architecture Search (ENAS)

Tools for ML model serving:

Kubeflow supports both standalone model serving systems and multi-framework model serving. Based on the requirements, one can choose the option that best supports their model serving needs.

Multi-framework serving:

  • KFServing
  • Seldon Core Serving
  • BentoML

Standalone framework serving:

  • TensorFlow Serving
  • TensorFlow Batch Prediction
  • NVIDIA Triton Inference Server

KFServing:

Image source: https://raw.githubusercontent.com/kubeflow/kfserving/master/docs/diagrams/kfserving.png

KFServing is a framework for serverless inferencing on Kubernetes. It aims to solve production model serving use cases by providing high-level abstraction interfaces for ML frameworks like TensorFlow, XGBoost, scikit-learn, PyTorch, etc., and a simple, pluggable interface for ML serving that covers prediction, pre-processing, post-processing and explainability.

It encapsulates the complexity of:

  • Autoscaling
  • Networking
  • Load balancing
  • Health checking
  • Server configuration

It also provides serving features for ML deployments (a minimal deployment sketch follows the list) like:

  • GPU Autoscaling
  • Scale to Zero
  • Canary Rollouts
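
For instance, a scikit-learn model can be exposed through a KFServing InferenceService custom resource. The sketch below assumes the v1alpha2 API that shipped around Kubeflow 1.0, and the storage URI and namespace are placeholders:

    # Create a KFServing InferenceService via the Kubernetes API (v1alpha2 schema).
    from kubernetes import client, config

    config.load_kube_config()

    inference_service = {
        "apiVersion": "serving.kubeflow.org/v1alpha2",
        "kind": "InferenceService",
        "metadata": {"name": "sklearn-iris", "namespace": "my-namespace"},
        "spec": {
            "default": {
                "predictor": {
                    "sklearn": {
                        # placeholder URI; point this at your trained model artifacts
                        "storageUri": "gs://my-bucket/models/sklearn/iris",
                    }
                }
            }
        },
    }

    client.CustomObjectsApi().create_namespaced_custom_object(
        group="serving.kubeflow.org", version="v1alpha2", namespace="my-namespace",
        plural="inferenceservices", body=inference_service,
    )

Once the InferenceService is ready, KFServing exposes an HTTP prediction endpoint for the model and handles autoscaling (including scale to zero) behind the scenes.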

Multi-tenancy:

  • A reliable way for users to isolate and protect their own resources, without accidentally viewing or changing each other’s resources
  • Isolation mechanisms also prevent accidental deletion/modification of resources of other users in the deployment.
  • Multi-tenancy is currently built around user namespaces
  • New namespaces can be created using “kind: Profile” in Kubernetes (see the sketch after this list)
  • Uses Istio to control in-cluster traffic.
  • Kubeflow defines user-specific namespaces and uses Kubernetes role-based access control (RBAC) policies to manage user access.
  • Kubeflow uses the Profile custom resource to control all policies, roles, and bindings involved and to guarantee consistency.
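
As an example of the Profile-based namespace creation mentioned above, a new user namespace can be created by applying a Profile resource. A minimal hedged sketch; the profile name and owner email are placeholders, and the apiVersion may differ between Kubeflow releases:

    # Create a user Profile (and its backing namespace) through the Kubernetes API.
    from kubernetes import client, config

    config.load_kube_config()

    profile = {
        "apiVersion": "kubeflow.org/v1",
        "kind": "Profile",
        "metadata": {"name": "alice"},  # a namespace named "alice" is created
        "spec": {
            "owner": {"kind": "User", "name": "alice@example.com"},
        },
    }

    # Profiles are cluster-scoped, so the cluster-level API call is used here.
    client.CustomObjectsApi().create_cluster_custom_object(
        group="kubeflow.org", version="v1", plural="profiles", body=profile,
    )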

Key concepts in multi-tenancy:

  • Administrator: An administrator is someone who creates and maintains the Kubeflow cluster. This person has the permission to grant access permissions to others
  • User: A user is someone who has access to some set of resources in the cluster. A user needs to be granted access permissions by the administrator
  • Profile: A profile is a grouping of all the Kubernetes resources owned by a user

Current integration and limitations of multi-tenancy:

  • The Jupyter notebooks service is the first application to be fully integrated with multi-user isolation
  • Metadata, Pipelines and other applications don’t yet have full-fledged integration with isolation

Fairing:

  • Streamlines the process of building, training, and deploying machine learning (ML) training jobs in a hybrid cloud environment
  • You can run your ML training job locally or in the cloud, directly from Python code or a Jupyter notebook
  • Allows you to deploy the trained model on Kubeflow

Kubeflow Application Matrix, SDK and CLI:

As of 3rd May 2020 (Image Source: https://www.kubeflow.org/docs/reference/version-policy/)

From the above image, we can observe that a lot of components are still in the Alpha or Beta stage. Kindly refer to the following link for the current status of the application matrix and the versions of the Kubeflow SDK and CLI: https://www.kubeflow.org/docs/reference/version-policy/

Usage Reporting:

Kubeflow gives users the option to opt in to or out of reporting anonymous usage data. It is completely voluntary; when enabled, Kubeflow uses the Spartakus reporting tool to report anonymous usage data.

For more details, refer to the following link: https://www.kubeflow.org/docs/other-guides/usage-reporting/

Limitations of Kubeflow:

  1. At the moment, there is no option to log out of the Kubeflow application after logging in in multi-user mode. I tried authentication with AWS Cognito and observed that you either have to figure out a way to sign out of the session from your browser or wait until the session expires. (Issue: https://github.com/kubeflow/kubeflow/issues/4702)
  2. I haven’t come across any option in the UI to modify the configuration of notebooks once they are created.
  3. If there are not enough resources in the cluster and you try to run training jobs, the pods get stuck in the Pending state until resources become available. This can be a problem for daily scheduled jobs: pods that stay pending and never run might affect SLAs.
  4. The multi-user isolation feature is stable but is currently fully integrated only with Notebooks. The official multi-tenancy docs also carry the following note: “Note that the isolation support in Kubeflow doesn’t provide any hard security guarantees against malicious attempts by users to infiltrate other user’s profiles.” (Link: https://www.kubeflow.org/docs/components/multi-tenancy/overview/)
  5. Kubeflow currently doesn’t have a dedicated CI/CD tool to trigger pipelines. This can still be overcome with a combination of Pipelines and third-party integrations.
  6. It would be great if a UI becomes available in the future for deploying ML model artifacts and configuring things like canary deployments and traffic splitting between multiple models, so that one need not learn Kubernetes to configure them.
  7. There are a lot of versions out there of Kubernetes, Istio, Knative Serving, cert-manager, kustomize, Helm, etc. We need to make sure that we install versions that are compatible with each other.

Note: The above contents are based on my own opinions and understanding of the resources available at the time this blog was written.

Summary:

Kubeflow is on a mission to make scaling ML models and deploying them to production as simple as possible. It is amazing to see how Kubeflow has evolved over the last few years. It helps you keep track of all the users, pipelines, runs, generated artifacts, deployments, scaling and so on. If you are an enterprise with a lot of data scientists and hundreds of machine learning models, then Kubeflow is definitely a must-have tool, and it provides a good trade-off in terms of cost as well. Go give it a try and see for yourself.

References:

  1. https://www.kubeflow.org/docs/
  2. https://github.com/kubeflow
  3. https://github.com/kubeflow/kfserving
  4. https://github.com/kubeflow/pipelines
  5. https://github.com/kubeflow/katib
  6. https://towardsdatascience.com/how-to-create-and-deploy-a-kubeflow-machine-learning-pipeline-part-1-efea7a4b650f
  7. https://towardsdatascience.com/how-to-deploy-jupyter-notebooks-as-components-of-a-kubeflow-ml-pipeline-part-2-b1df77f4e5b3
  8. https://medium.com/google-cloud/how-to-carry-out-ci-cd-in-machine-learning-mlops-using-kubeflow-ml-pipelines-part-3-bdaf68082112
  9. https://towardsdatascience.com/kubeflow-for-poets-a05a5d4158ce
  10. https://d1.awsstatic.com/events/reinvent/2019/REPEAT_1_Building_machine-learning_infrastructure_on_Amazon_EKS_with_Kubeflow_CON306-R1.pdf
  11. https://eksworkshop.com/advanced/420_kubeflow/
  12. https://medium.com/kubeflow/an-end-to-end-ml-workflow-from-notebook-to-kubeflow-pipelines-with-minikf-kale-72244d245d53

and many more :)
