Quick Start With Kubeflow Pipelines on Google Cloud Platform

Saiteja Jayanthi · Published in The Startup · Nov 19, 2020 · 8 min read

The main purpose of this article is to demonstrate how to install and use Kubeflow on Google Cloud Platform (GCP). The sections below walk through a typical machine learning (ML) life cycle and its challenges, introduce Kubeflow, describe the setup procedure, create a sample pipeline, and finally end with use case scenarios.

ML Life Cycle and Its Challenges

When most people hear of machine learning, they often jump first to building models. Several popular frameworks make this process much easier, such as TensorFlow, PyTorch, Scikit Learn, XGBoost, and Caffe. Each of these platforms is designed to make data scientists' jobs easier as they explore their problem space.

However, in the reality of building an actual production-grade solution, there are many more complex steps. These include importing, transforming, and visualizing the data; building and validating the model; training the model at scale; and deploying the model to production. Focusing only on model training misses the majority of the day-to-day job of a data scientist.

These are some of the key challenges:

  1. A significant amount of time is consumed in setting up pipelines for training and evaluation.
  2. Difficulty in reusing ML components because of a tightly coupled code base and infrastructure components.
  3. Extra effort is required to track experiments and to version ML models.
  4. Difficulty in tracing and reproducing the output of ML components.

Introducing Kubeflow

Kubeflow is a machine learning toolkit: an open-source platform designed to orchestrate complicated machine learning workflows running on Kubernetes. It is dedicated to making deployments of ML workflows on Kubernetes simple, portable, and scalable. Kubeflow initially started as a simpler way to run TensorFlow jobs on Kubernetes, based on Google's internal TensorFlow Extended (TFX) pipelines, and then expanded into a multi-architecture, multi-cloud framework for running entire machine learning pipelines.

This platform is meant for data scientists who want to build and experiment with ML pipelines. Kubeflow is also for ML engineers and operational teams who want to deploy ML systems to various environments for development, testing, and production-level serving.

Kubeflow provides a native way to extend the same features of k8s to an organization’s ML needs, which is particularly useful for augmenting existing services without rewriting deployments from scratch. Having Kubeflow in the organization means one can focus more on the problem at hand and worry less about how to set things up and manage them over time.

Kubeflow addresses most of these ML challenges. Its key features include end-to-end orchestration, reusable pipelines, experiment tracking, hyperparameter tuning with Katib, and support for multiple training frameworks.

Setting up Kubeflow on Google Cloud Platform

This section describes the procedure to create the infrastructure and set up Kubeflow on GCP using AI Platform Pipelines, the managed version of Kubeflow Pipelines that can be deployed onto a k8s cluster from the GCP Marketplace.

MLOps is the practice of applying DevOps practices to help automate, manage, and audit ML workflows. ML pipelines are portable and reproducible definitions of ML workflows. AI Platform Pipelines makes it easier to get started with MLOps by saving you the difficulty of setting up Kubeflow Pipelines with TensorFlow Extended (TFX).

Prerequisites

  1. A Google Cloud Platform Project
  2. IAM roles: to create AI Platform Pipelines, a user needs Project Viewer and Kubernetes Engine Admin. To access the pipelines, Kubernetes Engine Cluster Viewer and Service Account User (on the cluster’s service account) are the minimum roles.

Note: For user authentication to the pipelines dashboard, Google Sign-In is implemented by default and integrated with IAM. After a user signs in, the dashboard checks for the required IAM roles; users without these roles cannot access it. Access to the dashboard can therefore be restricted through IAM, which makes the setup more secure.

Steps to be followed:

  • Navigate to AI Platform Pipelines in the GCP console.
  • In the AI Platform Pipelines toolbar, click New Instance. The Kubeflow Pipelines page opens in Google Cloud Marketplace.
  • Click Configure. The Deploy Kubeflow Pipelines page opens.
  • Select the Cluster zone where your GKE cluster should be created, e.g., us-central1-a.
  • Check Allow access to the following Cloud APIs to grant applications running on your GKE cluster access to Google Cloud resources. This access scope provides full access to the Google Cloud resources that you have enabled in your project.
  • Click the Create Cluster button. Cluster creation takes a few minutes. Alternatively, create a cluster manually before coming to this page and choose it when deploying Kubeflow Pipelines.
  • Namespaces offer logical isolation of k8s resources. Multiple AI Platform Pipelines instances can be created in the same GKE cluster, with their resources isolated from each other using different namespaces. Instances can also be spread across multiple clusters.
  • In the Namespace dropdown, click Create a namespace and enter a new namespace name such as dev or prod.
  • In the App instance name box, enter a name for your Kubeflow Pipelines instance. The remaining fields are optional and can be left as they are.
  • Click Deploy and wait a few minutes for the deployment to complete.
  • To access the Pipelines dashboard, open AI Platform Pipelines in the Cloud Console.
  • Then, click Open pipelines dashboard for your AI Platform Pipelines instance.
  • After a successful installation, the Kubeflow Pipelines dashboard opens with some tutorial pipelines preloaded.

Note:

The minimum GKE cluster configuration is 3 nodes with 2 vCPUs and 4 GB of RAM each, with autoscaling enabled between a minimum of 3 and a maximum of 6 nodes. The cluster’s access scope must grant full access to all Cloud APIs, or the cluster must use a custom service account.

If more AI Platform Pipelines instances are added to the same cluster, the autoscaler increases the number of nodes to accommodate them.

This completes the installation procedure, which takes around 15 to 20 minutes in total. You can now proceed to create pipelines and experiments, either from the dashboard or programmatically through the Kubeflow Pipelines SDK, as sketched below.
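As a quick connectivity check, the instance can be reached from the kfp SDK (v1.x, current when this article was written). This is a minimal sketch; the host URL below is a placeholder, and the real one must be copied from the Settings link of your AI Platform Pipelines instance.

```python
# pip install kfp  (the v1.x SDK)
import kfp

# Placeholder endpoint: copy the real URL from your instance's Settings page.
client = kfp.Client(
    host="https://XXXX-dot-us-central1.pipelines.googleusercontent.com"
)

# Listing pipelines verifies both connectivity and IAM permissions.
print(client.list_pipelines())
```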

Creating a sample pipeline

  • A Pipeline is a group of components. For example, consider a training pipeline with three steps: preprocessing, model training, and evaluation. These individual steps are referred to as components; they run in separate Docker containers, and together they make up a pipeline.
  • Runs are executions of a pipeline. A single pipeline can be used several times by creating a new run each time.
  • Experiments group all similar runs of a pipeline in one place, so the performance of multiple runs can be compared.
  • Navigate to the Pipelines section of the dashboard.
  • Some tutorial pipelines are available on the Pipelines page after installation. Beginners can experiment with these tutorial pipelines.
  • To create a custom pipeline, click the Upload Pipeline button in the top right corner.
  • Fill in the Pipeline name and description fields, then upload a pipeline script written in the specified syntax, or import it by URL if it is publicly accessible. The package can be a YAML file or a .zip or .tar.gz archive (see the authoring sketch after this list).
  • Refer to AI Hub Samples, which has sample pipeline scripts that are useful to start with instead of writing from scratch.
  • Click the Create button after filling in the details, and the pipeline will be created. To actually execute a pipeline, create a run of it.
  • To show how to execute a pipeline, the example pipeline [Tutorial] DSL — Control structures is used here.
  • Select this tutorial pipeline, click the Create run button, fill in the required details, and start the pipeline. The run will be created, and its status can be observed on the run details page.
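For reference, here is a minimal sketch of how such an uploadable package can be authored with the kfp v1 SDK. The component functions, pipeline name, and output file name are illustrative placeholders, not part of any official sample.

```python
# pip install kfp  (v1.x SDK)
import kfp
from kfp import dsl
from kfp.components import create_component_from_func


def preprocess(message: str) -> str:
    """Stand-in preprocessing step; real code would load and clean data."""
    return message.upper()


def train(message: str) -> str:
    """Stand-in training step; real code would fit and save a model."""
    return "trained on: " + message


# Wrap the plain Python functions as pipeline components; each one
# runs in its own Docker container when the pipeline executes.
preprocess_op = create_component_from_func(preprocess)
train_op = create_component_from_func(train)


@dsl.pipeline(name="demo-pipeline", description="A minimal two-step pipeline.")
def demo_pipeline(message: str = "hello kubeflow"):
    pre = preprocess_op(message)   # step 1: preprocessing
    train_op(pre.output)           # step 2: training, consumes step 1's output


# Compile to a package that the Upload Pipeline dialog accepts
# (.yaml, .zip, or .tar.gz).
kfp.compiler.Compiler().compile(demo_pipeline, "demo_pipeline.yaml")
```

Uploading the resulting demo_pipeline.yaml through the dashboard creates the pipeline exactly as described in the steps above.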

In this way, Kubeflow pipelines can be created and executed, with runs capturing each execution and experiments grouping related runs in one place. The same concepts are available programmatically, as sketched below.
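A hedged sketch of driving runs and experiments from the kfp v1 client, reusing the placeholder host URL and the demo_pipeline.yaml package compiled above:

```python
import kfp

# Placeholder endpoint, as before.
client = kfp.Client(
    host="https://XXXX-dot-us-central1.pipelines.googleusercontent.com"
)

# Experiments group related runs so they can be compared side by side.
experiment = client.create_experiment(name="demo-experiment")

# Each call creates one more run of the same pipeline package.
run = client.run_pipeline(
    experiment_id=experiment.id,
    job_name="demo-run-1",
    pipeline_package_path="demo_pipeline.yaml",
    params={"message": "hello kubeflow"},
)
print(run.id)
```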

Use Case Scenarios

  • Deploying and managing a complex ML system at scale: With Kubeflow, one can manage an entire AI organization at scale. Kubeflow’s core and ecosystem critical user journeys (CUJs) provide software solutions for end-to-end workflows, which means one can easily build, train, and deploy a model, and create, run, and explore a pipeline.
  • Experimentation with training an ML model: Kubeflow provides stable software sub-systems for model training, including Jupyter notebooks and popular machine learning training operators such as those for TensorFlow and PyTorch, which run efficiently and securely in isolated Kubernetes namespaces.
  • End-to-end hybrid and multi-cloud ML workloads: Kubeflow is supported by all major cloud providers and is available for on-premises installation, which fulfills the requirement of developing machine learning models in hybrid environments with multi-cloud portability.
  • Tuning the model hyperparameters during training: Tuning hyperparameters is critical for model performance and accuracy. With the help of Kubeflow’s hyperparameter tuner, Katib, hyperparameter tuning can easily be done in an automated way. This automation not only lessens computation time but also speeds up the delivery of improved models.
  • Continuous integration and deployment (CI/CD) for ML: Although Kubeflow does not yet have a dedicated CI/CD tool, Kubeflow Pipelines can be used to create reproducible workflows. These workflows automate the steps needed to build an ML workflow, which delivers consistency, saves iteration time, and helps with debugging, compliance requirements, and more (see the scheduling sketch after this list).
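As one possible building block for such automation, the kfp v1 client can schedule recurring runs of a compiled pipeline; a CI system could equally call run_pipeline() after every merge. A sketch, again with placeholder names:

```python
import kfp

client = kfp.Client(
    host="https://XXXX-dot-us-central1.pipelines.googleusercontent.com"
)
experiment = client.create_experiment(name="ci-experiment")

# Re-run the compiled pipeline every night at 02:00.
# Kubeflow Pipelines uses a six-field cron expression (seconds first).
client.create_recurring_run(
    experiment_id=experiment.id,
    job_name="nightly-retrain",
    cron_expression="0 0 2 * * *",
    pipeline_package_path="demo_pipeline.yaml",
)
```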

Conclusion

  • This article covered the Kubeflow installation and usage procedure on GCP. Refer to Setting up AI Platform Pipelines in the GCP documentation for more details.
  • To do this setup on-premises or in any other cloud, refer to the Kubeflow documentation. The approach to using it remains the same irrespective of the platform.
