AI Tales: Building a Machine learning pipeline using Kubeflow and Minio

Simple, Scalable, Performant, Portable and Cost-effective

karthic Rao

Published in

Kredaro-engineering

Dec 9, 2018

This blog is the story of Joe and Kubeman! Interested? Let’s get started.

Meet Joe, our neighborhood Machine learning enthusiast.

Joe loves Machine Learning, Deep Learning, and Artificial Intelligence.

What does Joe do about it?

Like most of us, he takes up a bunch of courses on Machine learning, Deep learning, and Artificial Intelligence.

He builds TensorFlow, PyTorch, and Keras Deep learning models on his laptop.

Joe is an AI expert! Which makes Joe the coolest dude in town, all set to Rock and Roll!

Joe is all excited. He joins a fancy startup that builds ML applications for millions of customers. Life couldn’t be better for Joe. However…

The ground realities of building Machine learning applications

Joe soon realizes that building Deep learning and Machine learning products is much more than just running models on a chunk of data! It has its challenges, and it can be daunting!

This is what Joe thought building a Machine learning product was all about!

Joe had a narrow view of what building a Machine learning product involves! (This is what most of academia teaches you to do.)

However, this is what a real-world Machine learning application might look like.

This is what an actual Machine learning application looks like in the real world. Building the model is just one part of it.

The portability challenge

Joe’s woes don’t stop at his wrong assumption about what a real-world ML application is all about. Portability issues start to haunt him:

Joe finds it hard to collaborate with fellow Machine learning engineers in the company, since it’s hard to get the same development environment across machines and operating systems.
Moving the models from dev to prod turns out to be a nightmare.
The multi-cloud strategy (one provider for training, another for serving) makes it extremely hard to keep a consistent environment.

Joe realizes that there are many building blocks to the production-grade Machine learning model.
It becomes clear to him that building tens of Machine learning applications is, at times, like playing with Lego.
Most of the time, you need to reuse the existing blocks and plug in a few new ones.
You need an easy way to compose the system using various building blocks.

Joe is a Machine learning engineer, and he is good at it.
However, he gets tired of having to bug the DevOps engineers.
He needs them to allocate machines, schedule jobs, set up the environments, create subnets, handle service discovery, run the necessary services, collect the metrics, and set up storage for the data.
Phew! The list goes on.

Joe wishes he had an easy way to solve the DevOps challenges associated with ML/DL.

Joe primarily uses TensorFlow. He realizes that training Machine learning algorithms on large amounts of data and serving the trained models as APIs for millions of customers poses a serious scaling issue.

Here are the factors that make scaling hard:

Storing vast volumes of data.
The computational overhead involved in processing large amounts of data.
Distributing Machine learning training across multiple computational resources.
Cost issues.

Joe found it hard to train on large amounts of data in a reasonable time.

Here are the primary reasons:

Throughput limits of the underlying storage system during training.
Procuring CPUs, GPUs, TPUs, and other computational resources to scale up performance.
Throttling, rate limiting, bandwidth charges, and low throughput from public cloud storage systems.

Joe is tired and frustrated with trying to productionize the Machine learning pipeline.
As a last resort, Joe moves to managed services.
It makes things easier: he can get started faster, and the DevOps work is abstracted away. It is all great.

Then the bills arrived!

Joe knew that using managed services made things easier.
However, the bills put massive pressure on his startup’s goal of becoming profitable soon.

Joe is worked up and frustrated, and he is just about to give up. Then Kubeman arrives to the rescue!

Kubernetes already solves most of these issues!

You Run wherever Kubernetes runs: Portability solved

Kubernetes runs everywhere: on your local machine, on Google Cloud, on Amazon Web Services, and on Microsoft Azure. Running Machine learning and Deep learning workloads on Kubernetes therefore solves the portability issues.

Pick your containers and let K8s orchestrate: Composability solved!

Fortunately, Kubernetes makes the management of distributed workloads easy.
Kubernetes is a mature, production-ready platform.
It gives developers a simple API to deploy programs to a cluster of machines as if they were a single piece of hardware.
This not only simplifies deployments but also leads to better resource utilization, and thus lower costs.
Using Kubernetes, computational resources can be added or removed as desired.
The same cluster can be used to train the model and, once training is over, to serve it.

Joe: Hey Kubeman, we all know what Kubernetes offers.

But I’m a Machine learning engineer, and one can’t deny that you need significant DevOps expertise to use Kubernetes for Machine learning.

Here are my concerns about using Kubernetes to do Machine learning.

Introducing Kubeflow!

Kubernetes provides a consistent way to deploy and run your applications, and Kubeflow helps you define your ML pipeline on top of Kubernetes.
Kubeflow adds some resources to your cluster to assist with a variety of tasks, including training and serving models and running Jupyter Notebooks.
It extends the Kubernetes API by adding new Custom Resource Definitions (CRDs) to your cluster. Hence, Machine learning workloads can be treated as first-class citizens by Kubernetes.
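
To make the CRD idea concrete, here is a minimal Python sketch of what a training job looks like once it is "just another Kubernetes resource". It only builds the manifest as a dict and prints it as JSON (which kubectl accepts as well as YAML). The field names follow the kubeflow.org/v1alpha2 TFJob schema as an assumption, so check them against the CRD version installed in your cluster. The job name and image are hypothetical.

```python
import json


def build_tfjob(name, image, workers=1, ps=0, namespace="default"):
    """Return a TFJob custom resource as a plain dict.

    Assumption: field names follow the kubeflow.org/v1alpha2 TFJob
    schema; verify them against the CRD installed in your cluster.
    """

    def replica(count):
        # Every replica type runs the same training image in this sketch.
        return {
            "replicas": count,
            "template": {
                "spec": {
                    "containers": [{"name": "tensorflow", "image": image}],
                    "restartPolicy": "OnFailure",
                }
            },
        }

    replica_specs = {"Worker": replica(workers)}
    if ps > 0:
        # Parameter servers are only added when requested.
        replica_specs["PS"] = replica(ps)

    return {
        "apiVersion": "kubeflow.org/v1alpha2",
        "kind": "TFJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"tfReplicaSpecs": replica_specs},
    }


# Hypothetical job name and image, for illustration only.
manifest = build_tfjob(
    "my-training", "example.com/my-team/train:latest", workers=2, ps=1
)
print(json.dumps(manifest, indent=2))
```

The point is that the manifest has the same shape as any other Kubernetes object (apiVersion, kind, metadata, spec), so the usual tooling applies to it.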

The Data challenge

Here is the add-on, Minio. Minio fits amazingly well into the Cloud-native environment inside Kubernetes. It’s simple, fast, scalable, and S3 compatible, which makes it a good fit for storing the data required for Deep learning training.
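
Because Minio speaks the S3 API, a training container can usually be pointed at it with a handful of environment variables. The sketch below builds the variables that TensorFlow’s S3 filesystem support reads. It assumes your TensorFlow build includes S3 support and that Minio is reachable at minio2:9000. The access keys are placeholders.

```python
def minio_env_for_tensorflow(endpoint, access_key, secret_key, use_https=False):
    """Build environment variables that point TensorFlow's S3 layer at Minio.

    Assumption: the TensorFlow build in the training image has S3
    filesystem support enabled.
    """
    # S3_ENDPOINT expects host:port, without a scheme or trailing slash.
    host = endpoint.split("://", 1)[-1].rstrip("/")
    return {
        "S3_ENDPOINT": host,
        "S3_USE_HTTPS": "1" if use_https else "0",
        "S3_VERIFY_SSL": "1" if use_https else "0",
        "AWS_ACCESS_KEY_ID": access_key,
        "AWS_SECRET_ACCESS_KEY": secret_key,
    }


# Placeholder credentials, for illustration only.
env = minio_env_for_tensorflow(
    "http://minio2:9000", "EXAMPLE_ACCESS_KEY", "EXAMPLE_SECRET_KEY"
)
print("S3_ENDPOINT =", env["S3_ENDPOINT"])
print("S3_USE_HTTPS =", env["S3_USE_HTTPS"])
```

With these set in the container, the training code can typically read and write `s3://bucket/path` locations as if it were talking to a public cloud object store.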

YAML is Scary!

Kubeflow makes use of ksonnet to help manage deployments. ksonnet acts as another layer on top of kubectl.

While Kubernetes is typically managed with static YAML files, ksonnet adds a further abstraction that is closer to standard OOP objects.
Resources are managed as prototypes with empty parameters, which can be instantiated into components by defining values for the parameters. Advantages?

This system makes it easier to deploy slightly different resources to different clusters at the same time, making it easy to maintain different environments for staging and production.
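
If prototypes, components, and environments sound abstract, the following Python sketch models the idea. It is a conceptual toy, not how ksonnet is implemented: a prototype is a template with empty parameters, a component is the prototype with values filled in, and an environment applies a few overrides on top. All names and values are hypothetical.

```python
# A prototype: a template whose required parameters are still empty (None).
TF_JOB_PROTOTYPE = {"name": None, "image": None, "num_workers": 0, "num_gpus": 0}


def instantiate(prototype, **params):
    """Turn a prototype into a component by filling in its parameters."""
    unknown = set(params) - set(prototype)
    if unknown:
        raise KeyError("unknown parameters: {}".format(sorted(unknown)))
    component = dict(prototype)
    component.update(params)
    missing = [key for key, value in component.items() if value is None]
    if missing:
        raise ValueError("parameters still empty: {}".format(missing))
    return component


def for_environment(component, overrides):
    """Apply per-environment overrides without touching the base component."""
    merged = dict(component)
    merged.update(overrides)
    return merged


train = instantiate(
    TF_JOB_PROTOTYPE, name="my-training", image="example.com/my-team/train:latest"
)
staging = for_environment(train, {"num_workers": 1})
production = for_environment(train, {"num_workers": 8, "num_gpus": 1})

print("staging:", staging)
print("production:", production)
```

Staging and production share one definition and differ only in the parameters that actually need to differ.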

Kubeflow has three primary components:

TF-Job: Submit TensorFlow jobs.
TF-Serving: Serve trained models.
Kubeflow-Core: Other core components.

Let’s see how easy it is to train and serve a Deep learning application using ksonnet and Kubeflow.

Install Kubeflow using Ksonnet

$ VERSION=v0.2.0-rc.1
$ ks registry add kubeflow github.com/kubeflow/kubeflow/tree/${VERSION}/kubeflow
$ ks pkg install kubeflow/core@${VERSION}
$ ks pkg install kubeflow/tf-serving@${VERSION}
$ ks pkg install kubeflow/tf-job@${VERSION}

# generate the kubeflow-core component from its prototype
$ ks generate core kubeflow-core --name=kubeflow-core --cloud=gke

# apply the component to our cluster
$ ks apply cloud -c kubeflow-core

Running a Deep learning training job

Just set the parameter for the image containing your Tensorflow code for training.
Configure the number of CPUs and GPUs required and the number of workers for distributed training.

$ ks generate tf-job my-training
$ ks param list
COMPONENT      PARAM          VALUE
=========      =====          =====
my-training    args           "null"
my-training    image          "null"
my-training    image_gpu      "null"
my-training    name           "my-training"
my-training    namespace      "default"
my-training    num_gpus       0
my-training    num_masters    1
my-training    num_ps         0
my-training    num_workers    0

# set the parameters for this job
$ ks param set my-training image $TRAIN_PATH
$ ks param set my-training name "train-"$VERSION_TAG

# apply the component to the cluster
$ ks apply cloud -c my-training
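
Once the number of workers and parameter servers goes above one, the training code needs to know its place in the cluster. With TFJob, this information typically reaches each pod through the TF_CONFIG environment variable. The sketch below parses it using only the standard library. The sample value and its host names are made up for illustration, and the exact contents depend on your Kubeflow version.

```python
import json
import os

# Hypothetical TF_CONFIG, shaped like the one a distributed job would see.
SAMPLE_TF_CONFIG = json.dumps(
    {
        "cluster": {
            "ps": ["my-training-ps-0:2222"],
            "worker": ["my-training-worker-0:2222", "my-training-worker-1:2222"],
        },
        "task": {"type": "worker", "index": 1},
    }
)


def read_tf_config(raw=None):
    """Summarize this process's role from a TF_CONFIG JSON string.

    Falls back to the TF_CONFIG environment variable, and to an empty
    (single-process) configuration when that is not set either.
    """
    if raw is None:
        raw = os.environ.get("TF_CONFIG", "{}")
    config = json.loads(raw)
    cluster = config.get("cluster", {})
    task = config.get("task", {})
    return {
        "role": task.get("type", "worker"),
        "index": int(task.get("index", 0)),
        "num_workers": len(cluster.get("worker", [])),
        "num_ps": len(cluster.get("ps", [])),
        "is_distributed": sum(len(hosts) for hosts in cluster.values()) > 1,
    }


info = read_tf_config(SAMPLE_TF_CONFIG)
print(info)
```

The same training image can then run unchanged on a laptop (no TF_CONFIG, single process) and in the cluster (TF_CONFIG set, distributed).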

To view the training progress

$ kubectl get pods
$ kubectl logs $POD_NAME

Now to serve the trained model

# create a ksonnet component from the prototype
$ ks generate tf-serving serve --name=my-training-service

# set the parameters and apply to the cluster
$ ks param set serve modelPath http://minio2:9000
$ ks apply cloud -c serve
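
Once the model is being served, client applications talk to it over the network. Here is a hedged sketch of what a prediction request could look like using TensorFlow Serving’s REST API. It assumes the REST endpoint is enabled and exposed on port 8501, which depends on the TensorFlow Serving and Kubeflow versions you run. The host name, model name, and input values are hypothetical. The request is only built and printed, not sent.

```python
import json
from urllib import request


def build_predict_request(host, model_name, instances, port=8501):
    """Build (but do not send) a TensorFlow Serving REST predict request.

    Assumption: the model server exposes its REST API on `port`.
    """
    url = "http://{}:{}/v1/models/{}:predict".format(host, port, model_name)
    body = json.dumps({"instances": instances}).encode("utf-8")
    return request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical service host, model name, and input row.
req = build_predict_request(
    "my-training-service", "my-training-service", [[0.0, 1.0, 2.0]]
)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))

# To actually send it from inside the cluster:
#   response = request.urlopen(req, timeout=10)
```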

Thank you for being with us so far :)

We saw the challenges of building a production-grade Machine learning application.
We learned how Kubeflow, along with Minio, addresses these issues.

The follow-up posts will contain hands-on, end-to-end examples covering training, serving, and building applications on top of popular ML/DL algorithms.

See you all soon, till then, Happy coding :)