AI Tales: Building a Machine Learning Pipeline Using Kubeflow and Minio

Simple, Scalable, Performant, Portable and Cost-effective

karthic Rao
Dec 9, 2018

This blog is the story of Joe and Kubeman! Interested? Let’s get started.

Welcome, Joe, our neighborhood Machine learning enthusiast.


Joe loves Machine Learning, Deep Learning, and Artificial Intelligence.


What does Joe do about it?


Like most of us, he takes up a bunch of courses on Machine learning, Deep learning, and Artificial Intelligence.


He builds TensorFlow, PyTorch, and Keras deep learning models on his laptop.


Joe is now an AI expert, which makes him the coolest dude in town, all set to rock and roll!


Joe is all excited: he joins a fancy startup that serves ML applications to millions of customers. Life couldn’t be better for Joe. However…


Joe soon realizes that building deep learning and machine learning products is much more than just running models on a chunk of data! It has its challenges, and it can be daunting!

This is what Joe thought building a machine learning product was all about!

Joe had a narrow vision of what building a machine learning product is all about! (This is what most of academia teaches you to do.)

However, this is what a real-world machine learning application actually looks like: building the machine learning model is just one small part of it.


Joe’s woes don’t stop at his wrong assumptions about what a real-world ML application involves. Portability issues start to haunt him:

  • Joe finds it hard to collaborate with fellow machine learning engineers in the company, since it’s hard to reproduce the same development environment across machines and operating systems.
  • Moving models from dev to prod turns out to be a nightmare.
  • A multi-cloud strategy (one provider for training, another for serving) makes it extremely intimidating to keep a consistent environment.

Joe realizes that a production-grade machine learning system has many building blocks.
It becomes clear to him that building machine learning applications is much like playing with Lego: most of the time you reuse existing blocks and plug in a few new ones.
You need an easy way to compose the system from those building blocks.


Joe is a machine learning engineer, and he is good at it.
However, he gets tired of having to bug the DevOps engineers.
He needs them to allocate machines, schedule jobs, set up environments, create subnets, configure service discovery, run the necessary services, collect metrics, and set up storage for the data.
Phew! The list goes on.

Joe wishes he had an easier way to handle the DevOps challenges that come with ML/DL.


Joe was primarily using TensorFlow. He realizes that training machine learning algorithms on large amounts of data and serving the trained models as APIs to millions of customers pose serious scaling issues.

Here are the factors that make scaling hard:

  • Storing vast volumes of data.
  • The computational overhead involved in processing large amounts of data.
  • Distributing Machine learning training across multiple computational resources.
  • Cost issues.

Joe found it hard to train on large amounts of data in a reasonable time.

Here are the primary reasons:

  • Throughput challenges from the underlying storage system during training.
  • Procuring CPUs, GPUs, TPUs, and other computational resources to scale up performance.
  • Throttling, rate limiting, bandwidth charges, and low throughput from public cloud storage systems.

Joe is tired and frustrated with trying to productionize the machine learning pipeline.
As a last resort, he moves to managed services.
They make things more comfortable: he can get started faster, and the DevOps work is abstracted away. It is all great.

Then the bills arrive!


Joe knows that managed services make things easier.
However, the billing woes add massive pressure on his startup’s goal of becoming profitable soon.

Joe is worked up. He is frustrated and just about to give up. Then Kubeman arrives to the rescue!


Kubernetes runs everywhere: on your local machine, on Google Cloud, on AWS, on Microsoft Azure. By running machine learning and deep learning workloads on Kubernetes, you solve the portability issue.


Fortunately, Kubernetes makes the management of distributed workloads easy.
Kubernetes is a mature, production-ready platform.
It gives developers a simple API to deploy programs to a cluster of machines as if they were a single piece of hardware.
This not only simplifies deployments but also leads to better resource utilization, which lowers cost.
Using Kubernetes, computational resources can be added or removed on demand.
The same cluster can be used to train the model and then, once training is over, to serve it.
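As a minimal sketch of that workflow (assuming kubectl is already pointed at a cluster; the deployment and image names are hypothetical):

$ kubectl create deployment my-ml-app --image=example.com/my-ml-app:latest
$ kubectl scale deployment my-ml-app --replicas=4   # add resources on demand
$ kubectl get pods                                  # the cluster schedules everything for you
$ kubectl delete deployment my-ml-app               # tear it down when the job is over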

Image for post
Image for post

Joe: Hey Kubeman, we all know what Kubernetes offers.

But I’m a machine learning engineer, and one can’t deny that you need significant DevOps expertise to use Kubernetes for machine learning.

Here are my concerns about using Kubernetes to do machine learning.

Kubeman’s answer: Kubeflow!
  • Kubernetes provides a consistent way to deploy and run your applications, and Kubeflow helps you define your ML pipeline on top of Kubernetes.
  • Kubeflow adds some resources to your cluster to assist with a variety of tasks, including training and serving models and running Jupyter Notebooks.
  • It extends the Kubernetes API by adding new Custom Resource Definitions (CRDs) to your cluster, so machine learning workloads can be treated as first-class citizens by Kubernetes, as sketched below.
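For instance, once the CRDs are installed, a TensorFlow training job can be declared like any other Kubernetes resource. A minimal sketch, with a hypothetical image name; the exact TFJob schema varies across Kubeflow versions:

# assumes the Kubeflow CRDs are already installed in the cluster
$ cat <<EOF | kubectl apply -f -
apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  name: my-tf-job
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 1
      template:
        spec:
          containers:
            - name: tensorflow
              image: example.com/my-model-trainer:latest   # hypothetical training image
          restartPolicy: OnFailure
EOF
$ kubectl get tfjobs   # the training workload is now a first-class Kubernetes object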

The Data challenge

Image for post
Image for post
Image for post
Image for post

Here is the add-on: Minio. Minio fits amazingly well into the cloud-native environment inside Kubernetes. It’s simple, fast, scalable, and S3-compatible. Since Minio speaks the standard S3 API, the same training code works whether the data lives in a local Minio deployment or in a public cloud bucket, which makes it a great fit for storing deep learning training data.
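A quick sketch of what this looks like in practice: TensorFlow’s S3 filesystem support can read training data from, and write checkpoints to, a Minio bucket. The endpoint, credentials, bucket names, and train.py script below are all hypothetical:

# point TensorFlow's S3 filesystem at a Minio server (hypothetical endpoint and credentials)
$ export S3_ENDPOINT=minio-service:9000
$ export S3_USE_HTTPS=0
$ export AWS_ACCESS_KEY_ID=minio
$ export AWS_SECRET_ACCESS_KEY=minio123
# training data and model checkpoints now live in Minio buckets
$ python train.py --data_dir s3://training-data/mnist --model_dir s3://models/mnist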


YAML is Scary!


Kubeflow makes use of ksonnet to help manage deployments. ksonnet acts as another layer on top of kubectl.

  • While Kubernetes is typically managed with static YAML files, ksonnet adds a further abstraction that is closer to standard OOP objects.
  • Resources are managed as prototypes with empty parameters, which can be instantiated into components by defining values for those parameters.

The advantage? This system makes it easier to deploy slightly different resources to different clusters at the same time, and therefore to maintain separate staging and production environments, as the sketch below shows.
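In practice the ksonnet workflow looks like this (the component name, image, and environments are hypothetical, and the staging and prod environments are assumed to exist already):

# instantiate the 'deployed-service' prototype into a component
$ ks generate deployed-service my-api --image=example.com/my-api:v1
# override a single parameter per environment
$ ks param set my-api replicas 1 --env=staging
$ ks param set my-api replicas 10 --env=prod
# apply each environment to its cluster
$ ks apply staging
$ ks apply prod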

Kubeflow has three primary components:

Kubeflow’s CRDs, defined via ksonnet:
  • TF-Job: submit TensorFlow training jobs
  • TF-Serving: serve the trained models
  • Kubeflow-Core: the other core components, such as support for running Jupyter Notebooks

Let’s see how easy it is to train and serve a deep learning application using ksonnet and Kubeflow:

$ VERSION=v0.2.0-rc.1
$ ks registry add kubeflow github.com/kubeflow/kubeflow/tree/${VERSION}/kubeflow
$ ks pkg install kubeflow/core@${VERSION}
$ ks pkg install kubeflow/tf-serving@${VERSION}
$ ks pkg install kubeflow/tf-job@${VERSION}
# generate the kubeflow-core component from its prototype
$ ks generate core kubeflow-core --name=kubeflow-core --cloud=gke

# apply the component to our cluster
$ ks apply cloud -c kubeflow-core

Just set the parameter for the image containing your TensorFlow training code, then configure the number of CPUs and GPUs required and the number of workers for distributed training.

$ ks generate tf-job my-training
$ ks param list
COMPONENT    PARAM        VALUE
=========    =====        =====
my-training  args         "null"
my-training  image        "null"
my-training  image_gpu    "null"
my-training  name         "my-training"
my-training  namespace    "default"
my-training  num_gpus     0
my-training  num_masters  1
my-training  num_ps       0
my-training  num_workers  0

# set the parameters for this job
$ ks param set my-training image $TRAIN_PATH
$ ks param set my-training name "train-"$VERSION_TAG

# apply the component to the cluster
$ ks apply cloud -c my-training
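To run distributed training on GPUs, set the corresponding parameters before applying; the values below are illustrative, and $TRAIN_PATH_GPU (a GPU-enabled training image) is hypothetical:

$ ks param set my-training num_gpus 1
$ ks param set my-training num_workers 2
$ ks param set my-training image_gpu $TRAIN_PATH_GPU
$ ks apply cloud -c my-training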

To view the training progress:

$ kubectl get pods
$ kubectl logs $POD_NAME
Once training is done, serving the trained model is just as easy:

# create a ksonnet component from the serving prototype
$ ks generate tf-serving serve --name=my-training-service

# set the parameters and apply to the cluster
$ ks param set serve modelPath http://minio2:9000
$ ks apply cloud -c serve
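Once the serving component is up, you can locate its service and send it predictions. A hypothetical sketch: the port, model name, and input shape below are assumptions, and older Kubeflow releases exposed TensorFlow Serving only over gRPC, so a REST-enabled serving image is assumed here:

$ kubectl get svc my-training-service
# forward the (assumed) REST port of TensorFlow Serving
$ kubectl port-forward svc/my-training-service 8501:8501 &
$ curl -d '{"instances": [[1.0, 2.0, 5.0]]}' \
    http://localhost:8501/v1/models/my-training-service:predict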

Thank you for being with us so far :)

  • We witnessed the challenges of building a production-grade Machine learning application.
  • We learned how Kubeflow, along with Minio, addresses these issues.

The sequels to this blog will contain hands-on, end-to-end examples of training, serving, and building applications on top of popular ML/DL algorithms.

See you all soon, till then, Happy coding :)

karthic Rao

Co-founder at Stealth. Code with Love, Learn with Passion, Feel the music, Live like a hacker.

Kredaro-engineering

Data driven SRE. High performance, Cost effective Analytics, ML/AI at Scale. Ping us at hello@kredaro.com.
