Simple, Scalable, Performant, Portable and Cost effective

karthic Rao
Dec 9, 2018 · 9 min read

Welcome to AI tales from Kredaro :) This is the story of Joe and Kubeman! Interested? Let’s get started.

Welcome Joe, our neighbourhood Machine learning enthusiast.

Joe loves Machine Learning, Deep Learning and Artificial Intelligence.

What does Joe do about it?

Like most of us, he takes a bunch of courses on Machine learning, Deep learning and Artificial Intelligence.

He builds TensorFlow, PyTorch and Keras Deep learning models on his laptop.

Joe is an AI expert! Which makes Joe the coolest dude in town, all set to Rock and Roll!!!!

Joe is all excited. He joins a fancy startup that builds ML applications for millions of customers. Life couldn’t have been better for Joe. But….

The ground realities of building Machine learning applications

Joe soon realises that building Deep learning and Machine learning products is much more than just running models on a chunk of data!! It has its own challenges, and it can be daunting!

This is what Joe thought building a Machine learning product was all about!

Joe had assumed that Machine learning is all about building the Machine learning model (this is what most academic courses teach you to do).

But this is what a real world Machine learning application might look like:

In the real world, building the Machine learning model is just one part of the application.

The portability challenge

Joe’s woes don’t stop at his wrong assumption about what a real world ML application looks like. Portability issues start to haunt him:

  • Joe finds it hard to collaborate with fellow Machine learning engineers in the company, since it’s hard to get the same development environment across machines and operating systems.
  • Moving the models from dev to prod turns out to be a nightmare.
  • The multicloud strategy (one provider for training, another for serving …) makes it extremely intimidating to keep a consistent environment.

Once Joe realizes that there are many building blocks to a production grade Machine learning application, it becomes clear to him that building tens of Machine learning applications is at times similar to a game of Lego. Most of the time you reuse the existing blocks and plug in a few new ones, so he needs an easy way to compose the system from various building blocks.

Joe is a Machine learning engineer and he is good at it. But he gets tired of having to bug the Devops engineers to allocate machines, schedule jobs, set up environments, create subnets, configure service discovery, run the necessary services, collect metrics, set up storage for the data, and the list just goes on. Joe wishes he had an easy way to solve the Devops challenges associated with ML/DL.

Joe was primarily using TensorFlow. He realizes that training Machine learning algorithms on large amounts of data and serving the trained model as APIs to millions of customers poses a serious scaling issue. Here are the factors that make scaling hard:

  • Storing huge volumes of data.
  • Huge computation involved in processing large amounts of data during Machine learning training.
  • Distributing the Machine learning training across multiple computational resources.
  • Cost issues.

Joe found it hard to train on large amounts of data in a reasonable time. Here were the primary reasons:

  • Throughput limits of the underlying storage system during training.
  • Procuring CPUs, GPUs, TPUs and other computational resources to scale up the performance.
  • Throttling, rate limiting, bandwidth charges and low throughput from public cloud storage systems.

Joe is tired and frustrated from trying to productionize the Machine learning pipeline. As a last resort, Joe moves to managed services. It made things easier: he could get started faster, and Devops was abstracted away. It was all great until…….

It was all amazing, until the bills arrived! Joe knew that satisfying the billing appetite of managed services would put huge pressure on his startup’s goal of becoming profitable soon.

Joe is worked up and frustrated. Just when he is about to give up…. Kubeman arrives to the rescue!

Kubernetes already solves most of these issues!

You run wherever Kubernetes runs: Portability solved

Kubernetes runs everywhere: on your local machine, on Google Cloud, on Amazon’s cloud, on Microsoft Azure. So running Machine learning and Deep learning workloads on Kubernetes solves the portability issues.

Pick your containers and let K8s orchestrate: Composability solved!

Fortunately, Kubernetes makes management of distributed workloads easy. Kubernetes is a mature, production ready platform that gives developers a simple API to deploy programs to a cluster of machines as if they were a single piece of hardware. This not only simplifies deployments but also leads to better resource utilization, and thus costs you less. Using Kubernetes, computational resources can be added or removed as desired, and the same cluster that trains the model can be used to serve it once training is over.

Joe: Hey Kubeman, we all know what Kubernetes offers. But I’m a Machine learning engineer, and one can’t deny the fact that you need significant Devops expertise to use Kubernetes for Machine learning. Here are my concerns about using Kubernetes to do Machine learning.

Introducing Kubeflow! Phew…!

  • Kubernetes provides a consistent way to deploy and run your applications, and Kubeflow helps you define your ML pipeline on top of Kubernetes.
  • Kubeflow adds resources to your cluster to assist with a variety of tasks, including training models, serving models and running Jupyter Notebooks.
  • It extends the Kubernetes API by adding new Custom Resource Definitions (CRDs) to your cluster, so Machine learning workloads can be treated as first-class citizens by Kubernetes.
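
To make the CRD point concrete, here is a minimal sketch of what a TensorFlow training job can look like once it is a Kubernetes resource. The apiVersion and field names are assumptions based on the TFJob resource shipped around Kubeflow 0.2 and may differ in your version; the job name and image are hypothetical.

```shell
# Write a minimal TFJob manifest to a temporary file.
# Assumptions: apiVersion and field layout follow the TFJob CRD of roughly
# Kubeflow 0.2; the job name and image are hypothetical placeholders.
manifest="$(mktemp)"
cat > "$manifest" <<'EOF'
apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  name: my-training
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2
      template:
        spec:
          containers:
            - name: tensorflow
              image: registry.example.com/train:latest
EOF
echo "TFJob manifest written to $manifest"

# With Kubeflow installed, the job is handled like any other resource:
#   kubectl apply -f "$manifest"
#   kubectl get tfjobs
```

The point is that the training job is created, listed and deleted with the same kubectl verbs Joe would use for a Pod or a Service.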

The Data challenge

Here is the add-on, Minio. Minio fits amazingly well into the Cloud native environment inside Kubernetes. It’s simple, fast, scalable and S3 compatible, which makes it a good fit for storing the data required for Deep learning training.
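
Because Minio speaks the S3 protocol, TensorFlow’s S3 filesystem support can usually be pointed at it with a handful of environment variables. The sketch below assumes a Minio service reachable at minio2:9000 without TLS; the credentials are hypothetical placeholders.

```shell
# Hypothetical credentials: replace with the keys of your own Minio server.
export AWS_ACCESS_KEY_ID="minio-access-key"
export AWS_SECRET_ACCESS_KEY="minio-secret-key"

# Point TensorFlow's S3 filesystem at Minio instead of Amazon S3.
export S3_ENDPOINT="minio2:9000"   # host:port, without a scheme
export S3_USE_HTTPS="0"            # assumption: Minio is served over plain HTTP
export S3_VERIFY_SSL="0"

echo "TensorFlow S3 endpoint set to $S3_ENDPOINT"
```

With these set in the training container, paths of the form s3://bucket/key (bucket name of your choosing) are resolved against Minio.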

YAML is Scary!

Kubeflow makes use of ksonnet to help manage deployments. ksonnet acts as another layer on top of kubectl.

  • While Kubernetes is typically managed with static YAML files, ksonnet adds a further abstraction that is closer to objects in standard OOP.
  • Resources are managed as prototypes with empty parameters, which are instantiated into components by defining values for those parameters.

This system makes it easier to deploy slightly different resources to different clusters at the same time, which makes it easy to maintain separate environments for staging and production.
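
The per-environment idea can be sketched as a small helper around the ks CLI. It assumes a ksonnet environment with the given name already exists (for example one created with ks env add); the function name, environment names and image tags are hypothetical.

```shell
# Override a component's image for one environment, then apply it there.
# Usage: deploy_to_env <environment> <component> <image>
deploy_to_env() {
  env_name=$1
  component=$2
  image=$3
  # The override is stored for this environment only; other environments
  # keep the component's default parameter values.
  ks param set "$component" image "$image" --env="$env_name" &&
    ks apply "$env_name" -c "$component"
}

# Example (not executed here):
#   deploy_to_env staging    my-training registry.example.com/train:dev
#   deploy_to_env production my-training registry.example.com/train:v1
```

The same component definition is reused in both calls; only the parameter values differ per environment.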

Kubeflow has 3 primary components

Kubeflow CRDs defined by ksonnet
  • Tf-Job: Submit TensorFlow jobs
  • Tf-Serving: Serve the trained model
  • Kubeflow-Core: Other core components

Let’s see how easy it is to train and serve a Deep learning application using ksonnet and Kubeflow

Install Kubeflow using Ksonnet

$ VERSION=v0.2.0-rc.1
$ ks registry add kubeflow${VERSION}/kubeflow
$ ks pkg install kubeflow/core@${VERSION}
$ ks pkg install kubeflow/tf-serving@${VERSION}
$ ks pkg install kubeflow/tf-job@${VERSION}

# generate the kubeflow-core component from its prototype
$ ks generate core kubeflow-core --name=kubeflow-core --cloud=gke

# apply the component to our cluster
$ ks apply cloud -c kubeflow-core

Running a Deep learning training job

Just set the parameters (the image containing your TensorFlow training code, the number of CPUs and GPUs required, and the number of workers for distributed training) and you’re set.

$ ks generate tf-job my-training
$ ks param list
COMPONENT    PARAM        VALUE
=========    =====        =====
my-training  args         "null"
my-training  image        "null"
my-training  image_gpu    "null"
my-training  name         "my-training"
my-training  namespace    "default"
my-training  num_gpus     0
my-training  num_masters  1
my-training  num_ps       0
my-training  num_workers  0

# set the parameters for this job
$ ks param set my-training image $TRAIN_PATH
$ ks param set my-training name "train-"$VERSION_TAG

# apply the component to the cluster
$ ks apply cloud -c my-training

To view the training progress

$ kubectl get pods
$ kubectl logs $POD_NAME
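
Finding the right $POD_NAME by hand gets tedious, so here is a small sketch that lists the pods of a job and follows the logs of the first one. It assumes the pods created for a job have names that start with the job name followed by a dash, which depends on the Kubeflow version; the function names are hypothetical.

```shell
# List the names of all pods whose name starts with "<job>-".
job_pods() {
  kubectl get pods --no-headers -o custom-columns=NAME:.metadata.name |
    grep "^$1-"
}

# Follow the logs of the first pod that belongs to the job.
follow_job_logs() {
  pod="$(job_pods "$1" | head -n 1)"
  if [ -z "$pod" ]; then
    echo "no pods found for job $1" >&2
    return 1
  fi
  kubectl logs -f "$pod"
}

# Example (not executed here):
#   follow_job_logs my-training
```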

Now to serve the trained model

# create a ksonnet component from the prototype
$ ks generate tf-serving serve --name=my-training-service

# set the parameters and apply to the cluster
$ ks param set serve modelPath http://minio2:9000
$ ks apply cloud -c serve

Thank you for being with us so far :)

  • We witnessed the challenges of building a production grade Machine learning application.
  • We learned how Kubeflow along with Minio addresses these issues.

The upcoming posts in this series will contain hands-on, end-to-end examples covering training, serving and building applications on top of popular ML/DL algorithms.

See you all soon, till then… Happy coding :)

About: We are Kredaro, a very early-stage startup focused on data-driven SRE, high-performance and cost-effective analytics, and ML/AI at scale. Let’s talk! Ping us at


Data driven SRE. High performance, Cost effective Analytics, ML/AI at Scale. Ping us at

karthic Rao

Written by

Developer Advocate at Dgraph labs. Code with Love, Learn with Passion, Feel the music, Live like a hacker.

