Building a Data Science Workbench on Kubernetes — Part 1

Souvik Ghosh · Published in strai · Feb 18, 2019 · 8 min read

Quite recently, I have been digging into quick experiments with the Rasa stack, which uses advanced NLP techniques for building chatbots (one of my favourite subjects). I have also heard so much about so many open-source technologies lately that it drew me here to write about them.

Today, the market offers many Data Science Workbench solutions: Azure Databricks, Domino Data Lab, Dataiku, AWS SageMaker, just to name a few, or, to be honest, the ones I have heard of. 😜 But the latest trend in the market is Kubernetes. Most companies, especially big ones deploying microservices, want to be cloud-vendor agnostic and are slowly moving towards deploying their applications as Docker images on container orchestration solutions like Kubernetes. It is really cool and even lets you jump ship between the top four cloud providers (AWS, GCP, Azure, DigitalOcean) at any given time, making the cloud computing space ever so competitive and dramatic. Thank you, Google (the company that open-sourced Kubernetes).

So if engineers are taking advantage of such impressive technology, why can’t Data science?

Objective

Some great open-source data science and data processing tools are already used every day by many companies, from Anaconda Cloud to H2O.ai to Apache Spark, or even a simple open-source workbench like JupyterHub. They all come with their advantages and disadvantages, but like every technologist, we all have our preferences, and in a world where it is the data that dictates value, numerous tools can each solve many different problems and WE WANT THEM ALL!! At least I do.

Anyway 😌, let's look at my objective for writing this article. Here's my wishlist:

  • Collaborative workspace — Provide a good collaborative environment for a data science team to work together. Ref: github.com. I will be using GitHub because I am used to it; there are alternatives you can deploy on your own, such as gitlab.com.
  • Favourite tools and technologies readily available for data science team members — For me, I will start with Jupyter and Python and install scikit-learn (a Data Scientist's 1–800 tech support).
  • Dynamic space allocation — Making sure that I, as a Data Scientist, have enough space to load my dataset into a database (PostgreSQL) and a file system to store my models in a persistent volume. I need some space!!
  • Dynamic compute — Writing boilerplate deployment configuration for Kubernetes to deploy a workbench environment consisting of two things:
    - Storage (PostgreSQL)
    - Workspace (Jupyter Notebook)
    and shutting it down when I am done.
  • Workspace isolation — Using namespaces to isolate my working environment, essentially creating walls accessible only to a particular project (see the sketch after this list).
  • Easy deployment — Use one dedicated node in my cluster to deploy apps that can serve users over an ingress
  • Administration Guide — Using tools such as Kubernetic to manage and monitor my cluster.
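To make the workspace isolation and dynamic compute points a little more concrete, here is a minimal sketch with plain kubectl. The namespace and quota names are placeholders I made up, and the CPU/memory numbers are only illustrative:

$ kubectl create namespace adult-income
$ kubectl create quota workbench-quota --hard=cpu=2,memory=4Gi --namespace adult-income
$ kubectl config set-context $(kubectl config current-context) --namespace=adult-income

Everything deployed for the project then lives inside that namespace, and deleting the namespace tears the whole workspace down in one go.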

But of course, I am a poor man, so the entire demonstration will run on my laptop :D using Minikube (however, you could take all these patterns and deploy them on your own cloud platform). That's the beauty of it all.

I will also reduce the scope of this article by focusing on one type of problem: we will select a specific toolset and a specific, easy machine learning problem and see if it makes sense to use Kubernetes to orchestrate the holy grail!! The article is meant for your inspiration only; it is my pet project, and I am in no way affiliated with any of the tools I will be using here. Essentially, I stand to gain nothing but to share my experience and probably learn from the readers, so feel free to comment or connect with me to discuss more about data science and the typical engineering adventures related to it. I am happy to chat about it. If you are in Brussels, let's grab a coffee!!

Problem Statement

So in order to have a precise scope, let’s define a problem. Here’s what I want to achieve from this experiment.

I will take this dataset (see below) as an example, perform some exploratory analysis, and build a model to see if it can predict whether a given person earns more than $50,000 a year or not.

http://archive.ics.uci.edu/ml/datasets/Adult

This looks like such an interesting problem
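If you want the raw files on your laptop right away, they live next to the dataset description on the UCI server (assuming the usual UCI directory layout for this dataset):

$ curl -O http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
$ curl -O http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.names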

Tools and technologies I would like to use as a (pseudo) Data Scientist

  • Jupyter Notebook with Python 3.6
  • scikit-learn
  • Plotly or Matplotlib

Tools and technologies I would need as an ML Engineer to write production-like code

  • Docker
  • Python
  • Swagger (API)
  • Git
  • CI/CD pipeline

Now that we know our problem, let's look at the architecture.

Solution Architecture

Reference Architecture

Let’s describe each component one by one

Github (code repository and much more) — I will be using GitHub because of the following features I am interested in:

  • Peer review
  • Automatic Vulnerability scanning
  • Webhooks — Notify me on Slack, trigger automatic code deployments through Jenkins
  • Issue tracker such as Zendesk

— that’s me, so if you like what you read, feel free to visit my other repositories 😃

Docker Hub — To avoid creating my own private registry, it is simpler to use Docker Hub to store all my built images for the different apps that need to be deployed. One could simply get a private repository on Docker Hub itself.

https://hub.docker.com/
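The workflow with Docker Hub is the usual build, tag and push loop; the image name below is just a placeholder for whatever app you end up packaging:

$ docker build -t <dockerhub-user>/workbench-model-api:0.1 .
$ docker login
$ docker push <dockerhub-user>/workbench-model-api:0.1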

Kubernetic — Kubernetic is a simple desktop client that can manage your Kubernetes cluster deployed anywhere. It allows you to add and manage your resources on Kubernetes and even provides options to add new pods/services, among other things. It gives you a clear view of your cluster and lets you manage it smoothly.

Let's call these our external dependencies. Now let's see what we would like to set up inside our cluster.

Helm — Helm is a Kubernetes package manager. It simplifies Kubernetes deployments down to something as simple as "npm install". We will use Helm to create our deployment packages for the workbench, provide adequate workspace per use case in terms of vCPU and memory, and use Kubernetic to shut down pods on a schedule when the work has been completed, freeing them up for other jobs.

Jenkins — Our CI/CD friend. We will build different deployment pipelines using Jenkins and even provide a beautiful wrapper pipeline that can deploy models to our industrialised namespace.

Minio — I had to give this some thought: simple network-attached storage or something a little more sophisticated? There was the option of using OSS Nexus, or of really going for object storage with an S3 API. Minio is cool, and the idea for me is to store my models, or the output of my models, so that they can be shared.

ELK — I have kept it in grey since, at this moment, it is out of scope. In fact, monitoring models and your model microservices is a whole other subject that I would like to discuss in a different article.

JupyterHub — We need the team to be able to start up a new notebook with their own environment. We do so by installing JupyterHub using Helm, or, for a demo, let's just install a single notebook.
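As a rough sketch only (following the JupyterHub Helm chart, and assuming a config.yaml that contains at least a proxy.secretToken you generate yourself), a Helm 2 style install looks something like this:

$ helm repo add jupyterhub https://jupyterhub.github.io/helm-chart/
$ helm repo update
$ openssl rand -hex 32    # paste the output into config.yaml as proxy.secretToken
$ helm install jupyterhub/jupyterhub --name jhub --namespace jhub -f config.yaml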

Great, now we have all our big tools. Time to build 🏗!!

Setup Guide

Let’s start simply by installing minikube

Minikube

Use the installation steps provided here: https://kubernetes.io/docs/setup/minikube/

Once you have VirtualBox installed, all I needed to do was:

$ brew cask install minikube
$ minikube version

Now that I have my minikube, let’s also download Kubernetic and install the desktop client

After installing Kubernetic, in order for it to receive the config and visualise our minikube cluster, let’s start our minikube

$ minikube start
Starting local Kubernetes v1.10.0 cluster...
Starting VM...
Getting VM IP address...
Moving files into cluster...
Downloading kubeadm v1.10.0
Downloading kubelet v1.10.0
Finished Downloading kubelet v1.10.0
Finished Downloading kubeadm v1.10.0
Setting up certs...
Connecting to cluster...
Setting up kubeconfig...
Starting cluster components...
Kubectl is now configured to use the cluster.
Loading cached images from config file.
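Minikube's default VM is fairly small, so if you already know your notebooks and database will need more headroom, you can be explicit when starting it. The numbers below are only an example; pick what your laptop can spare:

$ minikube start --cpus 4 --memory 8192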

Kubernetic

Once I have my minikube ready, let's see how it looks in the Kubernetic dashboard.

Kubernetic Dashboard

I can see I have one node running on my minikube, which is enough for the moment.

Helm

My minikube is up and running. We will first install the Helm package manager; the instructions can be found here.

If you are running your own minikube cluster, all you need to do is:

$ helm init

Now that we have Helm installed, we can simply deploy all of our different tools for building up the workbench.
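Before moving on, a quick way to check that helm init actually brought up Tiller inside the cluster:

$ helm version
$ kubectl get pods --namespace kube-system -l name=tiller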

Jenkins

Let's start with Jenkins; many of our deployment charts providing kernels to JupyterHub will be enabled by Jenkins itself. We will also build the CI/CD pipeline for our models through our Jenkins deployment pipeline.

We install Jenkins with Helm on our minikube by doing something as simple as:

$ helm install --name jenkins --set Master.ServiceType=NodePort stable/jenkins

You can monitor the deployment in the Kubernetic dashboard. Since you cannot provision a LoadBalancer on minikube, the service type needs to be changed to NodePort. #thankyoustackoverflow

This will install Jenkins with a NodePort service and give you some instructions. Follow Step 1 to retrieve the password, then retrieve the URL using minikube service instead of using Step 2 of the instructions:

$ printf $(kubectl get secret --namespace default jenkins -o jsonpath="{.data.jenkins-admin-password}" | base64 --decode);echo
$ minikube service jenkins --url

The second command will give you the URL to call.

user: admin
password: <password from step 1>

Voila!! We have jenkins running.

jenkins homepage

We will come back to it later

Minio

To install Minio, again just using Helm:

$ helm install stable/minio
$ export POD_NAME=$(kubectl get pods --namespace default -l "release=coiled-badger" -o jsonpath="{.items[0].metadata.name}")
$ kubectl port-forward $POD_NAME 9000 --namespace default

You can access minio at http://localhost:9000
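One small note: coiled-badger above is simply the release name Helm generated for me, so yours will be different. If you prefer a predictable name (so the label selector never changes), you can name the release explicitly, exactly as we did for Jenkins:

$ helm install --name minio stable/minio
$ export POD_NAME=$(kubectl get pods --namespace default -l "release=minio" -o jsonpath="{.items[0].metadata.name}")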

Now, for the access key and secret

Go to Kubernetic Dashboard

Go to Config and view the config of the coiled-badger release. You can see your accessKey and secretKey.

Voila!! You can now access minio from the browser.

P.S. To store the models, we will use the Minio SDK and put them in a particular bucket.
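The SDK route is for the actual pipeline; as a quick sanity check from a terminal, the MinIO client mc works as well. The alias, bucket name, keys and file below are all placeholders for your own values:

$ mc config host add workbench http://localhost:9000 <accessKey> <secretKey>
$ mc mb workbench/models
$ mc cp model.pkl workbench/models/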

Let’s go back to the Kubernetic Dashboard to visualise what is running so far

kubernetic Dashboard

In Part 2, I will focus on deploying the pipelines for the Data Science Workbench on a Jupyter Notebook instance along with a database, followed by the tutorial, and we will see if we can deploy the model with our favourite tools.

Stay tuned!!
