
Docker and Kubernetes for Data Science

TensorFlow, PyTorch, pandas, NumPy, protobuf, Dask, scikit-learn, Keras, XGBoost, LightGBM, SciPy, and the list goes on. An equivalent set of packages is available in R and other languages as well.

Every package has multiple actively maintained versions, and on top of that there are nightly builds for working with the latest changes: TensorFlow Nightly, PyTorch Nightly, and so on.

Packages and versions aside, there are dependencies between packages and their versions. PyTorch might require version X of NumPy, whereas TensorFlow might require version Y.
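To make the clash concrete, here is a minimal Python sketch that compares the pinned requirements of two projects and reports packages pinned to different versions. The project names and version numbers are made up for illustration; they are not real compatibility data.

```python
def find_conflicts(*requirement_sets):
    """Return {package: sorted distinct versions} for packages pinned differently."""
    pins = {}
    for requirements in requirement_sets:
        for package, version in requirements.items():
            pins.setdefault(package, set()).add(version)
    # Keep only packages that at least two projects pin to different versions
    return {pkg: sorted(versions) for pkg, versions in pins.items() if len(versions) > 1}


# Hypothetical pins, invented for this example
project_a = {"numpy": "1.16.4", "torch": "1.1.0"}
project_b = {"numpy": "1.14.5", "tensorflow": "1.13.1"}

conflicts = find_conflicts(project_a, project_b)
print(conflicts)  # {'numpy': ['1.14.5', '1.16.4']}
```

Both projects cannot share a single environment without one of them giving up its pin, which is exactly the situation separate containers avoid.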

It does not stop there: packages also depend on the hardware accelerator. There is TensorFlow with CUDA, Intel-optimized TensorFlow, and maybe tomorrow OpenCL, AMD GPUs, and so on.

Such dependencies are manageable when one is running experiments on a local laptop at a limited scale. Now think of an enterprise where hundreds of machine learning and AI projects run on tooling and infrastructure managed by a centralized IT team. Every project brings its own dependencies, packages and versions. The IT team ends up constantly maintaining multiple custom environments while also ensuring that old code does not break.

Even if each data science team maintains its own custom environment, it is difficult and time-consuming for software engineers to replicate those environments in production.

And that is not all. The operating system also differs from the desktop to the training servers to the deployment servers, and possibly to the cloud.

Is that all? No. But I will stop here and focus on a solution that overcomes these custom-environment challenges and brings more collaboration between data scientists, the IT team, software engineers and other key stakeholders.

You guessed it: containers. While there are many container environments, I am going to talk mostly about Docker, and for container orchestration we are going to focus on Kubernetes.

What is Docker?

Docker allows applications to be packaged in self-contained environments, which speeds up deployment and brings production closer to parity with the training or development environment.

What is Kubernetes?

Kubernetes automates container provisioning, networking, load-balancing, security, and scaling.

Kubernetes makes developing and deploying machine learning models simple, consistent and scalable.

There are many benefits to using containers and Kubernetes, and they come in handy across the entire data science lifecycle, from training through deployment.


Now, coming to the topic: containers play an important role in

📍 Infrastructure

📌 Enabling Multi Tenancy

📌 Hybrid Cloud

📍 Tooling and Reproducibility

📍 Deployment

Infrastructure

Infrastructure is one of the key investments in a data science initiative. An enterprise has to invest in high-performance systems and GPUs to accelerate it.

Model training typically uses the infrastructure heavily for a brief period, after which the cluster sits almost idle. Kubernetes allows resources to be shared efficiently and enables multi-tenancy, so multiple machine learning projects within an enterprise can use the same infrastructure more efficiently.

One additional benefit of Kubernetes is its support for cloud-native as well as cloud-ready architectures. It makes it easier to build a hybrid cloud strategy that uses on-premise resources and bursts out to the cloud as needed. This way on-premise investment can be kept to a minimum and the cloud can be used as extended infrastructure, keeping cost under control.

Tooling and Reproducibility

With multi-tenancy, every project might have its own tools and its own specific versions of them. Containers help you create isolated environments for each project, with all dependencies bundled in.

Also, with all dependencies bundled, once a model is developed it is easy to hand the container to software engineers for deployment rather than sending lengthy installation instructions and a dependency matrix.
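To get a feel for how long that dependency matrix can be, the following sketch lists every package installed in the current Python environment as `name==version` pins, using the standard library's `importlib.metadata` (Python 3.8+). The function name `snapshot_environment` is just an illustrative choice; a container image effectively freezes this whole list, plus the operating system underneath it.

```python
from importlib import metadata


def snapshot_environment():
    """Return sorted 'name==version' pins for every installed distribution."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with missing or broken metadata
            pins.add(f"{name}=={dist.version}")
    return sorted(pins, key=str.lower)


if __name__ == "__main__":
    for pin in snapshot_environment():
        print(pin)
```

The output depends entirely on the machine it runs on, which is the point: two laptops rarely print the same list, while two containers started from the same image do.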

Deployment

Once a model is trained, just add a quick serving function to the container that loads the model, and bundle the pre-processing pipeline's dependencies along with it. The serving function can be a Flask app, a Java Spring application, TensorFlow Serving or something else, depending on the algorithm or tool used to develop the model.
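As a rough idea of what such a serving function looks like, here is a minimal sketch that uses only the Python standard library instead of Flask so that it runs anywhere. Everything in it is an assumption made for illustration: the `MeanModel` stand-in, the `preprocess` step, the `{"instances": [...]}` request shape and port 8080. A real service would load a trained model from disk and reuse the training pipeline's pre-processing code.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class MeanModel:
    """Stand-in for a trained model; a real one would be loaded from a file."""

    def predict(self, rows):
        return [sum(row) / len(row) for row in rows]


def preprocess(payload):
    """Hypothetical pre-processing step: cast every feature to float."""
    return [[float(x) for x in row] for row in payload["instances"]]


MODEL = MeanModel()  # created once when the container starts, not per request


def handle_predict(body):
    """Turn a JSON request body into (status_code, JSON response bytes)."""
    try:
        rows = preprocess(json.loads(body))
        return 200, json.dumps({"predictions": MODEL.predict(rows)}).encode()
    except (ValueError, KeyError, TypeError, ZeroDivisionError) as exc:
        return 400, json.dumps({"error": str(exc)}).encode()


class PredictHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_predict(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Blocking entry point; this is what the container would run on start."""
    HTTPServer(("0.0.0.0", port), PredictHandler).serve_forever()
```

Calling `serve()` starts the server, and a POST with the body `{"instances": [[1, 2, 3]]}` would return `{"predictions": [2.0]}`. Keeping `handle_predict` separate from the HTTP handler makes the model logic easy to test without a running server.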

The real benefit of Kubernetes during deployment comes from scaling resources as needed to meet business demand. During peak volume, scale up the serving pods; at normal volume, run the pods at the usual capacity.
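One common way to automate this is the Kubernetes Horizontal Pod Autoscaler, whose documented rule is roughly: desired replicas = ceil(current replicas × current metric / target metric). The sketch below is a simplified Python version of that rule, assuming an autoscaler of this kind is in use. It leaves out details such as tolerance and stabilization windows, and the function name and default bounds are illustrative.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Simplified autoscaling rule: scale in proportion to metric / target."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Never go below the floor or above the ceiling set for the deployment
    return max(min_replicas, min(max_replicas, raw))


# Peak volume: 4 pods at 90% CPU with a 60% target -> scale up
print(desired_replicas(4, 90, 60))                   # 6
# Quiet period: 4 pods at 15% CPU -> scale down, but keep at least 2
print(desired_replicas(4, 15, 60, min_replicas=2))   # 2
# Extreme spike: the result is capped at max_replicas
print(desired_replicas(5, 200, 50))                  # 10
```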

You can also define deployment strategies such as A/B, blue-green and canary deployments. These can help achieve zero-downtime deployment as well as enable champion-challenger strategies.
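The idea behind a canary or champion-challenger rollout is simply to send a small, stable share of traffic to the new model. In practice the split is usually configured in the routing layer rather than in application code, so the Python sketch below only illustrates the logic. The function name, the hashing scheme and the request IDs are assumptions made for this example.

```python
import hashlib


def route(request_id, canary_percent):
    """Pick 'challenger' for roughly canary_percent% of request IDs, else 'champion'."""
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return "challenger" if bucket < canary_percent else "champion"


# The same ID always lands on the same model, so a user sees consistent results
assert route("user-42", 10) == route("user-42", 10)

# Roughly 10% of a large set of IDs reaches the challenger
ids = [f"user-{i}" for i in range(10_000)]
share = sum(route(i, 10) == "challenger" for i in ids) / len(ids)
print(f"challenger share: {share:.1%}")
```

Raising `canary_percent` step by step moves more traffic to the challenger, and setting it back to 0 is an immediate rollback to the champion.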

Data Driven Investor

from confusion to clarity, not insanity

Written by Srivatsan Srinivasan

Data Scientist | Data Engineer
