Should Data Science teams use Kubernetes? Hell no!

Data science teams should focus on analysing data and building models, not infrastructure management.

Photo by Lucas van Oort on Unsplash

Kubernetes is great!

  1. “Kubernetes is a future proof solution.” Because it is super cool to say “future proof”. Nobody knows what the future looks like, but everyone seems to be sure that Kubernetes will be part of it. I have to agree that the trend indicates that Kubernetes is the way, at least until something cooler is invented.
  2. 99.9999999999999999% availability. Maybe I got carried away with the 9s, but there are plenty of them. It is not 100% because it will fail when you need it the most. Just kidding: a good implementation of K8s can ensure almost flawless reliability for your application.
  3. Cost reduction, aka scale to zero. This is a super cool feature that helps you save on infrastructure costs when the application does not need computational power!

It is indeed a great tool! And looking at the pros, it seems to make a lot of sense to use it for machine learning: both Kubernetes and machine learning are tech jargon used to describe the future, so it makes sense to have them together. Seriously, auto scaling for time-consuming and compute-intensive processes like model training? Makes total sense to use it.

But should data science teams start working with it?

Photo by Caleb Russell on Unsplash

Kubernetes is not so great

  1. Kubernetes is the new oil. Oh wait, that’s data! Then why are data scientists spending tons of time on Kubernetes and not on the data?
  2. Kubernetes is a great tool and it is becoming the new standard for cloud applications. It is so cool that it is being used not only for software applications but also for machine learning. What we shouldn’t forget is that building software applications and building machine learning models are not the same thing.
  3. Kubernetes has plenty of components, processes, services, subsystems, jobs, code… That’s great for someone who is an expert on the subject, but for data scientists it means plenty of risks and problems. Not to mention all the routing and networking services, reverse proxies, and so on!
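To make the third point a bit more concrete, here is a minimal sketch of what a single training run can look like when described as a Kubernetes Job. It is written as a Python dict that mirrors the usual YAML manifest. The job name, image, command and resource values are hypothetical placeholders, and `count_settings` is just an illustrative helper, not part of any Kubernetes tooling.

```python
# Hypothetical minimal Kubernetes Job for one training run, written as a
# Python dict that mirrors the YAML manifest. The name, image, command and
# resource values are placeholders chosen for illustration only.
training_job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "train-model"},
    "spec": {
        "backoffLimit": 2,  # retries before the Job is marked as failed
        "template": {
            "spec": {
                "restartPolicy": "Never",
                "containers": [
                    {
                        "name": "trainer",
                        "image": "registry.example.com/team/trainer:latest",
                        "command": ["python", "train.py"],
                        "resources": {
                            "requests": {"cpu": "2", "memory": "8Gi"},
                            "limits": {"cpu": "4", "memory": "16Gi"},
                        },
                    }
                ],
            }
        },
    },
}


def count_settings(node):
    """Count the leaf values (individual settings) in a nested manifest."""
    if isinstance(node, dict):
        return sum(count_settings(value) for value in node.values())
    if isinstance(node, list):
        return sum(count_settings(item) for item in node)
    return 1


if __name__ == "__main__":
    print("Settings to get right:", count_settings(training_job))
```

Even this stripped-down version leaves out building and publishing the container image, storage for the training data, access credentials, and any networking, which is where much of the extra work tends to come from.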

What are we missing here?

The assumption that a great tool for data science should also be operated by data scientists. Kubernetes is an infrastructure tool; hence, it should be used by infrastructure specialists.

There’s a new movement around this topic, and it is called MLOps. I joined a community that is taking the first steps in the movement (MLOps.community), and surprisingly it grew from a bunch of people (~40) trying to figure out how to make Machine Learning processes simpler into a strong community (~900) sharing knowledge and creating webinars and tools. It is agreed that the skills needed to scale ML using Kubernetes are very different from the ones needed to build models. However, there is not yet a common standard for the role of these people in a company: some are called Machine Learning Engineers, others have kept their previous role of Infrastructure Engineer, DevOps or SRE, and some are innovating by being named “AI Infrastructure Engineers”.

To give you an idea of how complex it is to create an MLOps culture and processes, there are companies specializing in it. Yes, now that you are thinking of delegating this job to that one person, consider that there are entire companies, maybe bigger than yours, working on this. MLOps platforms, AI platforms, DataOps platforms and Data Science platforms are all similar and focus on solving scalability for data science, among other technicalities.

Think carefully before deciding between building internally and buying.

Conclusion

Data scientists are spending tons of time dealing with containerization and Kubernetes in general. This is not a task for data scientists; it belongs to a new, specialized, infrastructure-related role that is emerging. Let data scientists work in their own field and create value for the company, instead of having them deal with issues outside their scope.

Gonçalo Martins Ribeiro is Co-Founder and CEO at YData.

Improved data for AI

YData provides a privacy by design DataOps platform for Data Scientists to work with synthetic and high quality data.

