Why Your Data Team Needs MLOps

And How to Create an MLOps Platform

Tony Fontana
99P Labs
5 min read · Dec 19, 2023

--

At 99P Labs, one of our research areas is platform engineering and software development capability, including DevOps and MLOps. Every year we iterate on versions of a platform that can support and enable all types of data scientists and software engineers. To follow along with our progress, you can read our blogs about platform engineering and our data science environment.
Our preexisting data science environment shares a lot of similarities with a modern MLOps platform, one that can enable Machine Learning engineers and data scientists alike. But after attending MLOps World in Austin, Texas (read the write-up of the experience here), I got to brainstorm how our team can expand its capabilities to help ML Engineers across the entire company become Machine Learning ready: to not worry about infrastructure or compute so they can focus on writing ML code.

MLOps: Code vs. Infrastructure

This picture is a great illustration of the importance of ML code compared to the importance of the infrastructure for deploying that code. A Machine Learning model can only be as good as the infrastructure surrounding it, supporting the development, experimentation, and “productionization” of the model.

As our team’s capability grows past data science and local Machine Learning development, we want to develop the gold standard of Machine Learning infrastructure and processes to create efficiency within our team and with our research partners. Right now, time is wasted creating a new pipeline for every single experiment and running everything without automation. If every team we partner with and all of our in-house engineers can speak the same standardized pipeline language, then projects can kick off more quickly, with much less confusion and effort required.

We want our Machine Learning engineers to focus on code and models, not the supporting platform and infrastructure. This is the same concept as DevOps and platform engineering; applying it to Machine Learning takes everything a step further.

DevOps and MLOps both go a little something like this:

  • Developer creates a script / application / code
  • Developer pushes code to git repository
  • Code automatically runs through a build pipeline
  • Code executes some process and has an output (artifacts) — This could be a model being created or an application compiled
  • Some final action — This could be the model being deployed or application running
  • Continuous iterations of this process
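To make the artifact step above concrete, here is a minimal sketch of the kind of training script a build pipeline might check out and run. Every name here (the file paths, the toy mean-predictor "model") is hypothetical; the point is that the script's only contract with the pipeline is the artifact it writes.

```python
# train.py -- a toy "model build" step a CI pipeline might run.
# All names here are hypothetical; the script's only contract with
# the pipeline is the artifact file it writes out.
import json
import pickle
from pathlib import Path

def train(samples):
    """'Train' a trivial model: predict the mean of the observed values."""
    mean = sum(samples) / len(samples)
    return {"type": "mean-predictor", "mean": mean}

def main(out_dir="artifacts"):
    model = train([2.0, 4.0, 6.0])
    out = Path(out_dir)
    out.mkdir(exist_ok=True)
    # The serialized model is the build artifact the pipeline archives.
    with open(out / "model.pkl", "wb") as f:
        pickle.dump(model, f)
    # A small metadata file lets later stages (deploy, audit) inspect the run.
    (out / "metadata.json").write_text(json.dumps({"samples": 3}))
    return model

if __name__ == "__main__":
    main()
```

A "final action" stage would then pick up `artifacts/model.pkl` and deploy it, and the loop repeats on every push.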

Our existing platform is capable of this process in a DevOps sense: we can automatically build and run applications, and we can even run Spark and Python jobs this way. You could technically create models with this process, but it is not as clean and does not come with as many features.

There are toolkits that can extend a platform’s capability to become Machine Learning and MLOps ready. We decided to use Kubeflow but have looked at multiple projects, including MLflow. Kubeflow is an open source MLOps toolkit that integrates directly with Kubernetes infrastructure. It deploys and manages its tooling as containers, making it easier for users to access applications without having to know how the tooling runs. On the other hand, MLflow is a Python program that is agnostic of Kubernetes. While this simplicity provides benefits, it can also be a drawback for organizations that rely on Kubernetes infrastructure.

Kubeflow v. MLflow

Kubeflow integrates directly into our Kubernetes-based platform, and everything runs as a container with complete scalability and simple management. We also build all our applications and jobs as microservices and break everything into the smallest components we can. Starting from such a strong base of Kubernetes, containerization, and microservices around 5 years ago has made adopting new processes and new applications very simple. That is one strong recommendation I would make to anyone starting MLOps: focus on creating a strong software platform first, which does not have to be complicated. Here are some of our recommendations:

  • GitOps Mindset — Version everything
  • Git repositories (We use Gitea)
  • CI/CD Process (We use Argo CD and Jenkins)
  • Kubernetes platform (Scale, open source, easy management)
  • Infrastructure as code (We use Terraform)

Since we had this infrastructure developed, we were able to deploy Kubeflow on our platform and integrate it with our identity management fairly simply. Let me now discuss what Kubeflow brings to our already mature platform:

  • Automation of jobs
  • Its own pipeline framework (KFP), which lets Python users simply turn their scripts into automations and store them as infrastructure as code (let's go KFP!!)
  • A collaborative environment, which lets Data Scientists and Machine Learning Engineers work together on scripts before models are created. Very nice for experimentation.
  • A user interface, which gives new and experienced users alike control over their automation, scripts, models, deployments, etc.
  • Automatic resource allocation: the people writing the ML code do not need to worry about what resources they have available; they can just submit jobs, because the Kubernetes cluster and the MLOps (platform) engineers have already configured what resources they can use.
  • Model serving, which makes it very easy to go from model creation scripts and training to deploying a model into production (all from a UI, might I add).
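To show the shape of such a pipeline without assuming a cluster, here is a stdlib-only sketch of the component-and-pipeline idea that KFP formalizes: each step is a plain Python function, and the pipeline function wires outputs to inputs. In the real KFP SDK these functions would be decorated and compiled to YAML for the cluster to run; every name below is illustrative, not the actual KFP API.

```python
# A stdlib-only sketch of the component-and-pipeline pattern that
# Kubeflow Pipelines (KFP) formalizes. Each function below stands in
# for a containerized component; the pipeline function wires them up.
# Every name here is illustrative, not the real KFP API.

def load_data():
    # In KFP this component might pull a dataset from object storage.
    return [("a", 1.0), ("b", 2.0), ("c", 3.0)]

def preprocess(rows):
    # Drop labels, keep features; a real step might clean and split.
    return [value for _, value in rows]

def train(features):
    # Stand-in for model training: fit a trivial mean predictor.
    return {"mean": sum(features) / len(features)}

def evaluate(model, features):
    # Report a toy metric so a later "deploy" gate has something to check.
    error = sum(abs(x - model["mean"]) for x in features) / len(features)
    return {"mae": error}

def pipeline():
    # In KFP, a pipeline function declares this same dataflow, and the
    # SDK compiles it into a spec that the cluster executes step by step.
    rows = load_data()
    features = preprocess(rows)
    model = train(features)
    metrics = evaluate(model, features)
    return model, metrics

model, metrics = pipeline()
```

The value of the real framework is that each step runs as its own container with its own resources, and the whole graph is versioned, rerunnable, and visible in the UI.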

In the next year, we hope to enable Data Scientists and grow them into Machine Learning Engineers. Using the same concepts, we want to push Platform Engineers into a combined DevOps + MLOps role, where they enable both Machine Learning and software equally.

The line is becoming blurred between what an ML Engineer should do and what a Data Scientist should do, but modern platforms enable the entire process and do not silo or box one into a certain role. On Kubeflow, the experimentation environment allows both roles to work together to create scripts, modify and clean data, and create tests. The ML Engineer and MLOps Engineers can then work together to create infrastructure-as-code scripts for reusable pipelines and model deployment. None of these tasks must be tied to a specific role: Kubeflow allows you to control who has access to what process, so you could let a Data Scientist become more familiar with the process and give them access to deployments and pipelines, for example. MLOps really comes down to letting ML Engineers and Data Scientists work on their models and code without having to worry about what infrastructure and resources they have access to.
