Enhancing Developer Experience and Machine Learning Workflows with MLOps

Tony Fontana
99P Labs
Mar 22, 2024

Written by: Tony Fontana and Luka Brkljacic

Are you curious about the practical application of MLOps? Or perhaps you’re interested in learning about the day-to-day experiences of a machine learning engineer? This blog delves into the current state of our machine learning platform and our team’s adoption of MLOps practices. Whether you’re a seasoned ML professional or just starting your journey, you’re sure to find valuable insights in the content that follows.

At 99P Labs we have been supporting data scientists and software engineers with a research platform since our team’s inception. Our software research goals are supported by a community of university partners and internal collaborators. To make working together simple and efficient, and to smooth onboarding, we offer a developer portal and a data science environment for our developers and scientists. Over the years we have deployed many versions of these platforms; if you want to read about one of the previous versions, you can check out this blog. We also have a blog describing our developer and data community that you can read here.

In the earlier version of our data science environment, the core aim was to bring all our developers and data scientists onto our platform. They had the freedom to develop Python code on their individual machines, subsequently submitting Spark jobs to our cluster to distribute workloads. However, this meant each developer managed their own coding environment, which proved detrimental to the overall user experience. Not only was the interface poor, but it also discouraged collaboration among team members.

The benefit of this system was that data scientists could submit heavy workloads to a cluster with far more capability than a local machine, meaning they were able to build large models and run big jobs. One drawback was converting Python jobs into Spark jobs (Spark is Java-based). This translation was an inefficient step, and we wanted to eliminate it in our next data science platform.

We had also grown our internal team’s capability and wanted to start building larger machine learning models and LLMs for our research projects. Our company, Honda Research Institute US, also has strong ML capability and engineers across the organization. We wanted to increase collaboration across the company and within our own team of ML engineers and data scientists. Our team wanted to take the DevOps-first mindset we use when building platforms and extend it into an MLOps mindset: creating a strong platform that enables machine learning engineers first. You can read more about the MLOps mindset in a blog I wrote previously, found here.

Our new developer experience and ML platform is supported by Kubeflow, an open source toolkit that allows data scientists and ML engineers to collaborate, experiment, deploy their own code, and serve ML models. Kubeflow is containerized and runs on Kubernetes, which makes it a perfect fit for our existing platform structure. This reduces the workload for the platform (DevOps, MLOps) engineers and makes the DS and ML engineers more self-reliant.

The ML engineers can create their own data pipelines in Python code and then deploy that code, in a distributed way, into containers in the Kubernetes cluster. Users come to the platform in their browser, write their scripts and applications, and use a Python library to add pipeline capability. This deploys containers on the backend that run the jobs and produce output. The output is saved to persistent storage, which models and notebooks can access later. The model is also served in its own container with an API endpoint that applications can use. This workflow lets us run models in a production environment quickly and securely.

This user experience is a great step forward from our previous environment. Previously, users could only write code locally and run scripts in a cluster; afterwards there was no direct output besides job completion, no persistent storage, and no way to access a created model. This new iteration with Kubeflow provides an entire end-to-end development, research, and production platform for machine learning projects. Here are some of the features of Kubeflow that our ML engineers appreciate:

⦁ Experimentation tools
⦁ Workload scale
⦁ Versioning + collaboration tools
⦁ Storage access in the cluster
⦁ Self service running of code and pipelines
⦁ Self service model deployment
⦁ Production environment

Besides having dynamic capability, we want our data scientists and ML engineers to have a good development experience when using anything on our platform. Let’s walk step by step through the new developer experience: starting, creating, training, and deploying a model on Kubeflow.

1. The machine learning engineers and data scientists reach Kubeflow in their browser, log in, and access their Python notebooks. If they are starting a new project, they create a new notebook and can begin by experimenting on their dataset, bringing it in through S3 or MinIO storage. At the start of any project, they’re usually not sure about the exact approach they’ll take. That’s why it’s key to have a space for quick trial-and-error: it allows the engineers to be self-sufficient, get started without needing access to any platform engineers, make mistakes, and find solutions quickly.

2. Once the DS/ML engineer has a good understanding of the structure of their project, which libraries they are going to use, and which methods for creating and training a model, they can start to break everything into functions. These functions can then be transitioned into components of Kubeflow Pipelines. Our ML engineers report that this process is a bit difficult and that the Infrastructure as Code library, kfp (Kubeflow Pipelines), has a learning curve. kfp allows the user to create data pipelines right from their Jupyter notebooks using an Infrastructure as Code process. The library is still being developed by the open source community, and we are excited for kfp to improve and support more features, as it is a bit complex currently. This is probably the biggest complaint from our ML engineers, but after working with our MLOps engineer they were able to figure out how to make it work.

3. After developing the ML pipeline, the user can run this process, which creates a pipeline object in Kubeflow. The pipeline then has a history of each run and each test. This is one of the main benefits of Kubeflow: the whole process does not require a platform engineer to deploy code, monitor the deployment, and make iterations. You can see which step in the pipeline failed, or tune parameters without going back into the code. Once the pipeline is created, anyone with access can rerun the process with different parameters to fit their own needs. Each step in the pipeline is deployed into its own container, which distributes the compute across your cluster, and certain processes can run asynchronously to speed things up. This is much more efficient than running your code on a normal VM, as it allows for very large scale and distribution.

4. After running the ML pipeline, the next step is serving the model using KServe. This deploys the model in Kubernetes, where it runs inside its own container and exposes an endpoint that applications can call. For example, if the model was trained for image classification, you could write an application that submits an image to the model and gets a classification and confidence level returned.
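As a sketch of what such a request looks like: KServe's V1 inference protocol wraps inputs in an `{"instances": [...]}` JSON body. The endpoint URL and model name below are illustrative only:

```python
import json

# one 32x32x3 image with all pixels at 0.5 (already normalized)
image = [[[0.5, 0.5, 0.5]] * 32] * 32

# KServe V1 protocol: a JSON body with an "instances" list
payload = json.dumps({"instances": [image]})

# with the `requests` library installed, the call would look like:
# resp = requests.post(
#     "https://kubeflow.example.com/v1/models/my-classifier:predict",
#     data=payload, headers={"Content-Type": "application/json"})
# resp.json() then contains a "predictions" list of class scores

decoded = json.loads(payload)
```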

This project is open source, which is exciting for us. Right now, we’re working closely with the Kubeflow community to report issues and offer support. Even though we see a lot of room for the project to grow and improve, we already find it really useful. It also has a number of other features that we didn’t cover in this blog post, just to keep things short.

Our team outlined three use-cases for getting familiar with and testing the new platform: Simple Image Classification, Telematics Data Clustering & Classification, 99P Blogs Dataset LLM Chatbot.

Use-case 1: Simple Image Classification

The first use-case uses a well-known image dataset, CIFAR-10, to create a simple image classifier using a Convolutional Neural Network (CNN). This dataset consists of 60,000 32x32 color images in 10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck), with 6,000 images per class. The goal for this use-case was to use our CNN model to correctly classify new images based on the training data. For example, given a new picture of a dog, the goal is for our model to correctly label that image as being of a dog.

To accomplish the aforementioned goal, we converted the following high-level steps into a Kubeflow pipeline. Each of these steps will be explained in further detail below:
⦁ Loading the dataset
⦁ Pre-processing the data
⦁ Defining the CNN model
⦁ Training & evaluating the data
⦁ Serving the model

The first step of the pipeline is very straightforward. We simply load the dataset from TensorFlow’s Keras datasets and split the data into training and testing sets. We then save the data as a kfp artifact so that we can pass the datasets along to other pipeline components.

The next step in our pipeline then takes the output artifacts from step one as inputs. This data is then normalized, and some logging metrics are recorded. These metrics will be accessible from the Pipelines UI. Finally, we save the processed data as kfp artifacts.
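A minimal sketch of this preprocessing step, using tiny random arrays in place of the real CIFAR-10 data:

```python
import numpy as np


def preprocess(images: np.ndarray, labels: np.ndarray):
    """Normalize pixel values to [0, 1] and flatten the label array."""
    x = images.astype("float32") / 255.0
    y = labels.reshape(-1)
    return x, y


# stand-in for the CIFAR-10 arrays (the real ones are 32x32x3 uint8 images)
imgs = np.random.randint(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)
labs = np.array([[3], [0], [9], [1]])

x, y = preprocess(imgs, labs)
```

Inside a pipeline, the normalized arrays would then be written out as kfp artifacts rather than returned in memory.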

The third step does not take in any input, but simply defines our CNN model and saves it as a kfp artifact to be passed along the pipeline.
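A model along these lines can be sketched with Keras. The exact layer sizes here are illustrative, not our production architecture:

```python
from tensorflow import keras
from tensorflow.keras import layers


def build_cnn(input_shape=(32, 32, 3), num_classes=10):
    """A small CNN for 32x32 color images with 10 output classes."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model


model = build_cnn()
```

In the pipeline, the uncompiled or compiled model definition is what gets saved as the kfp artifact for the training step.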

The fourth step of the pipeline is where most of the actual work happens. This component takes in the CNN model, processed training and testing sets, and user-defined parameters as input. It then uses these inputs to compile, fit, and evaluate the model. As part of this process, two metrics are saved which can be viewed from the Pipelines UI: a confusion matrix and the model’s loss and accuracy. Finally, the trained model is saved as an artifact.
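The confusion-matrix metric from this step can be computed in a few lines of NumPy; this sketch uses toy labels rather than our real evaluation output:

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows are true classes, columns are predicted classes."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm


y_true = [0, 1, 2, 2, 1]
y_pred = [0, 2, 2, 2, 1]

cm = confusion_matrix(y_true, y_pred, 3)
accuracy = np.trace(cm) / cm.sum()  # correct predictions / all predictions
```

In Kubeflow, a matrix like this is logged via the metrics API so it renders directly in the Pipelines UI.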

The final step in our pipeline takes the trained model as input and serves it using KServe. This creates an endpoint that can be queried with a simple JSON request, as well as viewed in the Endpoints UI.
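For reference, serving a trained model with KServe boils down to an InferenceService resource like the one below; the pipeline step creates this on our behalf. The name and storage URI are illustrative, not our actual deployment:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: cifar10-cnn
spec:
  predictor:
    model:
      modelFormat:
        name: tensorflow
      storageUri: "s3://models/cifar10-cnn"
```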

Use-case 2: Telematics Data Clustering & Classification

The second use-case built on the pipeline framework described in use-case 1. This time, we used synthetically generated telematics data (generated from real Honda driving data) to cluster drivers based on their driving behavior as defined by six unique driving-related features (speed, longitudinal acceleration, latitudinal acceleration, steering angular velocity, brake input pressure, acceleration pedal position). These clusters were then added to the data as labels. This labeled dataset was then split into training, testing, and validation sets and used to train a tensorflow classifier neural network. As in the first case, the trained model was served using KServe and queried with a simple JSON request.
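The cluster-then-classify flow can be sketched with scikit-learn on random stand-in data. We actually trained a TensorFlow neural network classifier; logistic regression is substituted here just to keep the sketch light:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(42)
# stand-in for the six telematics features (speed, accelerations, etc.)
features = rng.normal(size=(300, 6))

# step 1: cluster drivers by behavior; cluster ids become the labels
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(features)
labels = kmeans.labels_

# step 2: train a supervised classifier to predict the cluster label
clf = LogisticRegression(max_iter=500).fit(features, labels)
preds = clf.predict(features)
```

The served classifier can then assign new drivers to a behavior cluster from a single JSON request, just as in the first use-case.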

Use-case 3: 99P Blogs Dataset LLM Chatbot

The 99P blogs chatbot was developed to be deployed on our platform, not as a Kubeflow pipeline. It is currently operational, and you can view it here. Our intention with the third kfp use-case is to take this LLM application and transform it into a Kubeflow pipeline. This work has begun and uses the same basic framework described in the first two use-cases; however, due to the complex nature of the LLM application, and the fact that it was not built with kfp in mind, it is going much more slowly than the first two use-cases. Still, we are confident that we will soon finish this final use-case and round out our first-round testing of Kubeflow.

ML Engineer’s perspective on Kubeflow

I came into Kubeflow with no prior knowledge or experience of the product. Given this, the initial learning curve was very steep, and it took about a week of detailed digging through documentation, forums, and trial-and-error to finally get comfortable with Kubeflow Pipelines (kfp) and the development workflow in general. However, once I became comfortable and everything “clicked”, the experience was fairly smooth overall.

First, let’s get into the pain points. The biggest troubles I had working with kfp involved things that will be fixed in the near future, so I will not go into great detail about them. These included having to go through DevOps for debugging, issues related to kfp versioning, and no GPU access. Other than these issues, I didn’t experience any big roadblocks. There are definitely things I wish were different, such as how metrics and visuals are displayed in the Pipelines UI and the convoluted, proprietary Artifacts framework. These things made my development experience slightly more painful and longer, but ultimately did not block any movement toward our end goal. Next, let’s talk about what went well.

In my opinion, the biggest strength of Kubeflow is that everything is in one place. You can spin up a Jupyter notebook with as many resources as you need, do your development in that notebook, run and schedule pipelines, then view those pipelines and their individual components within the Pipelines UI. Once everything looks good, you can go back to your notebook, serve and query the model, and once again be able to view it in the Endpoints UI. The fact that you can switch between a coding view and a visual representation of your code is super helpful in the development process. Additionally, being able to organize projects using Experiments and Pipeline versions is a huge win as it allows for very quick and easy iterative development.

Overall, I think Kubeflow — and particularly kfp — can be a great tool given some initial time investment and patience. The updates that are coming soon should also greatly improve the user experience, increasing the value proposition for teams even more.

Conclusion

Our team’s experience with Kubeflow has been positive, and we are excited to continue leveraging the platform for our machine learning research and development efforts. The ability to create end-to-end ML pipelines, from data ingestion to model serving, has been a game-changer, empowering our data scientists and ML engineers to be more self-sufficient.
