Architecting MLOps: Decoding Our Path to Success in Choosing our MLOps Framework

Published in Inside Doctrine · 8 min read · Jul 3, 2023

by Aimen Louafi, David Huang and Ysé Wanono

Introduction

Artificial Intelligence is at the heart of Doctrine’s business. We have developed over 30 Machine Learning (ML) models over the years, which are still running in production today. However, our journey was not without obstacles. Due to technical debt, building or iterating on our ML models was slow and cumbersome. Some tasks were performed manually.

To overcome this issue, we created a Machine Learning Lifecycle Task Force (MLTF) in January 2022 (See the blog post about it here). This task force consisted of three members from different squads, including two Machine Learning Engineers (MLE) and one Data Engineer (DE). To ensure that the task force was efficient, we committed 20% of our time, approximately five days per month.

The task force lasted a year and a half and during this time, we developed a methodology and gained valuable insights. We learned that the key to successful ML model building and iteration is a streamlined process that includes automation of manual tasks.

In this blog post, we will present our learnings, our methodology and the architecture we have chosen for our Machine Learning projects.

Gathering feedback and pain points

Before exploring and testing various tools, we first decided to gather feedback from Machine Learning Engineers and clearly identify the pain points of our current pipelines. This helped us focus on the main issues we face daily when working on Machine Learning projects.

We interviewed them about the main pain points they encounter while working on a Machine Learning project at Doctrine.

After the interview process, we had a better understanding of the ML Lifecycle.

We consolidated the feedback into a spreadsheet and identified the most important pains to tackle for each step of the project lifecycle, i.e., the must-haves:

  • Dataset building: no dataset versioning strategy
  • Annotation: hard to set up, no multi-user handling, no consensus strategy when there are multiple annotations
  • Experimentation: trouble accessing GPUs, reproducibility needs to be improved, model comparison is difficult (often done manually), no hyperparameter logging
  • Production release: unclear deployment strategy for online deep learning models at scale, no way to easily benchmark a model before going into production
  • Monitoring: various tools are used, no drift measurement, lackluster alerting system

To get a clearer vision of the current situation, we listed all our production machine learning models. We rated each model based on its pipeline’s maturity and features already available (monitoring, benchmarks, access to the training script, …).

We sorted the models by machine learning problem-solving type (Named Entity Recognition, classification, etc.), the library used to build them (PyTorch, scikit-learn, etc.), and their business context.

State of the art and tested tools

We spent some time listing the many existing tools for every step of the lifecycle.

Of course, there are plenty of choices and new-comers appear every day. Later in the article, we will list the tools we tested in 2022.

We observed that most automation tools are quite new and have a small community and user base. Gathering feedback or support is harder on emerging projects.

Some tools advertise themselves as AI platforms: they aim to manage every aspect of the lifecycle. But we found them too generic compared to tools focused on a single task of the chain.

Dedicated tools focus on a specific step of the ML lifecycle and are usually very comprehensive. However, this can lead to an excessive number of tools, each billed as a managed service.

AI platform

We assessed two AI platforms: Google Vertex AI and AWS Sagemaker.

Vertex AI: The ML lifecycle solution we evaluated included some convincing tools such as Hyperparameter Tuning Job and Endpoint deployment. However, we ultimately decided against this solution because migrating everything to the platform would be a significant undertaking. Indeed, Vertex strongly relies on GCP infrastructure (BigQuery, Cloud Storage, Google Registry), while our infrastructure is primarily on AWS.

Sagemaker: We found their solution for Hyperparameter Tuning Jobs not very intuitive. We suffered from the lack of documentation and the small developer community. Furthermore, using this platform can easily skyrocket costs (e.g., a persistent GPU instance on Sagemaker Studio or Sagemaker Real Time Inference). We decided not to go with the exhaustive Sagemaker platform, but we still retained a portion of AWS Sagemaker services that offer a managed and scalable infrastructure.

We decided to use specialized tools that are tailored for each step of the lifecycle, as they provide more comprehensive features.

Specialized tools we tested:

  • Annotation tools: Label Studio, Kili Technology, Prodigy
  • Artifact Management, Experiment Tracking: MLflow, Weights and Biases, DVC
  • Training: KubeFlow, Ray, Vertex Training, Sagemaker Training
  • Model Serving: BentoML, Seldon, Sagemaker Inference.
  • Monitoring: Aporia, Cloudwatch, Datadog

Our new MLOps architecture

We ended up with this final MLOps architecture at Doctrine.

There are 7 main blocks:

  • “Data labelling” for annotating
  • “Exploration” for data science experimentation
  • “Hyperparameters tuning” to find the best hyperparameter combination
  • “Experiment tracking” to track and store all metadata and artefacts about a specific training
  • “Deployment” for model consumption in production
  • “Monitoring” for model performance and data drifting
  • “Artefact Management” to store datasets and models

As you can see, to tackle the issues our MLEs were facing, we tried not to reinvent the wheel and mainly relied on open source solutions. Some of them are among the most popular MLOps solutions used in the community: MLflow, Label Studio.

During the whole MLOps tools benchmarking, we had the same drivers that helped us retain or reject a solution:

  • Refer to the pains reported by MLEs during the interview phase. Does the chosen solution address each pain point identified at a given stage of the lifecycle? Is it adapted to our use case? Are all functionalities really useful?
  • Validate the integration of the solution into the existing stack and processes. The tool was rejected when the code refactoring was deemed too significant compared to the benefits. Conducting a proof of concept on a real use-case greatly assisted us in making a decision.
  • Take into account the company’s resources, both financial and human. For example, Label Studio offers an enterprise version hosted on their cloud at a much higher price than the cost associated with our own cluster. At this point in time, we thought the features of the paid version didn’t justify the extra cost for our usage. Regarding human resources, we opted for solutions that would be easy for MLEs to use, i.e. requiring little or no infrastructure knowledge, in order to facilitate adoption. Promising solutions, such as Ray for hyperparameter tuning, were discarded as they brought with them the underlying complexity of Ray cluster management.

You will find below an overview of the solutions selected at each step of the machine learning lifecycle.

Data annotation

Label Studio

Label Studio is a tool that convinced us thanks to its intuitive interface, its numerous annotation templates for different use cases, and its ease of use for campaigns with multiple annotators.

We chose the Community version for cost reasons. One of the main features missing from this version is user management, which could be very useful in the case of an outsourced campaign.
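
For illustration, here is a minimal sketch of how an annotation project can be created programmatically, assuming the legacy label-studio-sdk Python client; the URL, API key, project title, labels and sample task are placeholders, not our actual setup.

    from label_studio_sdk import Client

    # Placeholders: URL of a self-hosted Label Studio instance and a user API key.
    ls = Client(url="http://label-studio.internal:8080", api_key="YOUR_API_KEY")
    ls.check_connection()

    # A standard Named Entity Recognition labelling configuration.
    label_config = """
    <View>
      <Labels name="label" toName="text">
        <Label value="PERSON"/>
        <Label value="ORGANIZATION"/>
      </Labels>
      <Text name="text" value="$text"/>
    </View>
    """

    project = ls.start_project(title="ner-campaign-demo", label_config=label_config)
    project.import_tasks([{"text": "Doctrine was founded in Paris."}])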

Exploration

Sagemaker Studio

For exploration purposes, we chose Sagemaker Studio which allows us to easily access several (AWS managed) CPU and GPU instances. We also made a custom Docker image available on notebooks to import custom functions.

Some downsides are the instance start-up time, which can take up to a dozen minutes, and the difficulty of setting up notebook auto-shutdown.

Hyperparameter tuning

Prefect

We set up a Prefect orchestrator to launch many independent trainings on Kubernetes nodes. We are aware that Prefect is not widely used for hyperparameter tuning. However, there are two main reasons for this choice (a sketch of the orchestration pattern follows the list):
- Some internally-developed models require several hours of training and significant compute resources, which makes mono-instance solutions inefficient.
- We already have a Prefect orchestrator deployed internally. Having the right infrastructure to do hyperparameter tuning at scale was probably the most painful part for us.
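
As a rough illustration, here is a minimal Prefect 2 flow that fans out one training task per hyperparameter combination. The grid, metric and train_once task are hypothetical, and the Kubernetes execution itself depends on how the work pool and infrastructure are configured, which is omitted here.

    from itertools import product

    from prefect import flow, task


    @task
    def train_once(params: dict) -> dict:
        """Hypothetical task: run one training with the given hyperparameters
        and return its evaluation metrics (placeholder value here)."""
        # ... call the real training code here ...
        return {"params": params, "f1": 0.0}


    @flow
    def hyperparameter_search() -> dict:
        # Hypothetical search grid.
        grid = [
            {"learning_rate": lr, "batch_size": bs}
            for lr, bs in product([1e-5, 3e-5, 5e-5], [16, 32])
        ]
        # Each mapped task is an independent run that the Prefect infrastructure
        # (e.g. a Kubernetes work pool) can schedule on its own node.
        futures = train_once.map(grid)
        results = [future.result() for future in futures]
        return max(results, key=lambda r: r["f1"])


    if __name__ == "__main__":
        best = hyperparameter_search()
        print(best)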

Experiment tracking

MLflow

We use MLflow to track each experiment. Just like Label Studio, we deployed the open source version on our Kubernetes cluster. At each MLflow run (equivalent to one model training), we store the model, hyperparameters, metrics, the training script source and the dataset, for reproducibility purposes.
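
As a sketch of what one such run looks like, here is a toy example with a small scikit-learn model; the tracking URI, experiment name and artifact file paths are placeholders, and the logged files are assumed to exist locally.

    import mlflow
    import mlflow.sklearn
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression

    # Placeholders: URI of the self-hosted tracking server and an experiment name.
    mlflow.set_tracking_uri("http://mlflow.internal:5000")
    mlflow.set_experiment("demo-classifier")

    params = {"C": 0.5, "max_iter": 200}
    X, y = load_iris(return_X_y=True)

    with mlflow.start_run():
        mlflow.log_params(params)                                # hyperparameters
        model = LogisticRegression(**params).fit(X, y)
        mlflow.log_metric("train_accuracy", model.score(X, y))   # metrics
        mlflow.log_artifact("train.py")                          # training script source (assumed to exist)
        mlflow.log_artifact("dataset_v3.parquet")                # dataset snapshot (assumed to exist)
        mlflow.sklearn.log_model(model, artifact_path="model")   # the model itself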

So far, we are mainly satisfied with it. However, we are missing a role-based access control system for certain features, such as the model registry.

Deployment

Sagemaker Endpoints were the solution we retained for endpoint deployment. One of the major advantages was the seamless MLflow integration, which makes deployment possible with a few lines of code, without having to manage infrastructure in the MLE’s workflow. Our model is logged as an artifact on MLflow using the pyfunc format. Pyfunc is a “model flavour” that helps serve models in a unique, generic format, regardless of the Python ML library being used under the hood.
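
To give an idea of the pattern, here is a minimal pyfunc wrapper and a deployment call, assuming a recent MLflow version that exposes a SageMaker deployment client. The class, endpoint name, region and role ARN are placeholders, and the exact config keys may vary across MLflow versions.

    import mlflow
    import mlflow.pyfunc
    from mlflow.deployments import get_deploy_client


    class MyModel(mlflow.pyfunc.PythonModel):
        """Hypothetical wrapper: the underlying ML library stays hidden behind pyfunc."""

        def load_context(self, context):
            # Load weights/tokenizer from the artifacts logged with the model.
            self.model = None  # placeholder

        def predict(self, context, model_input):
            # Return one prediction per row of the input DataFrame.
            return ["placeholder"] * len(model_input)


    # 1. Log the model with the generic pyfunc flavour.
    with mlflow.start_run() as run:
        mlflow.pyfunc.log_model(artifact_path="model", python_model=MyModel())
        model_uri = f"runs:/{run.info.run_id}/model"

    # 2. Deploy it as a Sagemaker real-time endpoint (placeholder config values).
    client = get_deploy_client("sagemaker")
    client.create_deployment(
        name="my-endpoint",
        model_uri=model_uri,
        config={
            "region_name": "eu-west-1",
            "execution_role_arn": "arn:aws:iam::123456789012:role/sagemaker-role",
            "instance_type": "ml.m5.large",
            "instance_count": 1,
        },
    )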

A variety of instances with CPUs or GPUs is available with AWS Sagemaker Endpoints, as well as managed autoscaling.

However, costs can quickly increase with an instance running 24/7 for a real-time inference endpoint, which is why we chose to deploy only a few endpoints. Another way to save costs is to consider other AWS solutions, like Async Inference or Multi-Model Endpoints. However, these are not supported with MLflow pyfunc yet.

Monitoring

Datadog

There are two ways to monitor models: infrastructure monitoring and model performance monitoring.

For the first one, we rely on the official Sagemaker integration with Datadog to automatically send infrastructure logs to the Datadog application.

For the second one, we use the Datadog Alerting service to receive notifications in case of anomalies, such as data drift or prediction drift.
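
For example, a prediction service can push custom model metrics through DogStatsD, and monitors/alerts can then be defined on those series. The metric names and tags below are placeholders, and this assumes a Datadog agent is reachable from the service.

    from datadog import initialize, statsd

    # Assumes a locally reachable Datadog agent (DogStatsD on its default port).
    initialize(statsd_host="localhost", statsd_port=8125)

    def report_prediction(model_name: str, score: float, latency_ms: float) -> None:
        """Send placeholder model metrics; alerts (e.g. on prediction drift)
        are then configured on these series in Datadog."""
        tags = [f"model:{model_name}"]
        statsd.histogram("ml.prediction.score", score, tags=tags)
        statsd.histogram("ml.prediction.latency_ms", latency_ms, tags=tags)
        statsd.increment("ml.prediction.count", tags=tags)

    report_prediction("ner-model", score=0.87, latency_ms=42.0)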

Artefact Management

For model and dataset management, we simply use an S3 bucket with some naming conventions to organize folders. We chose this option primarily for its simplicity. Our need for model retraining is pretty low, so models are not updated frequently, and we don’t need anything beyond the S3 bucket versioning options.
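
As an illustration of such a convention, here is a boto3 sketch with a hypothetical bucket name and key layout (models/<model_name>/<version>/… and datasets/<dataset_name>/<version>/…):

    import boto3

    s3 = boto3.client("s3")
    BUCKET = "doctrine-ml-artifacts"  # hypothetical bucket name

    def upload_model(model_name: str, version: str, local_path: str) -> str:
        """Upload a model file under a simple naming convention."""
        key = f"models/{model_name}/{version}/{local_path.rsplit('/', 1)[-1]}"
        s3.upload_file(local_path, BUCKET, key)
        return f"s3://{BUCKET}/{key}"

    def upload_dataset(dataset_name: str, version: str, local_path: str) -> str:
        """Upload a dataset file under the same convention."""
        key = f"datasets/{dataset_name}/{version}/{local_path.rsplit('/', 1)[-1]}"
        s3.upload_file(local_path, BUCKET, key)
        return f"s3://{BUCKET}/{key}"

    # Example usage (paths are placeholders):
    # upload_model("ner-model", "v3", "./model.tar.gz")
    # upload_dataset("ner-dataset", "v3", "./dataset_v3.parquet")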

Next steps

After a year and a half of benchmarking and implementing MLOps tools at Doctrine, we can finally begin integrating them into our machine learning workflows. We are confident that this integration will help us improve our ML practices. It is important to keep in mind that this delivery was a first step and only contains the main features, not all of them. There are still improvements to make, and we are aware that this is continuous work, meaning that if our needs change, the tooling also has to change.
