Our ML Ops platform reduces development time significantly. Here’s how.

ANWB data-driven
Published Jan 20, 2023 · 11 min read

Introduction

Imagine that your organization has a well-functioning data science team. However, you notice that a growing fraction of your data scientists' time is spent on deployment and model maintenance in production. Furthermore, you see an increasing number of incidents caused by degraded model performance (e.g., due to data drift), which demands more precious time from your data scientists to fix. This is time that could have been spent building new models and exploring new use cases. What is the solution? Automating repetitive and time-consuming tasks with a Machine Learning Operations (ML Ops) platform.

So what are Machine Learning Operations?

The term Machine Learning Operations describes the entire process around creating and publishing data science models. With a Machine Learning Operations platform, the aim is to standardize and automate as many steps as possible in creating data science models and getting their outputs to the stakeholders.

About two years ago, our data science team saw the first signs of losing development time to maintaining and monitoring models in production. Thus, we began our journey towards a well-functioning ML Ops platform. We started with an initial list of questions:

  • What do we need to deploy and maintain Machine Learning models in production?
  • Which platform can help us achieve these goals?
  • How can we further improve the platform to better fit our data scientists’ needs?

After working through this list of questions, we ended up with ACEmaker, our Python wrapper around AWS Sagemaker that we use to develop, deploy, and maintain our Machine Learning models. It is named after our data science team, the Analytics Center of Excellence (ACE). In this blog, we will tell you about why we chose to build ACEmaker, how it works, and what the data scientists at ACE think about the platform.

What do we need to deploy and maintain Machine Learning models in production?

First, we identified four main phases in the data science development pipeline for which an ML Ops platform can provide support. The first two phases support the data scientists with model development, and the latter two provide support with deployment and maintenance. You might imagine that the latter two phases, deployment and maintenance, are the bread and butter of an ML Ops platform. While this is true, there are certainly benefits to be gained from including the first two phases in your ML Ops pipelines as well.

In the Design phase, a data scientist explores and transforms the data to experiment with different models. In this stage, it is helpful to provide the best possible development environment and support automatic experiment tracking.
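Jumping ahead to the tooling we eventually chose (more on that below), a minimal sketch of what automatic experiment tracking can look like with the SageMaker Experiments SDK is shown here; the experiment name, run name, parameters, and metric values are all made up for illustration and are not taken from ACEmaker:

# Hedged sketch: logging parameters and metrics with SageMaker Experiments.
# All names and values below are hypothetical placeholders.
from sagemaker.experiments.run import Run

with Run(experiment_name="traffic-forecast-poc", run_name="baseline-gradient-boosting") as run:
    run.log_parameter("max_depth", 6)            # record a hyperparameter
    run.log_parameter("learning_rate", 0.1)
    # ... train and evaluate the model here ...
    run.log_metric(name="validation:rmse", value=3.2)   # record a result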

In the Test & Train phase, the data scientist has chosen an appropriate model, which will be trained and tuned to get the best performance. Here, we can help the data scientist the most by automating pipeline creation, hyperparameter tuning and instance scaling. This ensures hassle-free development.
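To give a feel for the kind of boilerplate this automation hides, below is a rough sketch of automated hyperparameter tuning written directly against the SageMaker Python SDK. The role ARN, region, and S3 paths are placeholders, and this is an illustration of the underlying SDK rather than a snippet from our own wrapper:

# Hedged sketch: hyperparameter tuning with the SageMaker Python SDK.
# Role ARN, region, and S3 locations are hypothetical placeholders.
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.tuner import HyperparameterTuner, ContinuousParameter, IntegerParameter

role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"
image_uri = image_uris.retrieve("xgboost", region="eu-west-1", version="1.5-1")

xgb = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://<bucket>/model-artifacts/",
)

tuner = HyperparameterTuner(
    estimator=xgb,
    objective_metric_name="validation:rmse",
    objective_type="Minimize",
    hyperparameter_ranges={
        "max_depth": IntegerParameter(3, 10),
        "eta": ContinuousParameter(0.01, 0.3),
    },
    max_jobs=20,
    max_parallel_jobs=4,
)

# Each channel points to prepared data in S3; SageMaker scales the instances per job.
tuner.fit({"train": "s3://<bucket>/train/", "validation": "s3://<bucket>/validation/"})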

In the Deployment phase, the trained model is made available to the key stakeholder. Depending on their requirements, the model output could be added as an extra column in a database or served as a recommended piece of content on a website. Both batch and streaming solutions need to be supported, e.g., through an API or a direct database call. Furthermore, we want to keep track of the different versions of the models we create through a model registry, so we can analyze which changes were made over time and what the impact of those changes was on model performance.
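For the streaming case, serving a single prediction ultimately boils down to calling a deployed SageMaker endpoint. The sketch below shows what that call looks like with boto3; the endpoint name and payload format are hypothetical:

# Hedged sketch: calling a deployed SageMaker endpoint with boto3.
# Endpoint name and payload are hypothetical placeholders.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="traffic-forecast-endpoint",           # placeholder name
    ContentType="application/json",
    Body=json.dumps({"speed": 87.5, "intensity": 1200}),
)
prediction = json.loads(response["Body"].read())
print(prediction)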

In the Maintaining phase, we want to be able to monitor the model performance. Is the data that we are using for inference similarly distributed to the data that we used to train the model? Are we getting a similar model performance, or is the model performing worse over time? Can we create automatic triggers that indicate when it is time to retrain the model? Can we also see how a model came to a prediction using explainable AI?
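One simple way to make the first of these questions concrete is a statistical comparison between a feature's training distribution and its live inference distribution. The sketch below uses a two-sample Kolmogorov-Smirnov test on synthetic speed data purely to illustrate the idea; the threshold and data are made up:

# Hedged sketch: a basic data drift check on one feature.
# The synthetic data and threshold are illustrative only.
import numpy as np
from scipy.stats import ks_2samp

def drift_detected(train_feature: np.ndarray, live_feature: np.ndarray,
                   p_threshold: float = 0.01) -> bool:
    """Flag drift when the live distribution differs significantly from training."""
    statistic, p_value = ks_2samp(train_feature, live_feature)
    return p_value < p_threshold

# Hypothetical usage: compare average speeds seen at training time vs. today.
rng = np.random.default_rng(0)
train_speeds = rng.normal(100, 15, size=5_000)
live_speeds = rng.normal(80, 20, size=1_000)   # slower traffic, so drift is expected
if drift_detected(train_speeds, live_speeds):
    print("Data drift detected - consider triggering a retraining pipeline")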

Which platform can help us achieve these goals?

These days, there are a lot of ML Ops solutions out there. We experimented with a few of them, and finally chose to use AWS SageMaker. We were already using AWS for our data storage and our previous, custom-built model deployment solution, and we liked that it is very flexible and allowed us to customize it to our needs.

How can we further improve the platform to better fit our data scientists’ needs?

While SageMaker provides a lot of nice tools that enhance our data science workflow, we encountered a hiccup: a lot of code was still necessary to make it all work! Furthermore, for our use cases, much of that code was boilerplate and could be automated away. As such, we decided to develop our own stack based on SageMaker which enables us to further automate and enhance the data science workflow. In 2021 and 2022, we started working on our solution. Enter: the ACE ML Ops platform.

The ACE ML Ops platform consists of two parts: A ‘front-end’ part called ACEmaker, and a ‘back-end’ part called the Platform Stack.

The front-end: ACEmaker

To aid in the process of developing, training, deploying, and monitoring a model, we created a wrapper around SageMaker. As we primarily use Python for all data science work at ACE, we created this wrapper in the form of a Python package. It abstracts away a lot of the boilerplate code you would normally have to write to use Sagemaker.

When using ACEmaker, our data scientists split their code up into distinct functions that follow the key parts of the data science workflow: data extraction, data processing, data validation, model training, model tuning, and inference. This allows for containerized execution of each part. In other words, each part is separated in code and run independently on SageMaker through Docker containers. To test this during development, we use a Python script called run_components.py with which you can call each of the functions. You specify the function name, what package requirements it has, and (if necessary) which instance type you want to use.

def preprocess_data(traffic_df):
    """Run the preprocessing steps and return train, test, and validation sets.

    Returns:
        dict: the train, test, and validation sets as pd.DataFrames (values)
        and their respective names (keys)
    """
    traffic_df = clean_traffic_data(traffic_df)
    traffic_df = filter_frequencies(traffic_df)
    traffic_df = create_target(traffic_df)
    traffic_df = filter_traffic_data(traffic_df)
    traffic_df = create_features(traffic_df)
    traffic_df = clean_column_names(traffic_df)
    train, test, val = train_test_split(traffic_df)

    return {'train': train, 'test': test, 'val': val}

## In run_components.py; used during testing

# Preprocessing
processing_step = project.data_processing(
    'functions/data_processing.py:preprocess_data',
    input_data='s3://xxxxxxx/xx/output',
    run_mode='cloud',
    instance_type='ml.m4.10xlarge',
    requirements=['pandas==1.5.0', 'numpy==1.23.3',
                  'scipy==1.9.1', 'pytz==2022.4'])

After you have finished writing the code for the different steps, you can turn your components into a pipeline. You do this by creating a .yml file describing what the pipeline will look like. This is very similar to run_components.py, just in YAML format. This YAML file is then used to replicate the pipeline in the preprod (and later prod) environment.

## In pipeline_config.yml, used during automatic deployment
## As you can see, it is very similar to the Python version, just in YAML format
## We only show the processing step here; the full file of course has more steps
## Each step is chained to the next in a specific order to make it work

pipeline-steps:
  ...
  processing-step:
    type: "Data processing"
    function: "functions/data_processing.py:preprocess_data"
    input_data:
      - extract-step.output_data_location
    requirements:
      - "pandas==1.5.0"
      - "numpy==1.23.3"
      - "scipy==1.9.1"
      - "pytz==2022.4"
    instance_type: "ml.m4.10xlarge"
  ...

As you can see, using this package offers a lot of advantages. As mentioned before, it saves the data scientists and engineers from writing a lot of boilerplate code. An added benefit is that it standardizes all of our data science models, which makes it easy for a new data scientist to get up to speed, even when working on older models.

There is still one thing missing before we can enjoy these benefits: We have to build up our Cloud stack so we can actually make use of all the resources available in AWS. Let’s take a look at the Platform Stack!

The back-end: ACE Platform Stack

As the name implies, the platform stack is our Infrastructure-as-Code solution for building and maintaining the AWS back-end in which our models are developed, trained, deployed, and maintained. In simple terms: this part contains the code that is used to build and configure our resources in the Cloud. Aside from the AWS side of things, we also use GitLab for both source control and CI/CD. The platform stack contains code that makes our GitLab environment work with AWS through a GitLab Runner. More on that later, let’s talk about creating Cloud resources first.

CDK

We use AWS CDK (Cloud Development Kit) to build the infrastructure part of our Cloud platform. This means specifying which resources (like storage or compute) we need and how they should be configured. It also includes setting up the necessary rights and policies to keep everything secure. We use CDK by importing it as a Python package and using premade constructs, which represent ready-made combinations of AWS resources. Constructs come in different levels, each with its own degree of customizability. You can use this, for example, to automatically create an S3 bucket while limiting the amount of configuration you need to do. Take a look at the following code snippet, for example:

from aws_cdk import App 
from stacks.s3_object_lambda_stack import S3ObjectLambdaStack

app = App()
S3ObjectLambdaStack(app, "S3ObjectLambdaExample")
app.synth()

# (code snippet from https://github.com/aws-samples/aws-cdk-examples/)

As you can see, spinning up a new stack of AWS resources only requires a few lines of code. If you want to change something about the configuration, you just change the code and redeploy it from the command line. AWS will take care of the rest. Easy, right?
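For comparison, a minimal stack that only provisions a versioned S3 bucket could look roughly like the sketch below (CDK v2 syntax; the stack and bucket identifiers are made up for illustration and are not part of our actual Platform Stack):

# Hedged sketch: a CDK v2 stack with a single versioned S3 bucket.
# Stack and bucket identifiers are hypothetical placeholders.
from aws_cdk import App, Stack, RemovalPolicy
from aws_cdk import aws_s3 as s3
from constructs import Construct

class DataBucketStack(Stack):
    """Example stack: one versioned S3 bucket that is retained on stack deletion."""
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        s3.Bucket(
            self,
            "ModelArtifactBucket",                 # logical ID within the stack
            versioned=True,
            removal_policy=RemovalPolicy.RETAIN,   # keep the data if the stack is removed
        )

app = App()
DataBucketStack(app, "DataBucketExample")
app.synth()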

Aside from being able to quickly change the configuration of our infrastructure to fit our needs, using the CDK provides even more benefits. For example, it enables us to use source control and makes it possible to use Python functionality like code reuse, importing functions, and more when building our environment. For more information on the benefits of this approach, I recommend looking at the AWS CDK Developer Guide.

So, what exactly does the back-end consist of? And how do we use CDK to build it? We’ll follow the four development stages mentioned earlier, starting with the Design stage.

Let’s start with SageMaker Studio. This is the environment in which data scientists interface with AWS SageMaker: in simple terms, a Machine Learning IDE in the Cloud, in the form of an extended JupyterLab environment. In SageMaker Studio, data scientists can write code in Jupyter notebooks, access the model registry, see whether their model pipelines ran successfully, and more. This is also where our data scientists import ACEmaker and write their data science code. We have also just finished integrating a Visual Studio Code server into SageMaker Studio as an alternative development environment, which shows why it pays to build your ML Ops tools on a flexible platform.

In order to facilitate Training and Testing the models we developed in SageMaker Studio, we make use of SageMaker Pipelines and SageMaker Experiments. SageMaker Pipelines automates steps of the Machine Learning workflow such as loading and transforming data, feeding it into a model for training, and tuning hyperparameters. We use the code written with ACEmaker to further automate this process by generating the pipelines automatically. After generating these pipelines, we can experiment with different types of models, data features, etc. by simply changing the code, pushing it, and rerunning the pipeline. SageMaker automatically tracks these experiments, allowing us to analyze them in a structured way and get to the best model as quickly as possible. After completing this step, the model is registered in the model registry, allowing us to keep track of all finished models.
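To make this concrete, the sketch below shows roughly what a single processing step of such a pipeline looks like when written directly against the SageMaker Pipelines SDK. The role ARN, bucket paths, and names are placeholders; ACEmaker generates this kind of definition for us from the code and YAML shown earlier, so this is an illustration of the underlying SDK rather than our actual generated pipeline:

# Hedged sketch: one processing step in a SageMaker pipeline.
# Role ARN, S3 paths, and names are hypothetical placeholders.
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.workflow.pipeline import Pipeline

role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"

processor = SKLearnProcessor(
    framework_version="1.0-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

step_process = ProcessingStep(
    name="PreprocessTrafficData",
    processor=processor,
    inputs=[ProcessingInput(source="s3://<bucket>/raw/",
                            destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(output_name="train",
                              source="/opt/ml/processing/train")],
    code="functions/data_processing.py",
)

pipeline = Pipeline(name="traffic-forecast-pipeline", steps=[step_process])
pipeline.upsert(role_arn=role)   # create or update the pipeline definition
pipeline.start()                 # kick off a run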

Once we have finished experimenting with different models, and we have found our best model configuration, it’s time to Deploy it to the development environment! We use Lambdas to create SageMaker Endpoints, API Gateways, and a model monitoring schedule, all based on the settings set in the original ACEmaker code.

We can then Monitor the model and the serving pipeline in SageMaker Studio using SageMaker Model Monitor, and view the logs through CloudWatch.

After we are happy with the model in our Dev environment, we can automatically deploy it to preprod by simply clicking a button in our GitLab environment. This recreates the pipelines, trains the latest model, registers it in the SageMaker Model Registry (after approval by the data scientist), and finally creates the endpoints. We can then test whether the model endpoint works with our stakeholders’ systems and, if all is well, redeploy to our production environment with the same process.

GitLab

As mentioned a couple of times before, we use GitLab for our source control and CI/CD. To interface with AWS from our GitLab environment we use GitLab Runners. These are applications that can execute GitLab pipeline jobs in our AWS environment. We use these to run the CDK commands to (re)deploy certain parts of our environment based on changes we made in the code. These CDK commands are run in AWS by the GitLab runner in an EC2 instance. So, all a data scientist or engineer needs to do is click a button in GitLab, and the runner will take care of the rest.

What do our data scientists think?

At ANWB, we work hard to inform our members about traffic conditions on the Dutch roads (link for reference). Typically, we report on current traffic jams including expected delays. We also offer advice in advance when rare circumstances are expected, such as harsh weather conditions in the coming days.

However, we would like to expand our services to include a live traffic forecast. To this end, we are launching a Proof-of-Concept forecasting model that predicts whether there will be slow traffic in 30 minutes at several locations. During this phase, we will first internally validate the performance of the model before we move on towards production. We use sensor data from traffic loops in the road that capture the average speed and intensity of traffic every minute. We will use the following locations:

The next step is to capture the training part of our Machine Learning development cycle in a pipeline. We follow the conventional steps: we extract historical data, which we preprocess before training a model. After model evaluation, the model is registered in our AWS SageMaker Model Registry, but only after it satisfies our performance constraints. After registration, we can schedule our serving pipeline to grab the latest available data, preprocess it, and produce forecasts. An overview of these steps is shown in the figure below.

As you can see, ACEmaker is a great way to speed up Machine Learning development. It allows a data scientist to quickly repurpose their experimental code into Dockerized components and organize them into logical pipelines for deployment, without having to dive into the nitty-gritty of Machine Learning engineering. It is therefore no longer necessary to hand code over to a data engineer for deployment, as the data scientist can do most of the work themselves. This allows for quick prototyping without going back and forth between team members, and frees up a lot of precious time.

Something for you?

So it’s official: building an ML Ops platform is no easy feat. It requires a lot of planning, research, and experimentation if you want a custom-built solution that fits the needs of your data science team. However, we feel that Machine Learning Operations is the next big thing to tackle in your team’s journey to data science professionalization. Are you ready for the next step?

Thanks for reading! From the Analytics Center of Excellence at ANWB

Authors: Ramon Pieternella (Data Scientist, Machine Learning Engineer) and Prashand Ramesar (Data Scientist)
