Building a Machine Learning Orchestration Platform: Part 1

Bernat Rafales · Published in Code & Wild · 11 min read · Sep 3, 2021

A while ago we started looking at ways to modernise our Machine Learning at Bloom & Wild. One of the first things the Data Team wanted to do was to start building ML models and algorithms, and to have a way to orchestrate their execution. We can think of an ML algorithm as an application that takes some data as an input, gets trained on that data so it can learn from it, and can then make informed decisions about outcomes when new data is fed into it. The ML model is the output of that whole process. Until recently we relied heavily on tools that allowed us to build simple pipelines (mainly AWS CodePipeline).

Figure 1: A Traditional pipeline

The problem with that is that pipelines often restrict you to a linear execution flow, whereas what we often need is for those ML models to feed each other in slightly more complex ways.

Figure 2: A Directed Acyclic Graph (DAG for short)

For this, we need to think of our ML flow as a Directed Acyclic Graph (DAG for short), where models do not feed each other in a purely linear way, but rather in a more complex and interdependent manner.

We started building out some of these models and soon realised there was a lot of repetition for each one. It was also creating bottlenecks: as soon as the Data Team needed a new model, someone from the Platform Team had to spend time setting it up.

This approach was not scalable, so it offered a great opportunity for collaboration between the Data Team and the Platform Team. In true MLOps fashion, we got together to build a platform that would enable the Data Team to build, test and deploy their models with as much autonomy as possible.

The building blocks of the platform

In a nutshell, what the Data Team needed was a platform with the following components:

  • Model Artefacts: these are the foundation of all the work. You can think of them as black boxes that, when run, take some inputs, do some processing on those inputs, and produce some outputs. In order for these models to run, they need compute and memory resources, as well as data stores where they can save their intermediate results or their outputs.
  • DAGs to orchestrate the execution flow of the Model Artefacts: once all the models are built and can run, we need a way to connect them together. The DAG should be able to run these models in a certain order, under certain conditions and on certain schedules. This is important partly because the output of one model is often part of the input of another, so the relationship between them is crucial.

Choosing the right technology

Often there are many ways to solve a problem. This was no different. We run most of our infrastructure in AWS, so finding a solution that could be built in AWS was key for us.

We considered using something like Amazon Managed Workflows for Apache Airflow (MWAA for short), because Apache Airflow is a very well known tool in the Data Science world for this kind of work, and it has tons of features and a huge open source community behind it.

The other alternative on the table was to take more of a DIY approach and split the project into two big parts:

  • Use AWS ECS Fargate to build the Model Artefacts: our models are Python applications, and Docker has proven to be a good tool for building and testing reproducible artefacts that can then be deployed and run on multiple platforms. ECS Fargate is a managed service that effectively lets you run any Docker container on demand, without the overhead of managing your own clusters or fleets of EC2 instances (a minimal Terraform sketch of such a task definition follows this list).
  • Use AWS Step Functions to build the DAGs: Step Functions is a service that allows you to build workflows to orchestrate the execution of other AWS services. It has built-in mechanisms to handle errors, retries and parallelisation, and it integrates very well with many other resources we use in AWS, like ECS tasks (which run our models), Lambdas, S3 buckets, and so on. It’s also fully managed by AWS and scales automatically, which is a big bonus for us.
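To make the ECS Fargate point a little more concrete, here is a minimal, illustrative Terraform sketch of a Fargate task definition for one of these containerised models. All names, ARNs and sizes below are placeholders rather than our actual configuration.

# Illustrative only: a minimal Fargate task definition for a containerised model.
# All names, ARNs and image URIs are placeholders.
resource "aws_ecs_task_definition" "example_model" {
  family                   = "production-example-model"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 1024 # 1 vCPU
  memory                   = 2048 # 2 GB
  execution_role_arn       = "arn:aws:iam::123456789012:role/example-model-execution" # placeholder

  container_definitions = jsonencode([
    {
      name      = "example-model"
      image     = "123456789012.dkr.ecr.eu-west-1.amazonaws.com/example-model:latest" # placeholder
      essential = true
    }
  ])
}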

After careful consideration and weighing the pros and cons of each option, we decided to go ahead with the DIY approach. The main drivers for the decision were running costs and the fact that, for now, most of the bells and whistles of something like Apache Airflow were not required for our use case. Given that all our models would still be packaged into Docker images and pushed to ECR, switching to MWAA later remained an option, as Airflow would only need to pull those same Docker images and execute them.

Enabling the Data Team to build their Models

In this first part of Building a Machine Learning Orchestration Platform, we will focus on the work the team did to build a platform that would enable the Data Team to fully own the process of creating a new Model. In a follow-up article we will cover the work done to enable the DAGs.

The first step was to find the commonalities between all the models the team needed to build, and from there find a suitable set of tools to get started. We settled on the following requirements:

  • Each model would be a Python application packaged using Docker.
  • The models would need to store their computed results somewhere, and would also need access to a temporary data store for internal calculations or caching. For the storage of artefacts we settled on S3 buckets, and for the temporary data store we decided to use DynamoDB.
  • For each model we wanted a deployment pipeline that would check out the code, build the Docker image from the latest commit and push it to an ECR repository.
  • For each model we also wanted an ECS service and task definition that would allow the Docker images to be run on demand.
  • We expected the ECS tasks to be triggered by a variety of events: the AWS CLI or CloudWatch Rules for development and testing, or Step Functions, to name a few (a sketch of a scheduled trigger follows this list).
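As an illustration of that last point, this is roughly what a scheduled trigger for a model’s ECS task could look like in Terraform. The names, ARNs and schedule are hypothetical placeholders, not our actual setup.

# Illustrative sketch: trigger a model's ECS task on a schedule via a CloudWatch Rule.
# All names and ARNs are placeholders.
resource "aws_cloudwatch_event_rule" "nightly_run" {
  name                = "production-example-model-nightly"
  schedule_expression = "cron(0 3 * * ? *)" # every day at 03:00 UTC
}

resource "aws_cloudwatch_event_target" "run_example_model" {
  rule     = aws_cloudwatch_event_rule.nightly_run.name
  arn      = "arn:aws:ecs:eu-west-1:123456789012:cluster/ml-models" # shared ECS cluster (placeholder)
  role_arn = "arn:aws:iam::123456789012:role/events-run-task"       # role allowed to call ecs:RunTask (placeholder)

  ecs_target {
    task_definition_arn = "arn:aws:ecs:eu-west-1:123456789012:task-definition/production-example-model"
    launch_type         = "FARGATE"

    network_configuration {
      subnets          = ["subnet-xxxxxxxxxxxx"] # private subnets (placeholder)
      assign_public_ip = false
    }
  }
}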

Below is a high level diagram of the proposed solution.

Figure 3: High level diagram of the proposed solution

Finding the right abstractions

Having decided on the underlying infrastructure and tools, the next step was finding the right abstractions in order to reduce the cognitive load on the Data Team. After all, what they ultimately wanted was to run their Python code in production in a safe way, and to have the means to monitor its performance and diagnose any issues. The Platform Team’s job is to provide the internal tools, services and processes so that other teams can do just that, without having to get into the nitty-gritty of provisioning and fully understanding all the underlying infrastructure.

In order to do that, we broke down the problem into two parts:

  • Creating a template GitHub repository for new models.
  • Provisioning all the underlying infrastructure using a Terraform module.

Creating a template repository for a new model

When developing a new model there’s a lot of boilerplate work that needs to be done every time. A few examples of tasks that are the same for each model:

  • An initial Python project skeleton, with some internal pip packages that are shared across all models.
  • A Dockerfile to successfully build the Docker image for the model.
  • The AWS CodeBuild spec files that will be used as part of the CI/CD pipeline.
  • Some helper scripts to build, run and test the models locally, if needed.

By creating a GitHub template repository, we make starting the development of a new model very simple: the only required steps are to click the “Use this template” button in GitHub, give your repository a name and a description, and then replace a few templated values in a few places.

If you are interested in what a trimmed-down version of our template repository looks like, we have made one publicly accessible on our GitHub profile, under the name opensource-ml-model-template.

Creating a Terraform module for provisioning new models

The second part of the work was to create the Terraform code needed to easily provision all the required infrastructure for a new model.

Some underlying infrastructure is shared among models and is created and maintained by the Platform Team separately. Examples of shared resources include:

  • A common ECS Fargate Cluster where all ML models can run.
  • VPC configuration so we can run our ECS tasks securely and in isolation from other services.
  • A CodeStar connection to connect our GitHub repositories with AWS CodePipeline.

What was really key for us when it came to Terraform was to ensure the Data Team could still self-serve the provisioning of new infrastructure as much as possible. With the right abstractions and some compromises in favour of convention over configuration, we ended up with a setup where the following is all the Terraform code required to spin up a new ML model in our AWS infrastructure:
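(The snippet below is an illustrative sketch: the module source and variable names are placeholders rather than the real module’s exact interface.)

# Illustrative sketch only: module source and variable names are placeholders.
module "example_model" {
  source = "github.com/your-org/terraform-aws-ml-model" # replace with wherever the module actually lives

  model_name        = "example-model"          # used as a prefix for every resource the module creates
  environment       = "production"             # environment the model should run in
  github_repository = "your-org/example-model" # repository holding the model's code
  branch            = "main"                   # branch that triggers the pipeline
}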

Behind the scenes, this Terraform module creates all the infrastructure pictured in Figure 3 (a simplified sketch of what this could look like follows the list):

  • An AWS CodePipeline triggered when code gets pushed to a specified branch in a GitHub repository.
  • An AWS CodeBuild project, triggered by the pipeline, which builds the model’s Docker image and pushes it to an ECR repository.
  • An ECR repository to store the Docker images for the new model.
  • An ECS service and associated Task Definition so the Docker images can be run.
  • An S3 bucket for the model to store its outputs.
  • And all the underlying artefacts that support the pipeline and ECS, such as CloudWatch Log Groups for logging and IAM roles and policies to allow access to the required resources.
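To give a flavour of what sits behind that interface, here is a heavily simplified, hypothetical extract showing how a naming convention could drive the resources the module creates. This is not the module’s real code, just a sketch of the idea.

# Hypothetical, simplified extract of what such a module might do internally.
variable "model_name" {
  type = string
}

variable "environment" {
  type = string
}

locals {
  # Convention: every resource belonging to a model shares the "<environment>-<model name>" prefix
  name_prefix = "${var.environment}-${var.model_name}"
}

resource "aws_ecr_repository" "model" {
  name = local.name_prefix # e.g. "production-example-model"
}

resource "aws_s3_bucket" "outputs" {
  bucket = "${local.name_prefix}-outputs" # bucket where the model stores its results
}

resource "aws_cloudwatch_log_group" "model" {
  name              = "/ecs/${local.name_prefix}"
  retention_in_days = 30
}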

The beauty of this is that all of the above complexity is buried away and can be maintained and updated by the Platform Team, whereas the consumers of the module don’t need to worry about any of it: they only need to be aware of high level concerns such as where the code lives, what the model is called, and what environment it should run in.

How and when the actual infrastructure is provisioned will depend on what kind of Terraform flow is implemented in your organisation.

As with the model GitHub template repository, we have also created a slimmed-down version of our Terraform module. It is available on our public GitHub profile as well, under the name terraform-aws-ml-model. With these two GitHub repositories, a fully working solution should be deployable to AWS out of the box.

Managing secrets and configuration

Another requirement from the Data Team was to have the autonomy to manage their own models’ configuration and secrets. For this we decided to use AWS SSM Parameter Store. Each ECS task gets permissions to read and decrypt parameters under a unique path, and certain members of the Data Team get the appropriate IAM permissions to put parameters under those same prefixes, allowing them to effectively inject any configuration or secrets into their workloads.
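As a rough sketch of how this can be wired up, the policy attached to a model’s ECS task role might be scoped to a per-model parameter path along these lines. The path convention, role name and ARNs are hypothetical.

# Illustrative sketch: scope a model's ECS task role to its own SSM parameter path.
# The path convention, role name and ARNs are placeholders.
data "aws_iam_policy_document" "ssm_read" {
  statement {
    actions = [
      "ssm:GetParameter",
      "ssm:GetParameters",
      "ssm:GetParametersByPath",
    ]
    resources = ["arn:aws:ssm:eu-west-1:123456789012:parameter/production/example-model/*"]
  }
  # Depending on the KMS key used to encrypt SecureString parameters,
  # a kms:Decrypt statement may also be needed.
}

resource "aws_iam_role_policy" "task_ssm_read" {
  name   = "ssm-read"
  role   = "production-example-model-task" # the model's ECS task role (placeholder)
  policy = data.aws_iam_policy_document.ssm_read.json
}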

The power of conventions

We’ve mentioned that our GitHub template repository, as well as our Terraform module, have opinions. We took the approach of favouring convention over configuration because it allowed us to cover 80% of our needs while greatly simplifying the underlying code required to get us there. For example, we decided to make extensive use of prefixes in our cloud resources, which reduced the amount of work needed for IAM permissions through careful use of wildcards (more on that in Part 2 of this series). Careful use of other naming conventions also made some of our observability tools, like Datadog, easier to integrate.

We also baked some flexibility into our Terraform module to allow for customisation, like the ability to create extra resources, or the ability to specify ad-hoc IAM permissions (so that, for example, a model can read from another model’s S3 bucket and use it as an input).

Putting it all together

As a summary, this is what the whole process now looks like when a new model needs to be created:

  • Create a new GitHub repository from the template opensource-ml-model-template.
  • Make the necessary changes to some of the templated values if needed, and push the repository to your own GitHub account. For the sake of this example we will assume it has been pushed to github.com/foo/bar.

As for the infrastructure, this assumes you have a working AWS account with some infrastructure already in place, such as an ECS cluster and a VPC to run your workloads in. If that’s the case, the following is all the Terraform code you should need to provision a working pipeline and runtime environment:
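(Again, this is an illustrative sketch: the module source and variable names are placeholders, so check the public module for its actual interface.)

# Illustrative sketch: provision the pipeline and runtime for the repository created above.
module "bar_model" {
  source = "github.com/your-org/terraform-aws-ml-model" # replace with the module's real location

  model_name        = "example-model"
  environment       = "production"
  github_repository = "foo/bar" # the repository created from the template
  branch            = "main"
}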

Note that some of the values in there will need to be replaced with your own, and that some of them might be of a sensitive nature, so make sure you handle them appropriately.

After running terraform apply and applying the plan, you’re all set up.

Push new code to the main branch of your repository and the pipeline will trigger:

CodePipeline executing after a code change

This pipeline will push a Docker image to the newly created ECR repository:

ECR repository

And our ECS task definition is also ready to be used:

ECS task definition

We can try the ECS task out by triggering an ECS execution via the AWS CLI with a command such as this:

aws ecs run-task --cluster <your ecs cluster name> --task-definition production-example-model --launch-type FARGATE --network-configuration "awsvpcConfiguration={subnets=[\"subnet-xxxxxxxxxxxx\",\"subnet-xxxxxxxxxxxx\"],assignPublicIp=\"DISABLED\"}"

The ECS cluster name and subnet IDs will have to be replaced with your own values. After running the command above, here is what appeared in our Datadog logs console:

Logs for the task execution

This means the task executed successfully and was configured correctly, already aligned with the rest of our services when it comes to observability.

Final thoughts

Overall, both the Data and Platform teams were quite happy with the solution we built. The Data Team has gained a huge amount of autonomy when it comes to building and deploying ML models in our AWS infrastructure. All the infrastructure code they need is just a few lines of Terraform, and everything else is handled and managed by them in the GitHub template repository for their Python projects. On the other hand, the work required from the Platform Team has also decreased considerably. The Terraform code still goes through a pull request review, and a Platform Engineer is still ultimately required to hit the merge button and get the new Terraform code applied, but given the simplicity of the Terraform module this can be done in a matter of minutes.

Before embarking on this project, each model had to be created from scratch, with quite a lot of back and forth between the teams, and with lots of work repeated over and over again. This could easily take more than a week to complete. After we finished this work, getting a new boilerplate model up and running in production is something that can easily be completed in half a day.

Stay tuned for Part 2 of this series, where we will cover how we built a framework to create the DAGs that run all these models.
