Building a Machine Learning Orchestration Platform: Part 2

Rashmi Itagi
Published in Code & Wild
Sep 29, 2021 · 6 min read

In Part 1 of this series we focused on how, here at Bloom & Wild, we build artefacts for our Machine Learning models and algorithms using AWS ECS Fargate. These Model Artefacts take inputs, process them and produce outputs in a pipeline. The Platform team provides the compute resources and the storage for the data that flows through this pipeline.

Part two of this series focuses on the use of Directed Acyclic Graphs (DAGs) to orchestrate the execution flow of the Model Artefacts using AWS Step Functions. Step Functions lets you build visual workflows that enable fast translation of business requirements into technical requirements. With Step Functions, you can set up dependency management and failure handling using a JSON-based template.

Enabling the Data Team to run their Models

Based on the commonalities found between the models in Part 1 of this series, we built the artefacts against a common set of requirements and pushed them as Docker images to AWS ECR repositories. Next, we had to find a way for our Data team to work autonomously with DAGs to run their ML models.

Requirements for this part of the project were:

  • The definition of AWS State Machines stays with the Data team so they can independently manage their DAGs
  • Our Data team works in Python, so we used Jinja as the template engine for interpolating environment variables into the state machine definitions; a minimal rendering sketch follows this list. (State machines can be easily and clearly defined as Infrastructure as Code, independent of any language used for application development)
  • A way to schedule cron jobs that run the state machines at defined intervals
  • Support for multiple state machines with complex, interdependent models
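
To make the Jinja interpolation requirement concrete, here is a minimal sketch, assuming jinja2 is installed and assuming hypothetical environment variable names (ENVIRONMENT, AWS_REGION, AWS_ACCOUNT_ID, AWS_SUBNET_IDS); the real application may differ.

import os
from jinja2 import Environment, FileSystemLoader

# Minimal sketch, not the actual application code: render every template in
# templates/ with the values the state machine definitions need.
jinja_env = Environment(loader=FileSystemLoader("templates"))

context = {
    "environment": os.environ["ENVIRONMENT"],        # e.g. "production"
    "aws_region": os.environ["AWS_REGION"],          # e.g. "eu-west-1"
    "aws_account_id": os.environ["AWS_ACCOUNT_ID"],
    "aws_subnet_ids": os.environ["AWS_SUBNET_IDS"],  # e.g. '"subnet-a", "subnet-b"'
}

definitions = {}
for name in jinja_env.list_templates(extensions=["jinja"]):
    # Each rendered template is a plain JSON state machine definition string.
    definitions[name] = jinja_env.get_template(name).render(**context)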

The figure below shows a high-level diagram of the AWS Step Functions implementation that enables the Data team to run a built container image via AWS ECS Fargate.

Figure 1: High level diagram showing implementation of AWS Step Functions

As part of finding the right abstractions for the given requirements, we split the problem into two parts:

  • Creating a GitHub repository containing the Infrastructure as Code for creating AWS Step Functions, the State Machine definitions, and the AWS CloudWatch Rules that run the defined State Machines at intervals.
  • Setting up AWS IAM permissions that are project- and environment-specific, isolating projects from each other.

Creating the GitHub Repository

The repository consists of the following:

  • A CodePipeline pipeline that triggers on each commit made to this repository. We have two separate pipelines, one for staging and one for production
  • A Dockerfile, which builds a Docker image with the application and runs it
  • The AWS Step Functions’ State Machine Definitions, defined as templates. The application looks for template files in the templates/ folder, and each template follows the Jinja templating format for interpolation of variables
  • The templates use the following environment variables, which the state machine definitions need in order to build names and AWS ARNs:
  1. environment: contains the environment the application runs in, for example, production
  2. aws_region: the region the template will run in, for example, eu-west-1
  3. aws_account_id: the AWS account ID that hosts the Step Functions, for example, 0123456789
  4. aws_subnet_ids: a list of subnet identifiers, as a comma separated list of strings
  • A set of schedules which define the cron jobs that run the Step Functions that are created; a sketch of this step follows below. The application reads the schedules/schedules.json file, a JSON array of objects, each with a single key/value pair. For each key/value pair the application creates a CloudWatch Rule on the given schedule that triggers a Step Function: the key is the Step Function to trigger, and the value is the schedule expression in the CloudWatch Event format. The CloudWatch Rules are only enabled for the production pipeline. On staging they are still created so the schedule can be checked, but they are disabled so no Step Functions on staging get triggered automatically
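
Here is a hedged sketch of how the schedules step could look in Python with boto3. The rule naming convention, the target role and the state machine ARN format are illustrative assumptions, not the exact code we run.

import json
import os

import boto3

events = boto3.client("events")

environment = os.environ["ENVIRONMENT"]   # e.g. "staging" or "production"
region = os.environ["AWS_REGION"]
account_id = os.environ["AWS_ACCOUNT_ID"]

with open("schedules/schedules.json") as f:
    schedules = json.load(f)              # a JSON array of single-pair objects

for entry in schedules:
    for step_function, expression in entry.items():
        rule_name = f"{environment}-{step_function}-dag"   # assumed naming convention
        events.put_rule(
            Name=rule_name,
            ScheduleExpression=expression,
            # Rules only run on production; on staging they exist but stay disabled.
            State="ENABLED" if environment == "production" else "DISABLED",
        )
        events.put_targets(
            Rule=rule_name,
            Targets=[{
                "Id": step_function,
                # Assumed ARN format for the Step Function to trigger.
                "Arn": f"arn:aws:states:{region}:{account_id}:stateMachine:{environment}-{step_function}",
                # Assumed IAM role that allows CloudWatch Events to start executions.
                "RoleArn": f"arn:aws:iam::{account_id}:role/{environment}-events-to-stepfunctions",
            }],
        )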

Here is an example to illustrate this. Let’s consider creating a repository called data-dag with a file called templates/data.json.jinja containing the following code.

{  "StartAt": "Run an ECS Task and wait for it to complete",  "States": {    "Run an ECS Task": {      "Type": "Task",      "Resource": "arn:aws:states:::ecs:runTask.sync",      "Parameters": {        "Cluster": "arn:aws:ecs:{{aws_region}}:{{aws_account_id}}:cluster/{{environment}}",        "TaskDefinition": "{{environment}}-example-model",        "LaunchType": "FARGATE",        "Overrides": {          "ContainerOverrides": []        },        "NetworkConfiguration": {          "AwsvpcConfiguration": {            "Subnets": [{{aws_subnet_ids}}],            "AssignPublicIp": "DISABLED"          }        }      },      "End": true    }  }}

And another file called schedules/schedules.json with the following code in it.

[{"data": "cron(0 * * * ? *)"}]

When the application runs on production, for example, it creates an AWS Step Function State Machine with the following state machine definition.

{  "StartAt": "Run an ECS Task and wait for it to complete",  "States": {    "Run an ECS Task": {      "Type": "Task",      "Resource": "arn:aws:states:::ecs:runTask.sync",      "Parameters": {        "Cluster": "arn:aws:ecs:eu-west-1:0123456789:cluster/production",        "TaskDefinition": "production-example-model",        "LaunchType": "FARGATE",        "Overrides": {          "ContainerOverrides": []        },        "NetworkConfiguration": {          "AwsvpcConfiguration": {            "Subnets": [              "subnet-01a23b45c67d89e",              "subnet-01a23b45c67d89ef"            ],            "AssignPublicIp": "DISABLED"          }        }      },      "End": true    }  }}

The figure below shows the visual workflow for this State Machine.

Figure 2: AWS State Machine Definition Visual Workflow

An AWS CloudWatch Rule called production-data-dag is also created, scheduled to run at minute 0 of every hour. This setup triggers an AWS ECS task defined in the production-example-model task definition. This task definition is configured to use the latest image from the AWS ECR repository that was pushed by the AWS CodePipeline.
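
For completeness, registering the rendered definition as a State Machine could look roughly like the sketch below; the file path, state machine name and execution role ARN are illustrative assumptions, and an already existing state machine would need an update call rather than a create.

import boto3

sfn = boto3.client("stepfunctions")

# Assumed path to the rendered output of templates/data.json.jinja.
with open("rendered/data.json") as f:
    definition = f.read()

# Name and roleArn are placeholders, not our exact values.
sfn.create_state_machine(
    name="production-data",
    definition=definition,
    roleArn="arn:aws:iam::0123456789:role/production-step-functions-execution-role",
)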

Setting up AWS IAM Permissions

Our Data team already has quite a number of models, and many more are on the way. This led us to think about limiting each group of engineers’ access to only the models they work on. However, restricting access to a limited set of models would not be very helpful: the engineers are all one team, and at some point they might need access to models they don’t regularly work on. Hence, we considered the use of naming conventions to resolve this.

We created namespaces for each group of engineers and for the data models. We then configured the Data team’s IAM users with permission to assume a service role to work with an individual model in a particular environment. For example, we allow a user to assume a role called ds-staging-foo-model-user-role when they want to trigger an ECS task that runs the foo model in the staging environment. With this, everyone in the team has access to all the models they need, while each model is still accessed through its own role. This puts users in a better position to worry less about accidentally changing a model other than the one they intended to work on.
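
From the Data team’s side, assuming the model-specific role before triggering a task could look like this sketch; the account ID and session name are placeholders.

import boto3

sts = boto3.client("sts")

# Assume the per-model, per-environment role before touching the foo model on staging.
response = sts.assume_role(
    RoleArn="arn:aws:iam::0123456789:role/ds-staging-foo-model-user-role",
    RoleSessionName="foo-model-staging",
)

credentials = response["Credentials"]
ecs = boto3.client(
    "ecs",
    aws_access_key_id=credentials["AccessKeyId"],
    aws_secret_access_key=credentials["SecretAccessKey"],
    aws_session_token=credentials["SessionToken"],
)
# This ECS client only has the permissions granted to the foo model's staging role.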

Conclusion

To summarise, we managed to split the infrastructure and the code in such a way that the Data team has autonomy over their DAGs, which allows them to test, deploy and schedule them independently of the Platform team. The Platform team simply provides a service that allows the Data team to self-serve: to develop, test and run their data models. Hopefully this will mean more autonomy for them, less repetitive work for the Platform team, and faster delivery of ML models that benefit the business and our customers.
