Creating a Scalable ML Pipeline Infrastructure: Leveraging Amazon SageMaker and AWS Step Functions to Streamline Machine Learning Workflows for Production
Companies of all sizes, across all industries, are increasingly making use of machine learning. It is therefore more important than ever to have a universal ML pipeline that fits most projects.
One of the most common challenges in setting up such a pipeline is building an automatic bridge between the model training and serving layers, so that whenever a new model is trained, the service layer is updated seamlessly with the fresh model.
This blog post explains how we built our universal ML pipeline for production with Amazon SageMaker, AWS Step Functions, Amazon CloudWatch, and AWS Lambda.
What did we want to solve?
In order to satisfy the common requirement of our ML projects (continuous training and deployment), we had been using many different AWS services. This resulted in technical overhead, exacerbated by our small team sizes. To solve this, we wanted one standard ML infrastructure solution which we could ideally then reuse for most of our existing and future ML projects.
How did we solve it?
Most of our ML pipelines follow the pattern below:
- Data Preparation
- Model Building
- Serving
Below is our high-level architecture, which helps to understand how we solved this problem at MyHammer.
This diagram explains the development and deployment flow of the full ML Pipeline. Let’s go over the setup in more detail below:
Step 1:
For the snippet template, use the AWS ‘Bring Your Own Algorithm’ example for Amazon SageMaker from GitHub.
Write the training logic inside the ‘train’ file of the template:
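What follows is only a minimal sketch of such a ‘train’ file, not our production code: the container paths under /opt/ml come from the SageMaker template convention, while the scikit-learn model, the CSV layout, and the hyperparameter name are illustrative assumptions.

```python
#!/usr/bin/env python
# Minimal training entry point for a SageMaker bring-your-own container.
# SageMaker mounts input data, config, and output locations under /opt/ml.
import json
import os
import pickle
import sys
import traceback

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

PREFIX = "/opt/ml"
TRAINING_PATH = os.path.join(PREFIX, "input/data/training")
MODEL_PATH = os.path.join(PREFIX, "model")
PARAM_PATH = os.path.join(PREFIX, "input/config/hyperparameters.json")
FAILURE_PATH = os.path.join(PREFIX, "output/failure")

def train():
    with open(PARAM_PATH) as f:
        params = json.load(f)  # hyperparameters arrive as strings
    # Illustrative assumption: a single CSV with the label in the first column.
    data = pd.read_csv(os.path.join(TRAINING_PATH, "train.csv"))
    y, X = data.iloc[:, 0], data.iloc[:, 1:]
    model = RandomForestClassifier(n_estimators=int(params.get("n_estimators", 100)))
    model.fit(X, y)
    with open(os.path.join(MODEL_PATH, "model.pkl"), "wb") as f:
        pickle.dump(model, f)

if __name__ == "__main__":
    try:
        train()
        sys.exit(0)
    except Exception:
        # Writing to the failure file surfaces the error in the SageMaker console.
        with open(FAILURE_PATH, "w") as f:
            f.write(traceback.format_exc())
        sys.exit(1)
```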
Write the service layer in the predictor file:
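Again as a hedged sketch: the predictor can be a small Flask app exposing the /ping and /invocations routes that SageMaker calls on a bring-your-own container. The CSV request format and the model.pkl artifact name are assumptions carried over from the training sketch above.

```python
# Minimal predictor for a SageMaker bring-your-own container.
# SageMaker health-checks GET /ping and routes inference traffic to POST /invocations.
import io
import os
import pickle

import flask
import pandas as pd

MODEL_PATH = "/opt/ml/model/model.pkl"  # written by the train script above

app = flask.Flask(__name__)
model = None

def get_model():
    # Lazily load the model artifact once per container.
    global model
    if model is None and os.path.exists(MODEL_PATH):
        with open(MODEL_PATH, "rb") as f:
            model = pickle.load(f)
    return model

@app.route("/ping", methods=["GET"])
def ping():
    # Report healthy only when the model artifact can be loaded.
    status = 200 if get_model() is not None else 404
    return flask.Response(response="\n", status=status, mimetype="application/json")

@app.route("/invocations", methods=["POST"])
def invocations():
    # Illustrative assumption: the client sends CSV rows of features.
    data = pd.read_csv(io.StringIO(flask.request.data.decode("utf-8")), header=None)
    predictions = get_model().predict(data)
    out = io.StringIO()
    pd.DataFrame(predictions).to_csv(out, header=False, index=False)
    return flask.Response(response=out.getvalue(), status=200, mimetype="text/csv")
```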
Develop the whole solution locally and write unit and integration tests for both the pipeline and the service.
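For example, a unit test can exercise the service layer with Flask’s built-in test client (assuming the predictor sketch above is saved as a hypothetical predictor.py):

```python
# test_predictor.py - unit test for the service layer, run with `pytest`.
import predictor  # hypothetical module name for the predictor sketch above

def test_ping_reports_unhealthy_without_model():
    client = predictor.app.test_client()
    # No model artifact exists in the local test environment, so /ping should 404.
    assert client.get("/ping").status_code == 404
```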
Step 2:
After local testing, the code should be pushed to GitLab. The GitLab CI pipeline then executes the Docker build and tests the codebase.
If the changes are only related to the API, then we do not need to run the training and only need to update the relevant endpoints. For this, we have a CI step that updates the endpoint directly: the image is pushed to ECR and the SageMaker endpoint is updated straight from the GitLab pipeline.
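A sketch of what this CI step can look like with boto3; the image URI, artifact location, role, and endpoint names are all placeholders:

```python
# ci_update_endpoint.py - roll a freshly pushed ECR image onto an existing endpoint.
import time

import boto3

sagemaker = boto3.client("sagemaker")

IMAGE_URI = "123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-ml-image:latest"  # placeholder
MODEL_DATA = "s3://my-bucket/model/model.tar.gz"  # placeholder artifact location
ROLE_ARN = "arn:aws:iam::123456789012:role/Sagemaker-Role-With-Required-Permission"
ENDPOINT_NAME = "my-ml-endpoint"  # placeholder

suffix = time.strftime("%Y%m%d%H%M%S")
model_name = f"my-ml-model-{suffix}"
config_name = f"my-ml-config-{suffix}"

# Register a new model version that points at the updated image.
sagemaker.create_model(
    ModelName=model_name,
    PrimaryContainer={"Image": IMAGE_URI, "ModelDataUrl": MODEL_DATA},
    ExecutionRoleArn=ROLE_ARN,
)

# Create a new endpoint config referencing the new model...
sagemaker.create_endpoint_config(
    EndpointConfigName=config_name,
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
    }],
)

# ...and switch the live endpoint to that config without downtime.
sagemaker.update_endpoint(EndpointName=ENDPOINT_NAME, EndpointConfigName=config_name)
```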
If the changes are related to model training, then we push the image to ECR and the updated version of the Docker image gets picked up in the next run of the Step Function.
Step 3:
In the next stage, we set up a Step Function that picks up the image from ECR and deploys the training job and the service to SageMaker. State machine code snippet:
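The following is only a sketch of such a state machine in Amazon States Language, expressed as a Python dict and registered with boto3; all ARNs, bucket names, and instance types are placeholders:

```python
# create_state_machine.py - sketch of the training-and-deployment state machine.
import json

import boto3

# Placeholder ARNs and names - substitute your own.
ROLE = "arn:aws:iam::123456789012:role/Sagemaker-Role-With-Required-Permission"
IMAGE = "123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-ml-image:latest"

definition = {
    "StartAt": "TrainModel",
    "States": {
        "TrainModel": {
            "Type": "Task",
            # ".sync" makes Step Functions wait until the training job finishes.
            "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
            "Parameters": {
                "TrainingJobName.$": "$$.Execution.Name",
                "RoleArn": ROLE,
                "AlgorithmSpecification": {"TrainingImage": IMAGE, "TrainingInputMode": "File"},
                "InputDataConfig": [{
                    "ChannelName": "training",
                    "DataSource": {"S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": "s3://my-bucket/training-data",
                        "S3DataDistributionType": "FullyReplicated",
                    }},
                }],
                "OutputDataConfig": {"S3OutputPath": "s3://my-bucket/model"},
                "ResourceConfig": {"InstanceType": "ml.m5.large", "InstanceCount": 1, "VolumeSizeInGB": 30},
                "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
            },
            "Next": "SaveModel",
        },
        "SaveModel": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sagemaker:createModel",
            "Parameters": {
                "ModelName.$": "$$.Execution.Name",
                "ExecutionRoleArn": ROLE,
                "PrimaryContainer": {
                    "Image": IMAGE,
                    # Model artifacts produced by the training step above.
                    "ModelDataUrl.$": "$.ModelArtifacts.S3ModelArtifacts",
                },
            },
            "Next": "CreateEndpointConfig",
        },
        "CreateEndpointConfig": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sagemaker:createEndpointConfig",
            "Parameters": {
                "EndpointConfigName.$": "$$.Execution.Name",
                "ProductionVariants": [{
                    "VariantName": "AllTraffic",
                    "ModelName.$": "$$.Execution.Name",
                    "InstanceType": "ml.m5.large",
                    "InitialInstanceCount": 1,
                }],
            },
            "Next": "CheckEndpoint",
        },
        "CheckEndpoint": {
            "Type": "Task",
            # Lambda shown further below; returns {"EndpointExists": true/false}.
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:check-endpoint",
            "Parameters": {"EndpointName": "my-ml-endpoint"},
            "Next": "CreateOrUpdate",
        },
        "CreateOrUpdate": {
            "Type": "Choice",
            "Choices": [{"Variable": "$.EndpointExists", "BooleanEquals": True, "Next": "UpdateEndpoint"}],
            "Default": "CreateEndpoint",
        },
        "UpdateEndpoint": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sagemaker:updateEndpoint",
            "Parameters": {"EndpointName": "my-ml-endpoint", "EndpointConfigName.$": "$$.Execution.Name"},
            "End": True,
        },
        "CreateEndpoint": {
            "Type": "Task",
            "Resource": "arn:aws:states:::sagemaker:createEndpoint",
            "Parameters": {"EndpointName": "my-ml-endpoint", "EndpointConfigName.$": "$$.Execution.Name"},
            "End": True,
        },
    },
}

boto3.client("stepfunctions").create_state_machine(
    name="ml-train-and-deploy",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/StepFunctions-Execution-Role",  # placeholder
)
```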
Here the role ‘Sagemaker-Role-With-Required-Permission’ should carry all the permissions SageMaker needs to run the training and deploy the endpoint, i.e. access to every AWS resource and service your code requires during training and deployment. E.g. if your training reads from an S3 bucket, the corresponding S3 permissions must be attached to the role.
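As an illustration, granting the role access to a training bucket might look like this inline policy (bucket, role, and policy names are placeholders):

```python
# attach_s3_policy.py - example of granting the SageMaker role access to a bucket.
import json

import boto3

policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
        "Resource": [
            "arn:aws:s3:::my-bucket",    # placeholder bucket
            "arn:aws:s3:::my-bucket/*",
        ],
    }],
}

boto3.client("iam").put_role_policy(
    RoleName="Sagemaker-Role-With-Required-Permission",
    PolicyName="training-data-access",
    PolicyDocument=json.dumps(policy),
)
```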
As per this design, every execution of the Step Function checks whether the endpoint already exists, with the help of the Lambda function given below. If it exists, the state machine takes the path that only updates the endpoint; if not, it takes the path that creates one.
This check happens on every run of the pipeline. Lambda function to check endpoint availability:
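A minimal version of such a Lambda function could look like the following; the event shape and the EndpointExists return key are assumptions that must match the Choice state in the state machine above:

```python
# check_endpoint.py - Lambda that tells the state machine whether the endpoint exists.
import boto3
from botocore.exceptions import ClientError

sagemaker = boto3.client("sagemaker")

def lambda_handler(event, context):
    endpoint_name = event["EndpointName"]
    try:
        sagemaker.describe_endpoint(EndpointName=endpoint_name)
        return {"EndpointExists": True}
    except ClientError as error:
        # describe_endpoint raises a ValidationException when no such endpoint exists.
        if error.response["Error"]["Code"] == "ValidationException":
            return {"EndpointExists": False}
        raise
```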
Step 4:
Finally, we schedule the Step Function executions from CloudWatch by writing a simple scheduled rule.
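For example, a daily retraining schedule can be set up with boto3 as follows (rule name, rate, and ARNs are placeholders):

```python
# schedule_pipeline.py - CloudWatch Events rule that starts the state machine on a schedule.
import boto3

events = boto3.client("events")

# Trigger the pipeline once a day; adjust the rate to your retraining needs.
events.put_rule(
    Name="ml-pipeline-daily",          # placeholder rule name
    ScheduleExpression="rate(1 day)",
    State="ENABLED",
)

events.put_targets(
    Rule="ml-pipeline-daily",
    Targets=[{
        "Id": "ml-train-and-deploy",
        "Arn": "arn:aws:states:eu-west-1:123456789012:stateMachine:ml-train-and-deploy",
        # Role that allows CloudWatch Events to start Step Functions executions.
        "RoleArn": "arn:aws:iam::123456789012:role/Events-Invoke-StepFunctions",
    }],
)
```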
Why is it important?
- Faster delivery of the product, as it is a one-time setup effort.
- Less infrastructure management, so fewer resources are required.
- Easy to scale.
- Benefits of SageMaker, such as Hyperparameter Optimization.
- One platform/service solution for most ML products.
Conclusion:
This is a completely universal solution: it does not matter which algorithm is run for training or what kind of code is written for the service layer. As the infrastructure is a one-time setup, it is a simple matter of replicating it for most ML projects, especially those which demand regular model updates.
This also frees up a lot of overhead in terms of technical management. The design uses a minimal set of infrastructure services, and this minimal, clean design helps new team members onboard and become productive faster.
We can also write a validation test that checks for data and concept drift before and after training. The training only passes if the validation of the model is successful, so the service-layer endpoints are only updated after a successful training run. In this way we can prevent bad models from being promoted to production, as sketched below.
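A minimal sketch of such a gate, meant to be called from the train script after fitting and assuming a hold-out set and an accuracy metric (the threshold is illustrative):

```python
# validate.py - promotion gate called from the train script after fitting.
from sklearn.metrics import accuracy_score

MIN_ACCURACY = 0.9  # illustrative threshold, tune per project

def validate_or_fail(model, X_holdout, y_holdout):
    """Raise if the new model is not good enough to be promoted."""
    accuracy = accuracy_score(y_holdout, model.predict(X_holdout))
    if accuracy < MIN_ACCURACY:
        # The train script exits non-zero on an exception, the SageMaker training
        # job fails, and the state machine never reaches the endpoint update step.
        raise ValueError(f"Validation failed: accuracy {accuracy:.3f} < {MIN_ACCURACY}")
```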