MLOps Part 2: Machine Learning Pipeline Automation with AWS

Jack Sandom
Published in Slalom Data & AI · May 26, 2020

In part 1 of this blog series, we explained the challenges around making machine learning projects successful, introduced MLOps, and set out a view of what a mature MLOps capability looks like. In this post, we examine how AWS and infrastructure-as-code can be leveraged to build a machine learning automation pipeline for a real-world use-case.

Reusable Infrastructure-as-code

To get started on leveraging AWS services for MLOps, we have built the following:

  • Standard architecture for an automated and serverless MVP MLOps training and deployment pipeline
  • Infrastructure-as-code using Terraform to easily provision necessary AWS services
  • Sample use-case (employee attrition model) utilising an interpretable bring-your-own-model approach and batch inference

Standard Architecture

The diagram shown below is an example of what an automated MLOps pipeline could look like in AWS. This is for a batch inference model deployment where a persistent endpoint is not required (although it is optionally available as part of the infrastructure build). It uses a serverless architecture which has benefits in terms of cost-efficiency and ease of development. The key AWS components are:

  • AWS Step Functions for orchestrating the various jobs within the pipeline and incorporating logic for model validation (a Terraform sketch of this orchestration follows the list)
  • Amazon S3 for initial data storage/data lakes, storing flat-file data extracts, source code, model objects, inference output, and metadata (the initial feature store might also be a relational database such as Amazon RDS, and Amazon DynamoDB could be used to store metadata for some use-cases)
  • Amazon SageMaker for model training, hyperparameter tuning and model inference (batch or real-time endpoint)
  • AWS Glue or ECS/Fargate for extracting, validating and preparing data for training jobs
  • AWS Lambda for executing functions and acting as a trigger for retraining models
  • Amazon CloudWatch for monitoring SageMaker tuning and training jobs
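
To make this concrete, the sketch below shows roughly what the Step Functions state machine behind this batch pattern could look like in Terraform. This is an assumed shape rather than the actual module internals, and every name, ARN, image URI, and S3 path is a hypothetical placeholder.

```hcl
# A minimal sketch (assumed shape, not the actual module internals) of a
# Step Functions state machine that trains a model, validates it against a
# metric threshold, and runs batch inference. All names, ARNs, images, and
# S3 paths below are hypothetical placeholders.
resource "aws_sfn_state_machine" "mlops_pipeline" {
  name     = "mlops-training-pipeline"
  role_arn = "arn:aws:iam::123456789012:role/step-functions-role" # placeholder

  definition = jsonencode({
    StartAt = "TrainModel"
    States = {
      # Run a SageMaker training job and wait for completion (.sync)
      TrainModel = {
        Type     = "Task"
        Resource = "arn:aws:states:::sagemaker:createTrainingJob.sync"
        Parameters = {
          TrainingJobName = "attrition-training" # would be parameterised per run
          RoleArn         = "arn:aws:iam::123456789012:role/sagemaker-role"
          AlgorithmSpecification = {
            TrainingImage     = "123456789012.dkr.ecr.eu-west-1.amazonaws.com/attrition-model:latest"
            TrainingInputMode = "File"
          }
          InputDataConfig = [{
            ChannelName = "train"
            DataSource = { S3DataSource = {
              S3DataType = "S3Prefix"
              S3Uri      = "s3://my-ml-bucket/training-data/"
            } }
          }]
          OutputDataConfig  = { S3OutputPath = "s3://my-ml-bucket/models/" }
          ResourceConfig    = { InstanceCount = 1, InstanceType = "ml.m5.xlarge", VolumeSizeInGB = 30 }
          StoppingCondition = { MaxRuntimeInSeconds = 3600 }
        }
        Next = "ValidateModel"
      }
      # A Lambda function compares the evaluation metric to a threshold
      ValidateModel = {
        Type     = "Task"
        Resource = "arn:aws:lambda:eu-west-1:123456789012:function:evaluate-model"
        Next     = "DeployOrFail"
      }
      # Only proceed to inference if validation passed
      DeployOrFail = {
        Type = "Choice"
        Choices = [{
          Variable      = "$.validation_passed"
          BooleanEquals = true
          Next          = "BatchInference"
        }]
        Default = "FailPipeline"
      }
      # A batch transform job writes predictions back to S3
      BatchInference = {
        Type     = "Task"
        Resource = "arn:aws:states:::sagemaker:createTransformJob.sync"
        Parameters = {
          TransformJobName = "attrition-batch-inference"
          ModelName        = "attrition-model"
          TransformInput = {
            DataSource = { S3DataSource = {
              S3DataType = "S3Prefix"
              S3Uri      = "s3://my-ml-bucket/inference-input/"
            } }
          }
          TransformOutput    = { S3OutputPath = "s3://my-ml-bucket/inference-output/" }
          TransformResources = { InstanceCount = 1, InstanceType = "ml.m5.xlarge" }
        }
        End = true
      }
      FailPipeline = { Type = "Fail" }
    }
  })
}
```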

The architecture was built to address a particular use-case (a people analytics attrition model), but it was deliberately designed so that it can be reused generically across different ML problems. Of course, it will not meet every ML use-case's requirements, but many of the components (S3 storage; SageMaker hyperparameter tuning, training, and deployment) are key to almost all ML systems in AWS. It is intended as a starting point for establishing proper ML pipeline automation, with many re-usable elements that can form the foundation of a mature ML system.

In reality, many use-cases will also require other AWS services such as Kinesis (for streaming data) or EMR (for big data workloads), but this blueprint will fit a large number of projects. The key takeaway is that pipelines like this bypass notebooks altogether, which makes it far easier to move to production and to manage everything like a software pipeline.

Infrastructure-as-code

Infrastructure-as-code (IaC) was used to provision and configure the aforementioned AWS services needed to create an MLOps pipeline. The main benefits of using IaC here are:

  • Speed: faster time to production and faster, more efficient development (get ML models to production in days or weeks… not months)
  • Consistency: less configuration drift and issues at deployment (ensure that models meet the high standard required to be run in a production environment)
  • Cost: reduced time and effort leading to cost savings and improved ROI (less investment to get to production and more spent on operational models rather than expensive POCs)

Terraform was used for the IaC development due to the declarative code style, immutable infrastructure, and large community of users.

The MLOps Terraform module is available within the Slalom Infrastructure Catalog for DataOps deployments. A set of input variables allows the end-user to configure the deployment for their use-case, including hyperparameter ranges, the training job definition (e.g. instance size, number of training jobs), the evaluation metric, the inference type (endpoint or batch), and more. To get started, please see the usage documentation in the Infrastructure Catalog.
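
For illustration only, a deployment might be configured along the following lines. The variable names here are hypothetical and do not necessarily match the catalog module's real interface.

```hcl
# Hypothetical usage sketch: the source path and variable names are
# illustrative and may not match the catalog module's actual interface.
module "mlops_pipeline" {
  source = "./modules/mlops" # placeholder path to the catalog module

  project_name   = "employee-attrition"
  inference_type = "batch" # or "endpoint" for a real-time API

  # Training job definition
  training_instance_type = "ml.m5.xlarge"
  max_training_jobs      = 20 # total jobs for hyperparameter tuning
  evaluation_metric      = "validation:auc"

  # Hyperparameter ranges explored by SageMaker tuning
  hyperparameter_ranges = {
    eta       = { min = 0.01, max = 0.3 }
    max_depth = { min = 3, max = 10 }
  }
}
```

A single terraform apply against a configuration like this is what provisions the whole pipeline, rather than hand-configuring each service in the console.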

Sample use-case (Employee attrition model)

To show how the MLOps module can be applied to a real-world use-case, a sample module was developed which operationalises a people analytics attrition model. The model was built from the IBM HR Employee Attrition dataset on Kaggle, and the key requirements and solutions for this use-case mapped closely onto the standard architecture described above.

Not many changes to the standard architecture are required, beyond adding Amazon Athena over the inference output (for connecting Tableau for data visualisation) and using Amazon ECR to bring our own model container to SageMaker.
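
A minimal sketch of the ECR side of this, assuming a hypothetical repository name: the model container itself is built and pushed separately with Docker, and the resulting image URI is what the SageMaker jobs reference.

```hcl
# Sketch: an ECR repository to hold the bring-your-own model container.
# The image is built and pushed separately (docker build / docker push);
# SageMaker training and inference jobs then reference the image URI.
resource "aws_ecr_repository" "attrition_model" {
  name = "attrition-model" # placeholder repository name
}

output "training_image_uri" {
  # e.g. <account>.dkr.ecr.<region>.amazonaws.com/attrition-model:latest
  value = "${aws_ecr_repository.attrition_model.repository_url}:latest"
}
```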

As we are only working with a single dataset here, we have not included components for monitoring the model in production and triggering based on model decay. Instead, the retraining and redeployment steps are triggered by new data landing in S3.
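
The trigger itself is a standard S3-to-Lambda wiring. Below is a sketch with the bucket, prefix, and function names as placeholders; the Lambda handler (not shown) would call the Step Functions StartExecution API to kick off the pipeline.

```hcl
# Sketch: trigger retraining when new data lands in S3. A bucket notification
# invokes a Lambda function, which starts the Step Functions pipeline.
# Bucket, prefix, and function names are placeholders.
resource "aws_s3_bucket_notification" "new_training_data" {
  bucket = "my-ml-bucket" # placeholder bucket

  lambda_function {
    lambda_function_arn = "arn:aws:lambda:eu-west-1:123456789012:function:start-pipeline"
    events              = ["s3:ObjectCreated:*"]
    filter_prefix       = "training-data/"
  }
}

# S3 must be granted permission to invoke the Lambda function
resource "aws_lambda_permission" "allow_s3" {
  statement_id  = "AllowS3Invoke"
  action        = "lambda:InvokeFunction"
  function_name = "start-pipeline" # placeholder function name
  principal     = "s3.amazonaws.com"
  source_arn    = "arn:aws:s3:::my-ml-bucket"
}
```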

To bring this to life for end-users, we created a predictive dashboard that visualises the model output. The data connection feeding the dashboard (via a Glue Crawler and Amazon Athena) is continuously updated by the MLOps pipeline.
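
As a sketch of that connection, assuming placeholder names and paths, a Glue crawler catalogues the inference output so that Athena (and Tableau through Athena) can query it as a table:

```hcl
# Sketch: a Glue crawler catalogues the batch inference output in S3 so that
# Athena (and Tableau via Athena) can query it as a table. Names, the role
# ARN, and the S3 path are placeholders.
resource "aws_glue_catalog_database" "predictions" {
  name = "attrition_predictions"
}

resource "aws_glue_crawler" "inference_output" {
  name          = "attrition-inference-crawler"
  database_name = aws_glue_catalog_database.predictions.name
  role          = "arn:aws:iam::123456789012:role/glue-crawler-role" # placeholder

  s3_target {
    path = "s3://my-ml-bucket/inference-output/"
  }

  # Re-crawl on a schedule so new pipeline runs appear in the dashboard
  schedule = "cron(0 6 * * ? *)"
}
```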

The data and code to re-create this use-case can be found in the sample module within the Infrastructure Catalog and can be provisioned within minutes from the repo.

What this brings

The re-usable assets presented above demonstrate quick and easy provisioning of AWS services for automated model training and deployment. They can be leveraged across a range of machine learning use-cases, bringing greater levels of automation and, with it, a better chance of machine learning project success.

Next Steps

MLOps is a vast subject and, as we saw earlier, there are many elements to consider as part of a world-class ML system. For the infrastructure module we have developed, there are a few immediate enhancements worth considering.

  • Real-time prediction service: the module currently has the option to deploy the model to a SageMaker API endpoint (a Terraform sketch of the endpoint resources follows this list). To serve real-time use-cases, a service is also required which can process incoming data and call that endpoint; this applies where real-time inference is needed rather than the batch inference used for the attrition model
  • Performance monitoring: this is especially important for higher-velocity ML problems which require real-time predictions and therefore need a means of measuring production performance and model drift
  • Trigger on model decay: linked to the above, declining model performance in production would trigger the ML training pipeline
  • Metadata-driven pipeline: details of training and hyperparameter tuning (e.g. time to train, parameters used) would guide future model training configuration
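
For the first of these, the endpoint deployment itself maps onto three SageMaker resources in Terraform. The sketch below uses placeholder names, image URIs, and S3 locations:

```hcl
# Sketch: the Terraform resources behind a persistent real-time endpoint.
# The role ARN, image URI, and model artifact location are placeholders.
resource "aws_sagemaker_model" "attrition" {
  name               = "attrition-model"
  execution_role_arn = "arn:aws:iam::123456789012:role/sagemaker-role" # placeholder

  primary_container {
    image          = "123456789012.dkr.ecr.eu-west-1.amazonaws.com/attrition-model:latest"
    model_data_url = "s3://my-ml-bucket/models/model.tar.gz"
  }
}

resource "aws_sagemaker_endpoint_configuration" "attrition" {
  name = "attrition-endpoint-config"

  production_variants {
    variant_name           = "primary"
    model_name             = aws_sagemaker_model.attrition.name
    initial_instance_count = 1
    instance_type          = "ml.t2.medium"
  }
}

resource "aws_sagemaker_endpoint" "attrition" {
  name                 = "attrition-endpoint"
  endpoint_config_name = aws_sagemaker_endpoint_configuration.attrition.name
}
```

A service in front of this endpoint (the missing piece noted in the bullet) would validate and transform incoming requests before calling InvokeEndpoint.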

Other considerations include reducing risk by building additional privacy and security controls, making it easier to bring source code from experimentation into the pipeline, and incorporating the ML pipeline into a robust automated CI/CD system (level three in our maturity scale).

How we can help

MLOps is an area that organisations will increasingly have to consider as AI/ML capabilities develop. At Slalom, we are well-positioned to help clients tackle these challenges: we bring together expertise across advanced analytics, machine learning, data engineering, and cloud & DevOps. And as an AWS Premier Consulting Partner, we are experienced in getting maximum value from the AWS cloud. Please reach out to hear more!

Note:

As mentioned in part 1 of this blog series, the reasons that machine learning projects fail have as much to do with culture, people, and organisation as they do with technology. Please look out for future posts on these barriers to production by David Frigeri (Director, Slalom Philadelphia).

Thank you to Rob Sibo, AJ Steers, Andrew Garfinkel, Jon Kelley, Reuben Hilliard, and Hilary Feier.

Jack Sandom is a Data Scientist out of Slalom’s London office. He specialises in machine learning and advanced analytics and is a certified AWS machine learning specialist. Speak with Jack and other Data & Analytics practitioners at Slalom by reaching out directly or learn more at slalom.com.
