SageMaker: How to build your Machine Learning Pipeline

Manas Narkar
3 min read · Dec 27, 2018


SageMaker ML Pipeline

Background:

SageMaker is an excellent platform for model development, training, and hosting. What's even better is that you can pick and choose which of these activities you want to leverage SageMaker for. In other words, it doesn't mandate that you use the platform for all three phases. Want to bring your own model (BYOM) and use SageMaker just for training and hosting? No problem, it supports that. Have your own in-house dev/training pipelines and just want to offload hosting to SageMaker? It can do that as well! This article shows you how to leverage SageMaker and AWS platform services to build an end-to-end ML pipeline.

Problem Summary:

Typical ML workflows face the following challenges:

  • Manual orchestration of datasets and source code
  • Inputs and outputs (i.e., datasets and source code) not being versioned
  • Manual tracking and logging (i.e., auditing changes: who, when, and what?)

The goal is to build an end-to-end automated ML pipeline with the following attributes:

  • Repeatable
  • Auditable (i.e., logging and tracking)
  • Secure
  • Flexible
  • Collaborative

This architecture presents a minimum viable pipeline (MVP) that supports the above attributes and helps you get up and running fast. It leverages AWS platform services in conjunction with the SageMaker API.

Overview:

The pipeline can be triggered in the following two scenarios:

  1. Code is committed to the GitHub repo.
  2. The training dataset is updated in the S3 bucket.

The pipeline has two distinct logical stages:

  1. Build: Compile and package your custom application source code and output a Docker image.
  2. Training: Run training with that Docker image via the SageMaker API and output model artifacts to an S3 bucket.

Steps:

1. Code is committed to the source code repository by a data engineer/data scientist, OR the input dataset is updated with new data.

2. AWS CodePipeline is triggered when new code is pushed to GitHub.

3. A Lambda function is triggered whenever new training data lands in the training bucket. The Lambda function in turn initiates the AWS CodePipeline execution, as sketched below.
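
A minimal sketch of such a trigger function, assuming the standard S3 event notification payload and a pipeline named sagemaker-ml-pipeline (a placeholder name):

```python
import boto3

codepipeline = boto3.client("codepipeline")

# Placeholder name of the CodePipeline to start; replace with your own.
PIPELINE_NAME = "sagemaker-ml-pipeline"


def lambda_handler(event, context):
    # S3 sends one record per object notification; log what changed.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        print(f"New training data detected: s3://{bucket}/{key}")

    # Kick off the pipeline that runs the build and training stages.
    response = codepipeline.start_pipeline_execution(name=PIPELINE_NAME)
    print(f"Started pipeline execution {response['pipelineExecutionId']}")
    return response["pipelineExecutionId"]
```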

4. AWS CodePipeline begins execution of the build stage. AWS CodeBuild pulls the source code from the GitHub repository and builds the application package.

5. A Docker image is built with the compiled code and pushed to the AWS ECR service. This image is then available for the training stage to consume.

6. A Lambda function is triggered to launch a training job via the SageMaker API, referencing the custom Docker image (see the sketch below).
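
A rough sketch of that training-launch Lambda using boto3; the image URI, IAM role ARN, bucket names, and instance settings below are placeholders to adapt:

```python
import time
import boto3

sagemaker = boto3.client("sagemaker")


def lambda_handler(event, context):
    # Unique job name, e.g. suffixed with a timestamp.
    job_name = f"custom-training-{int(time.time())}"

    sagemaker.create_training_job(
        TrainingJobName=job_name,
        AlgorithmSpecification={
            # Custom image pushed to ECR during the build stage (placeholder URI).
            "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-model:latest",
            "TrainingInputMode": "File",
        },
        # Placeholder execution role that SageMaker assumes for the job.
        RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
        InputDataConfig=[
            {
                "ChannelName": "training",
                "DataSource": {
                    "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri": "s3://my-training-bucket/data/",  # placeholder
                        "S3DataDistributionType": "FullyReplicated",
                    }
                },
            }
        ],
        OutputDataConfig={"S3OutputPath": "s3://my-model-artifacts/output/"},  # placeholder
        ResourceConfig={
            "InstanceType": "ml.m5.xlarge",
            "InstanceCount": 1,
            "VolumeSizeInGB": 50,
        },
        StoppingCondition={"MaxRuntimeInSeconds": 3600},
    )
    return job_name
```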

7. The training job runs on SageMaker. When it finishes, it saves model artifacts to an S3 bucket, which can be retrieved as shown below.
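
The job status and artifact location can be read back with describe_training_job; the helper below is illustrative:

```python
import boto3

sagemaker = boto3.client("sagemaker")


def get_training_result(job_name):
    """Return the training job status and the S3 URI of its model artifacts."""
    desc = sagemaker.describe_training_job(TrainingJobName=job_name)
    status = desc["TrainingJobStatus"]  # InProgress | Completed | Failed | Stopped
    artifacts = desc["ModelArtifacts"]["S3ModelArtifacts"]  # e.g. s3://.../model.tar.gz
    return status, artifacts
```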

8. A Lambda function is triggered to store metadata such as the Git commit hash and the input data object version, along with the SageMaker training job status/metadata, in a DynamoDB table. DynamoDB serves as the metadata store here. Lambda is also used to send email notifications about the training job status (sketch below).
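
A sketch of that metadata/notification Lambda; the table name, SNS topic ARN, key schema, and event fields are assumptions for illustration:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
sns = boto3.client("sns")

# Placeholder resource names.
TABLE_NAME = "ml-pipeline-metadata"
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:training-notifications"


def lambda_handler(event, context):
    # In a real setup these values would be passed in from the pipeline.
    item = {
        "TrainingJobName": event["training_job_name"],  # assumed partition key
        "GitCommitHash": event["git_commit_hash"],
        "InputDataVersionId": event["input_data_version_id"],
        "TrainingJobStatus": event["training_job_status"],
    }

    # Persist run metadata so every training job is traceable.
    dynamodb.Table(TABLE_NAME).put_item(Item=item)

    # Send an email notification about the training job status.
    sns.publish(
        TopicArn=TOPIC_ARN,
        Subject=f"Training job {item['TrainingJobName']}: {item['TrainingJobStatus']}",
        Message=f"Metadata recorded: {item}",
    )
```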

And there you have it: a simple end-to-end SageMaker ML pipeline satisfying the following goals.

  • Flexible: Highly flexible for any type of custom training model
  • Collaborative: Based on a central GitHub repository; multiple data scientists and data engineers can work on it simultaneously
  • Repeatable: Pipeline activities are automated and consistently repeatable
  • Audit logs: Audit logs are saved in DynamoDB and can be used for tracking and auditing
  • Secure: Data is encrypted at rest and in transit
