Building an MLOps pipeline on AWS - Part 1: Model training pipeline
In this blog series, I will walk through the end-to-end steps for building an enterprise MLOps pipeline.
At a high level, I will create two pipelines using CloudFormation across a two-part series:
1. Part 1: Model training pipeline
2. Part 2: Model deployment pipeline
Part 1: Creating a CloudFormation template for the ML training pipeline
In this section, we will create two CloudFormation templates that do the following:
· The first template creates AWS Step Functions for an ML model training workflow that performs data processing, model training, and model registration. This will be a component of the training pipeline.
· The second template creates a CodePipeline ML model training pipeline definition with two stages:
- A source stage, which listens for changes in a CodeCommit repository to trigger the pipeline
- A deployment stage, which kicks off the execution of the ML model training workflow we created
Now, let’s get started with the CloudFormation template for the Step Functions workflow:
1. Create a Step Functions workflow execution role called AmazonSageMaker-StepFunctionsWorkflowExecutionRole, then create and attach the following IAM policy to it. The Step Functions workflow will use this role for permission to invoke the various SageMaker APIs. Take note of the ARN of the newly created IAM role, as you will need it in the next step. If you prefer to script this, see the sketch after the policy.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "sagemaker:CreateModel",
        "sagemaker:DeleteEndpointConfig",
        "sagemaker:DescribeTrainingJob",
        "sagemaker:CreateEndpoint",
        "sagemaker:StopTrainingJob",
        "sagemaker:CreateTrainingJob",
        "sagemaker:UpdateEndpoint",
        "sagemaker:CreateEndpointConfig",
        "sagemaker:DeleteEndpoint"
      ],
      "Resource": [
        "arn:aws:sagemaker:*:*:*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "events:DescribeRule",
        "events:PutRule",
        "events:PutTargets"
      ],
      "Resource": [
        "arn:aws:events:*:*:rule/StepFunctionsGetEventsForSageMakerTrainingJobsRule"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "lambda:InvokeFunction"
      ],
      "Resource": [
        "arn:aws:lambda:*:*:function:query-training-status*"
      ]
    }
  ]
}
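If you prefer scripting over the console, the following is a minimal boto3 sketch of creating the role and attaching the policy above. The trust policy for states.amazonaws.com is standard for Step Functions; the local file name stepfunctions_policy.json and the inline policy name are assumptions for illustration:

# Sketch: create the Step Functions execution role with boto3
# (the console steps described above achieve the same result).
import json
import boto3

iam = boto3.client("iam")

# Trust policy allowing Step Functions to assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "states.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

role = iam.create_role(
    RoleName="AmazonSageMaker-StepFunctionsWorkflowExecutionRole",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# Attach the policy shown above as an inline policy
# (assumes you saved it locally as stepfunctions_policy.json)
with open("stepfunctions_policy.json") as f:
    policy_document = f.read()

iam.put_role_policy(
    RoleName="AmazonSageMaker-StepFunctionsWorkflowExecutionRole",
    PolicyName="StepFunctionsWorkflowExecutionPolicy",
    PolicyDocument=policy_document,
)

# Note the role ARN for the next step. Because the workflow passes a
# SageMaker execution role to training jobs, this role also needs
# iam:PassRole on that SageMaker role.
print(role["Role"]["Arn"])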
2. Copy and save the following code block to a local file named training_workflow.yaml. This CloudFormation template creates a Step Functions state machine with a training step and a model registration step. We are using CloudFormation here to demonstrate managing infrastructure as code (IaC); data scientists also have the option of using the Step Functions Data Science SDK to create the same workflow from a Python script (a sketch of that alternative follows the template):
AWSTemplateFormatVersion: 2010-09-09
Description: 'AWS Step Functions sample project for training a model and saving it'
Parameters:
  StepFunctionExecutionRoleArn:
    Type: String
    Description: Enter the IAM role ARN for the Step Functions workflow execution
    ConstraintDescription: requires a valid ARN value
    AllowedPattern: 'arn:aws:iam::\w+:role/.*'
Resources:
  TrainingStateMachine2:
    Type: AWS::StepFunctions::StateMachine
    Properties:
      RoleArn: !Ref StepFunctionExecutionRoleArn
      DefinitionString: !Sub |
        {
          "StartAt": "SageMaker Training Step",
          "States": {
            "SageMaker Training Step": {
              "Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
              "Parameters": {
                "AlgorithmSpecification": {
                  "TrainingImage.$": "$.TrainingImage",
                  "TrainingInputMode": "File",
                  "MetricDefinitions": [
                    {
                      "Name": "train:loss",
                      "Regex": "Average training loss: (.*?);"
                    },
                    {
                      "Name": "test:accuracy",
                      "Regex": "Test set: Accuracy: (.*?);"
                    }
                  ]
                },
                "OutputDataConfig": {
                  "S3OutputPath.$": "$.S3OutputPath"
                },
                "StoppingCondition": {
                  "MaxRuntimeInSeconds": 86400
                },
                "ResourceConfig": {
                  "InstanceCount": 1,
                  "InstanceType": "ml.g4dn.4xlarge",
                  "VolumeSizeInGB": 30
                },
                "RoleArn.$": "$.SageMakerRoleArn",
                "InputDataConfig": [
                  {
                    "DataSource": {
                      "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri.$": "$.S3UriTraining",
                        "S3DataDistributionType": "FullyReplicated"
                      }
                    },
                    "ChannelName": "training"
                  },
                  {
                    "DataSource": {
                      "S3DataSource": {
                        "S3DataType": "S3Prefix",
                        "S3Uri.$": "$.S3UriTesting",
                        "S3DataDistributionType": "FullyReplicated"
                      }
                    },
                    "ChannelName": "testing"
                  }
                ],
                "HyperParameters": {
                  "epochs": "4",
                  "lr": "5e-05",
                  "num_labels": "3",
                  "train_file": "\"train.csv\"",
                  "test_file": "\"test.csv\"",
                  "MAX_LEN": "315",
                  "batch-size": "16",
                  "test-batch-size": "10",
                  "sagemaker_submit_directory.$": "States.JsonToString($.SAGEMAKER_SUBMIT_DIRECTORY)",
                  "sagemaker_program": "\"train.py\"",
                  "sagemaker_container_log_level": "20",
                  "sagemaker_job_name": "\"berttraining\"",
                  "sagemaker_region.$": "States.JsonToString($.SAGEMAKER_REGION)"
                },
                "TrainingJobName.$": "$$.Execution.Name",
                "DebugHookConfig": {
                  "S3OutputPath.$": "$.S3OutputPath"
                }
              },
              "Type": "Task",
              "Next": "Save model"
            },
            "Save model": {
              "Parameters": {
                "ModelName.$": "$$.Execution.Name",
                "PrimaryContainer": {
                  "Image.$": "$$.Execution.Input['InferenceImage']",
                  "Environment": {
                    "SAGEMAKER_PROGRAM.$": "$$.Execution.Input['SAGEMAKER_PROGRAM']",
                    "SAGEMAKER_SUBMIT_DIRECTORY.$": "$$.Execution.Input['SAGEMAKER_SUBMIT_DIRECTORY']",
                    "SAGEMAKER_CONTAINER_LOG_LEVEL": "20",
                    "SAGEMAKER_REGION.$": "$$.Execution.Input['SAGEMAKER_REGION']"
                  },
                  "ModelDataUrl.$": "$['ModelArtifacts']['S3ModelArtifacts']"
                },
                "ExecutionRoleArn.$": "$$.Execution.Input['SageMakerRoleArn']"
              },
              "Resource": "arn:aws:states:::sagemaker:createModel",
              "Type": "Task",
              "End": true
            }
          }
        }
Outputs:
  StateMachineArn:
    Value: !Ref TrainingStateMachine2
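As mentioned above, data scientists can build an equivalent workflow in Python with the AWS Step Functions Data Science SDK instead of CloudFormation. The following is a minimal sketch of that approach, not an exact reproduction of the template: the estimator configuration, role ARNs, and S3 paths are illustrative assumptions you would replace with your own values:

# Sketch: an equivalent workflow built with the AWS Step Functions Data
# Science SDK (pip install stepfunctions). Values below are placeholders.
from sagemaker.pytorch import PyTorch
from stepfunctions.steps import Chain, ModelStep, TrainingStep
from stepfunctions.workflow import Workflow

workflow_role = "arn:aws:iam::<your aws account>:role/AmazonSageMaker-StepFunctionsWorkflowExecutionRole"
sagemaker_role = "arn:aws:iam::<your aws account>:role/service-role/<your sagemaker execution role>"

# SageMaker estimator mirroring the training configuration in the template
estimator = PyTorch(
    entry_point="train.py",
    role=sagemaker_role,
    framework_version="1.3.1",
    py_version="py3",
    instance_count=1,
    instance_type="ml.g4dn.4xlarge",
    hyperparameters={"epochs": 4, "lr": 5e-05},
)

# Training job names must be unique; in practice you would pass an
# ExecutionInput placeholder here instead of a static name.
training_step = TrainingStep(
    "SageMaker Training Step",
    estimator=estimator,
    data={"training": "s3://<your s3 bucket name>/sagemaker/pytorch-bert-financetext/train.csv"},
    job_name="berttraining",
)

# Registers the trained model, like the "Save model" state in the template
model_step = ModelStep(
    "Save model",
    model=training_step.get_expected_model(),
)

workflow = Workflow(
    name="TrainingStateMachine2",
    definition=Chain([training_step, model_step]),
    role=workflow_role,
)
workflow.create()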
3. Launch the newly created CloudFormation template in the CloudFormation console. Make sure you provide a value for the StepFunctionExecutionRoleArn parameter when prompted; this is the role ARN you noted in step 1. Once the stack creation is complete, go to the Step Functions console to test the workflow. You can also launch the stack from a script, as in the sketch below.
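A minimal boto3 sketch for launching the stack programmatically; the stack name ml-training-workflow is an illustrative assumption:

# Sketch: launch training_workflow.yaml from a script instead of the console.
import boto3

cfn = boto3.client("cloudformation")

with open("training_workflow.yaml") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="ml-training-workflow",
    TemplateBody=template_body,
    Parameters=[{
        "ParameterKey": "StepFunctionExecutionRoleArn",
        "ParameterValue": "arn:aws:iam::<your aws account>:role/AmazonSageMaker-StepFunctionsWorkflowExecutionRole",
    }],
)

# Block until the stack (and the state machine) has been created
cfn.get_waiter("stack_create_complete").wait(StackName="ml-training-workflow")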
4. Test the workflow in the Step Functions console to make sure it works. Navigate to the newly created state machine and click Start execution to kick off a run. When you are prompted for input, copy and paste the following JSON as the input for the execution. These are the input values the Step Functions workflow will use. Make sure you replace the placeholder values with the values for your environment (a scripted alternative follows the JSON):
{
  "TrainingImage": "<aws hosting account>.dkr.ecr.<aws region>.amazonaws.com/pytorch-training:1.3.1-gpu-py3",
  "S3OutputPath": "s3://<your s3 bucket name>/sagemaker/pytorch-bert-financetext",
  "SageMakerRoleArn": "arn:aws:iam::<your aws account>:role/service-role/<your sagemaker execution role>",
  "S3UriTraining": "s3://<your s3 bucket name>/sagemaker/pytorch-bert-financetext/train.csv",
  "S3UriTesting": "s3://<your s3 bucket name>/sagemaker/pytorch-bert-financetext/test.csv",
  "InferenceImage": "<aws hosting account>.dkr.ecr.<aws region>.amazonaws.com/pytorch-inference:1.3.1-cpu-py3",
  "SAGEMAKER_PROGRAM": "train.py",
  "SAGEMAKER_SUBMIT_DIRECTORY": "s3://<your s3 bucket name>/berttraining/source/sourcedir.tar.gz",
  "SAGEMAKER_REGION": "<your aws region>"
}
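You can also start the execution from a script; a minimal boto3 sketch, assuming the JSON above is saved as sf_start_params.json and the state machine ARN is taken from the stack output:

# Sketch: start the workflow programmatically with the input JSON above.
import boto3

sfn = boto3.client("stepfunctions")

with open("sf_start_params.json") as f:
    execution_input = f.read()

response = sfn.start_execution(
    stateMachineArn="arn:aws:states:<your aws region>:<your aws account>:stateMachine:<state machine name>",
    input=execution_input,
)
print(response["executionArn"])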
5. Check the execution status in the Step Functions console and make sure the model has been trained and registered correctly. Once everything has completed, save the input JSON to a file called sf_start_params.json. Launch the SageMaker Studio environment you created in Using AWS ML Services, navigate to the folder where you cloned the CodeCommit repository, and upload the sf_start_params.json file into it. Commit the change and verify the file is in the repository. We will use this file in the CodeCommit repository in the next section of the lab. If you prefer to check the status programmatically, see the sketch below.
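A minimal boto3 sketch for polling the execution status; the execution ARN placeholder is the value returned by start_execution above:

# Sketch: poll the workflow status instead of watching the console.
import time
import boto3

sfn = boto3.client("stepfunctions")
execution_arn = "<execution arn from start_execution>"  # placeholder

status = "RUNNING"
while status == "RUNNING":
    time.sleep(30)
    status = sfn.describe_execution(executionArn=execution_arn)["status"]

print(f"Workflow finished with status: {status}")  # SUCCEEDED on success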
Summary
In this first part of the series, we built the foundation of an enterprise ML training pipeline on AWS: a CloudFormation template that provisions a Step Functions workflow for model training and model registration, along with the IAM role it needs. We tested the workflow end to end in the Step Functions console and committed its input file to CodeCommit so that the CodePipeline training pipeline can use it in the next section. In Part 2, we will build the model deployment pipeline.