Simple MLOps #1: Continuous training pipeline

Tales Marra
6 min read · Oct 7, 2023


We all know the importance of re-training a model: it keeps your model up to date with your data and maintains good performance throughout its life cycle.

However, it doesn’t have to be a complex process! In this first article of the Simple MLOps series, you’ll learn how to implement a simple continuous training pipeline!

Understanding Continuous Training Components

The basic components you need are a trigger, a data processing and training pipeline, an evaluation and deployment pipeline, and a registry.

  • Trigger: The trigger is your MLOps ignition switch. Depending on your strategy, opt for a scheduler or a database check to determine if it's time to launch training. This decision hinges on the volume of new data received, ensuring efficient resource allocation.
  • Data Processing and Training Pipeline: The core of continuous training resides here. You can choose between a container-based or serverless compute resource for the training process. This step yields your model and triggers the subsequent phase.
  • Evaluation and Deployment Pipeline: This pipeline can share the compute resource chosen for the previous step. It's imperative to employ the same environment as your inference. Here, evaluation spans infrastructure, business metrics, and classic model performance metrics.
  • The Registry: The safe haven for your models, the registry is where you store the latest iterations and keep older versions as backups.

In this article, we will guide you through implementing these components using popular tools such as Terraform, the aws-cli, and Docker.

Setting up credentials and tools

Before diving into the implementation, it’s essential to install aws-cli and set up the necessary credentials. Additionally, you need to install Terraform, a tool that provides a consistent CLI workflow for managing and provisioning infrastructure.
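
If you’re starting from scratch, the setup boils down to a few commands. A quick sketch, assuming you already have an AWS account and an access key pair:

aws --version        # verify the aws-cli installation
aws configure        # interactively stores your access key, secret key and default region
terraform -version   # verify the Terraform installation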

Creating an ECR repository to store Lambda image

Once you have everything installed and permissions set up, you can create an ECR repository where the image for the Lambda function will be stored. This is done with the following aws-cli command:

aws ecr create-repository \
    --repository-name ct-image-repo \
    --image-scanning-configuration scanOnPush=true \
    --region <your-region>

Setting up .env file

To keep your credentials out of the code, we’ll use a .env file to load environment variables. This is where you’ll store your AWS region, ECR repository URI, and function name.

AWS_REGION=(YOUR AWS REGION)
AWS_CT_ECR_REPO=(YOUR ECR REPO)
FUNCTION_NAME=ct-function
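
As a minimal sketch of how a local script could pick these variables up in Python (assuming the python-dotenv package is installed):

from dotenv import load_dotenv  # assumes python-dotenv is installed
import os

load_dotenv()  # reads the .env file into the process environment

aws_region = os.environ["AWS_REGION"]
ecr_repo = os.environ["AWS_CT_ECR_REPO"]
function_name = os.environ["FUNCTION_NAME"]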

The training function

The Lambda function will handle both the preprocessing of the data and the training of the model. For our example, we’ll train a model to predict whether someone’s insurance charges will exceed $10k. The full code can be found on GitHub; a minimal sketch follows.
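
To make the moving parts concrete, here is a rough sketch of what lambda_handler.py could look like. The bucket names match the Terraform configuration below, but the dataset columns and model choice are illustrative assumptions; see the GitHub repo for the actual implementation.

import boto3
import joblib
import pandas as pd
from sklearn.linear_model import LogisticRegression

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Fetch the latest training data (Lambda can only write under /tmp)
    s3.download_file("data-bucket-simple-ct", "insurance.csv", "/tmp/insurance.csv")
    df = pd.read_csv("/tmp/insurance.csv")

    # Preprocess: binary target (charges above 10k); numeric features only, for brevity
    y = (df["charges"] > 10_000).astype(int)
    X = df[["age", "bmi", "children"]]

    # Train and serialize the model
    model = LogisticRegression(max_iter=1000).fit(X, y)
    joblib.dump(model, "/tmp/model.joblib")

    # Push the new version to the model registry bucket
    s3.upload_file("/tmp/model.joblib", "registry-bucket-simple-ct", "model.joblib")
    return {"statusCode": 200, "body": "model trained and uploaded"}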

Packaging code and dependencies using Docker

Docker is used to package our code and the required dependencies. We’ll build a Docker image that installs our requirements and packages the code.

FROM public.ecr.aws/lambda/python:3.8

# Install the function's dependencies using file requirements.txt
# from your project folder.

COPY requirements.txt .
RUN pip3 install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"
# Copy function code to /var/task
COPY lambda_handler.py ${LAMBDA_TASK_ROOT}

# Set the CMD to your handler (could also be done as a parameter override outside of the Dockerfile)
CMD [ "lambda_handler.lambda_handler" ]
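
Before pushing anything to the cloud, you can sanity-check the image locally: AWS’s Lambda base images ship with the runtime interface emulator, so the container can be invoked over HTTP. A quick smoke test (the image name ct-image is an assumption):

docker build -t ct-image .
docker run -p 9000:8080 ct-image
# in another terminal, invoke the handler with an empty event:
curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}'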

Placing the Docker image in the cloud

To push your image to the ECR repository, you’ll have to build, tag, and push it. If you’re already familiar with Docker, the Just file recipes will make your life easier; if not, take a look at the Just file to see what each recipe runs (a sketch follows the commands below). Once that’s done, just run:

just build-ct-image
just tag-ct-image
just push-ct-image
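
If you don’t have the Just file in front of you, the three recipes boil down to the standard build/tag/push workflow. A hypothetical sketch, assuming AWS_CT_ECR_REPO in your .env holds the full repository URI (e.g. <account-id>.dkr.ecr.<region>.amazonaws.com/ct-image-repo):

set dotenv-load  # make the .env variables available to recipes

build-ct-image:
    docker build -t ct-image .

tag-ct-image:
    docker tag ct-image:latest ${AWS_CT_ECR_REPO}:latest

push-ct-image:
    # log in to the registry host (the part of the URI before the first slash), then push
    aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS --password-stdin ${AWS_CT_ECR_REPO%%/*}
    docker push ${AWS_CT_ECR_REPO}:latest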

Deploying Infrastructure

The rest of the infrastructure will be deployed using Terraform. The main.tf file is the default filename for the file that defines what Terraform will do. Let’s break it down in parts!

1. AWS Provider Configuration:

The Terraform script begins by configuring the AWS provider. Here, we specify the AWS region as eu-west-3, indicating that all resources will be created in the Europe (Paris) region.

provider "aws" {
region = "eu-west-3"
}

2. ECR Repository Declaration:

Next, we reference the ct-image-repo Amazon Elastic Container Registry (ECR) repository we created earlier, using an aws_ecr_repository data block. This is the repository that holds our container image.

data "aws_ecr_repository" "ct_image_repo" {
name = "ct-image-repo"
}

3. IAM Role and Policy for Lambda Execution:

To grant the Lambda function execution permissions, we define an IAM role named ct-role with a trust policy that allows AWS Lambda to assume this role. We also create a policy named lambda-execution-policy that permits Lambda to invoke functions.

resource "aws_iam_role" "ct_role" {
name = "ct-role"
assume_role_policy = jsonencode({
Version = "2012-10-17",
Statement = [
{
Action = "sts:AssumeRole",
Effect = "Allow",
Principal = {
Service = "lambda.amazonaws.com"
}
}
]
})
}

resource "aws_iam_policy" "lambda_execution_policy" {
name = "lambda-execution-policy"
description = "IAM policy for Lambda execution"

policy = jsonencode({
Version = "2012-10-17",
Statement = [
{
Action = "lambda:InvokeFunction",
Effect = "Allow",
Resource = aws_lambda_function.ct_function.arn
}
]
})
}

4. Lambda Function Configuration:

We configure an AWS Lambda function named ct-function with specific attributes, including a container image from the ECR repository, timeout, memory size, and the IAM role created earlier.

resource "aws_lambda_function" "ct_function" {
function_name = "ct-function"
timeout = 100 # seconds
image_uri = "${data.aws_ecr_repository.ct_image_repo.repository_url}:latest"
package_type = "Image"
memory_size = 200 # MB
role = aws_iam_role.ct_role.arn
}

5. S3 Bucket Declarations and IAM Policy for S3 Access:

We declare two Amazon S3 buckets: data_bucket and model_registry_bucket. Additionally, we create an IAM policy named s3-access-policy that grants our Lambda function read and write access to these buckets.

resource "aws_s3_bucket" "data_bucket" {
bucket = "data-bucket-simple-ct"
}

resource "aws_s3_bucket" "model_registry_bucket" {
bucket = "registry-bucket-simple-ct"
}

resource "aws_iam_policy" "s3_access_policy" {
name = "s3-access-policy"
description = "IAM policy for S3 access"

policy = jsonencode({
Version = "2012-10-17",
Statement = [
{
Action = ["s3:GetObject", "s3:PutObject"],
Effect = "Allow",
Resource = [
"${aws_s3_bucket.data_bucket.arn}/*",
"${aws_s3_bucket.model_registry_bucket.arn}/*",
],
},
],
})
}

6. IAM Role Policy Attachments:

We attach both the Lambda execution policy and the S3 access policy to the IAM role we defined earlier.

resource "aws_iam_role_policy_attachment" "lambda_execution_attachment" {
policy_arn = aws_iam_policy.lambda_execution_policy.arn
role = aws_iam_role.ct_role.name
}
resource "aws_iam_role_policy_attachment" "s3_access_attachment" {
policy_arn = aws_iam_policy.s3_access_policy.arn
role = aws_iam_role.ct_role.name
}

7. CloudWatch Event and Permission:

We set up a CloudWatch event rule that fires on a schedule (in this case, every Sunday at midnight UTC), an event target that connects the rule to our Lambda function, and a permission that allows CloudWatch Events to invoke it.

resource "aws_cloudwatch_event_rule" "lambda_schedule" {
name = "lambda-schedule-rule"
description = "Scheduled rule to trigger Lambda function"
schedule_expression = "cron(0 0 ? * SUN *)" # Adjust the cron expression for your desired schedule.
}
resource "aws_lambda_permission" "lambda_cloudwatch_permission" {
statement_id = "AllowExecutionFromCloudWatch"
action = "lambda:InvokeFunction"
function_name = aws_lambda_function.ct_function.function_name
principal = "events.amazonaws.com"
source_arn = aws_cloudwatch_event_rule.lambda_schedule.arn
}

8. Plan and Deploy:

The final step consists in planning and applying the infrastructure changes. Initialize the working directory first (terraform init is only needed once), then run the plan and apply commands, and you’ll see the magic happening!

terraform init
terraform plan && terraform apply

Checking on AWS

Now you can go to the AWS console and see your Lambda function up and running! Don’t forget to upload some data to the data bucket and adapt the Python code to your use case!
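
You don’t have to wait for Sunday to see it work; you can also trigger the function manually from the CLI (the region is assumed to match main.tf):

aws lambda invoke \
    --function-name ct-function \
    --region eu-west-3 \
    response.json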

Conclusion

Congratulations! You have just set up your continuous training pipeline and taken a new step in your MLOps journey!

The code for this tutorial is on my GitHub.

Hit that follow button to stay updated on future explorations! 🔥🤖 Let’s continue this journey of coding, machine learning, and MLOps.
