Stories by Tales Marra on Medium

Simple MLOps #4: Monitoring

Tales Marra — Sat, 04 Nov 2023 18:02:23 GMT

Building monitoring systems is a crucial step in MLOps. As the saying goes: “if you can’t measure it, you can’t improve it”. Monitoring, however, implies much more than that. It requires setting up vigilant systems that alert the team to errors and capture key metrics representing the health of the system.

Some benefits of implementing a monitoring system include:

1. Getting alerts when errors occur on your prediction system: Monitoring for errors in real-time ensures swift response to issues, maintaining service reliability.
2. Keeping a close eye on the performance and engineering metrics that strongly affect the user experience of the prediction service: Analyzing key metrics helps optimize performance and user satisfaction.
3. Catching data and concept drifts, which are silent killers of performance: Detecting shifts in data or model concepts is crucial for maintaining prediction accuracy over time.

And many more.

There are many tools available, but as usual, we’ll build something simple yet functional that can go to production with few adaptations.

Even though monitoring should cover the whole system, we’ll start with its most crucial part: inference. The concepts and techniques you’ll learn here transfer easily to the other parts of our MLOps stack, and I invite you to implement them yourself as a challenge.

Diving into the architecture

To cover the functionalities discussed above, we are going to rely mainly on CloudWatch. It is mostly known as the logging service of AWS, but if you play your cards right, it can be much more than that.

Monitoring System architecture

We’ll use CloudWatch for:

Log and Metric Collection: CloudWatch’s gonna be our trusty sidekick for rounding up all those logs and metrics from our system. It’s like having a watchful eye over everything that’s happening.
Alerts: CloudWatch will also be our alarm bell. It’ll ping us the moment something fishy goes down in our system. We’re talking instant notifications when things aren’t as they should be.
Dashboard Magic: We’re gonna whip up a slick monitoring dashboard with CloudWatch. This dashboard will give us real-time snapshots and data, helping us make decisions and keep an eye on how our system’s doing.

And to make sure we’re in the loop, we’re bringing in AWS Simple Notification Service (SNS). It’s the messenger of the group, following a PubSub model. In our setup:

Alerts (Publisher): The alert system is the chatterbox that’s gonna shout out about any system hiccups.
Email Addresses (Subscribers): Email addresses are the eager listeners on the topic of system errors; they’ll get all the updates when our system runs into issues.
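
To make the PubSub idea concrete, here is a minimal in-memory sketch of a topic with subscribers. It is not how SNS is implemented; the `Topic` class and the two email addresses are hypothetical and only illustrate that a publisher sends one message and every subscriber receives it.

```python
class Topic:
    """Toy publish/subscribe topic: one message in, one delivery per subscriber."""

    def __init__(self, name):
        self.name = name
        self._subscribers = []

    def subscribe(self, callback):
        # A subscriber is any callable that accepts the published message.
        self._subscribers.append(callback)

    def publish(self, message):
        # Deliver the message to every subscriber and report how many received it.
        for callback in self._subscribers:
            callback(message)
        return len(self._subscribers)


inbox = []
topic = Topic("lambda_errors_topic")
# Hypothetical addresses standing in for the email subscribers.
topic.subscribe(lambda msg: inbox.append(("ops@example.com", msg)))
topic.subscribe(lambda msg: inbox.append(("ml@example.com", msg)))
delivered = topic.publish("ALARM: inference-function reported errors")
```

The publisher never needs to know who is listening, which is why adding a new recipient later only requires a new subscription.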

Build the infrastructure

If you have been following the series so far, you already know what’s coming. If not, just a heads up. We’ll be using Terraform to set up the infrastructure for this system, but don’t worry, we’ll do it step by step.

Note: if you haven’t yet created a log group for your function and given Lambda permission to publish logs to it, refer to the logging section in Issue #3 to set it up.

Setting up the dashboard

We are developing a dashboard to track the inference duration, a critical metric that significantly impacts the user experience of our machine learning-based product. To ensure smooth operations, we plan to compute the average duration within 300-second intervals. However, feel free to customize this interval to suit your specific requirements.

# declare a lambda function that already exists
data "aws_lambda_function" "lambda" {
  function_name = "inference-function"
}

# set up the dashboard
resource "aws_cloudwatch_dashboard" "lambda_dashboard" {
    dashboard_name = "inference_monitoring_dashboard"

    dashboard_body = jsonencode({
        widgets = [
            {
                type = "metric"
                x    = 0
                y    = 0
                width = 12
                height = 6
                properties = {
                    metrics = [
                        ["AWS/Lambda", "Duration", "FunctionName", data.aws_lambda_function.lambda.function_name, { "stat": "Average", "period": 300 }],
                    ],
                    view = "timeSeries",
                    stacked = false,
                    region = "eu-west-3",
                    title = "Lambda Function Duration (ms)"
                }
            }
        ]
    })
}
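
To get a feel for what a 300-second "Average" means, the sketch below groups raw duration samples into fixed 300-second buckets and averages each one. It is only an illustration in plain Python: the function name and the sample data are made up, and CloudWatch does this aggregation for you on its side.

```python
from collections import defaultdict


def average_by_period(samples, period=300):
    """Average duration per fixed-width time bucket.

    samples: iterable of (epoch_seconds, duration_ms) pairs.
    Returns {bucket_start_epoch: mean_duration_ms}.
    """
    buckets = defaultdict(list)
    for ts, duration_ms in samples:
        # Align each timestamp to the start of its period.
        bucket_start = int(ts) - int(ts) % period
        buckets[bucket_start].append(duration_ms)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}


# Hypothetical samples: two invocations in the first period, one in the second.
samples = [(0, 100.0), (120, 300.0), (310, 50.0)]
print(average_by_period(samples))  # {0: 200.0, 300: 50.0}
```

A shorter period gives a more reactive but noisier curve, while a longer one smooths out spikes.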

Creating the alarm

Now, we’ll create a CloudWatch Alarm that monitors the “Errors” metric of the inference Lambda function. If the “Errors” metric value is greater than or equal to 1 in a 60-second period, the alarm will be triggered, and a notification will be sent to an SNS topic.

# create an alarm for the lambda function errors 
resource "aws_cloudwatch_metric_alarm" "lambda_errors_alarm" {
  alarm_name          = "lambda_errors_alarm"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  evaluation_periods  = "1"
  metric_name         = "Errors"
  namespace           = "AWS/Lambda"
  period              = "60"
  statistic           = "Sum"
  threshold           = "1"
  alarm_description   = "This metric monitors lambda errors"
  alarm_actions       = [aws_sns_topic.sns_topic.arn]
  dimensions = {
    FunctionName = data.aws_lambda_function.lambda.function_name
  }
}
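
The rule encoded by this alarm can be sketched in a few lines of Python. This is a simplified model under stated assumptions: `alarm_state` is a hypothetical helper, it receives the per-period sums of the "Errors" metric, and it ignores the finer points of how CloudWatch treats missing data.

```python
def alarm_state(error_counts, threshold=1, evaluation_periods=1):
    """Evaluate a GreaterThanOrEqualToThreshold rule on per-period error sums.

    error_counts: list of "Sum" values, one per period, oldest first.
    """
    recent = error_counts[-evaluation_periods:]
    if len(recent) < evaluation_periods:
        # Not enough datapoints yet to decide.
        return "INSUFFICIENT_DATA"
    breaching = all(count >= threshold for count in recent)
    return "ALARM" if breaching else "OK"


print(alarm_state([0, 0, 2]))  # last period had 2 errors
print(alarm_state([3, 0]))     # last period was clean
```

With `evaluation_periods = 1` a single bad minute is enough to fire, which is a sensible default for a prediction API where any error deserves a look.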

Creating an SNS topic

Now we’ll create the SNS topic that forwards the alert to the relevant email addresses.

# create an SNS topic to send the alarm to
resource "aws_sns_topic" "sns_topic" {
  name = "lambda_errors_topic"
}

Subscribing the email to the topic

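Then we need to subscribe an email address to the error topic so that it receives the alerts. Note that SNS first sends a confirmation email, and the subscription stays pending until the recipient confirms it.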

# create a subscription to the SNS topic
resource "aws_sns_topic_subscription" "sns_topic_subscription" {
  topic_arn = aws_sns_topic.sns_topic.arn
  protocol  = "email"
  endpoint  = "YOUR_MAIL@MAIL.com"
}

Deploying the infrastructure

The final step consists in planning and applying the infrastructure changes. Just use the command below, and you’ll deploy everything we have set up to AWS!

terraform plan && terraform apply

Checking the results

You can now test the different things we have set up! Modify the code for the inference to raise Exceptions and see if you receive an email.
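
One way to trigger the alarm on purpose is a temporary switch in the handler, as in the sketch below. The `force_error` field is a hypothetical addition for testing only and is not part of the original inference code; remove it once you have seen the email arrive.

```python
import json


def lambda_handler(event, context):
    payload = json.loads(event.get("body") or "{}")
    if payload.get("force_error"):
        # An unhandled exception is what increments the Lambda "Errors" metric.
        raise RuntimeError("Forced failure to test the CloudWatch alarm")
    return {"statusCode": 200, "body": json.dumps("ok")}
```

Keep in mind that returning a response with an error status code from the handler is still a successful invocation from Lambda's point of view, so it would not move the "Errors" metric.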

If you go to the CloudWatch console, you’ll be able to see both the alarm and the dashboard we have created.

Conclusion

Good job! You’ve got yourself a monitoring service ready to go to production! Your new MLOps stack project is now monitored properly!

The code for this tutorial is in my GitHub.

Hit that follow button to stay updated on future explorations! 🔥🤖 Let’s continue this journey of coding, machine learning, and MLOps.

Simple MLOps #3: Inference Pipeline

Tales Marra — Sun, 29 Oct 2023 17:49:14 GMT

In this third article of the Simple MLOps series, we’ll cover the step I imagine most of you are interested in: the inference pipeline. But don’t fool yourself: having the other steps properly done (feature/continuous training pipeline and the model registry) is crucial to ensuring the inference process goes smoothly.

Following our philosophy on the series, we’ll strive for simplicity, while building a system that can be put to production.

Without further ado, let’s go through the architecture.

Understanding the inference pipeline components

Our architecture will again be serverless: a Lambda function coupled with an API Gateway, giving us an API-based prediction service.

The architecture of the prediction API

The inference function must be able to retrieve the latest model version from the model registry versioning table and pull the corresponding model object from storage. It then reads the payload received through the API and performs the prediction on it. The result is then returned to the requester via the API.

Creating an ECR repository to store Lambda image

Like we did for the training pipeline, the first thing we need to do is to create the ECR repository to store the image for our inference function. It’s recommended that you use a different repository than the one for the training function.

aws ecr create-repository \
    --repository-name inference-image-repo \
    --image-scanning-configuration scanOnPush=true \
    --region region

Setting up .env file

To ensure the security of your credentials, we’ll use a .env file to load environment variables. We’ll add the inference-related variables to the ones already set for training, so the final file will look like this:

AWS_REGION=(YOUR AWS REGION)
AWS_CT_ECR_REPO=(YOUR ECR REPO)
CT_FUNCTION_NAME=ct-function
AWS_INF_ECR_REPO=(YOUR NEW CREATED REPOSITORY FOR INFERENCE)
INF_FUNCTION_NAME=inference-function

The inference function

We’ll now dive into the code of the inference function itself. As we discussed earlier, we’ll mostly need to implement three things.

Retrieving the latest model tag from the versioning table: We’ll use boto3 for this, performing a scan on the table and then sorting the items. Notice that this is not the most efficient approach for a large table, but for the sake of simplicity we’ll do it like this. The code that allows us to do it is here:

import json
import boto3

# the code below runs inside lambda_handler(event, context)
table_name = 'simple-registry'
# use dynamodb as the service
dynamodb = boto3.resource('dynamodb')
# get a table object
table = dynamodb.Table(table_name)
# perform a scan on it. this retrieves the rows of the table
# (a single scan call returns at most 1 MB of data)
response = table.scan()

if 'Items' in response and len(response['Items']) > 0:
    # sort by id in descending order and keep the tag of the newest entry
    tag_value = sorted(response['Items'], key=lambda x: x['id'], reverse=True)[0]['tag']
    print("Latest_tag_value: ", tag_value)
else:
    return {
        'statusCode': 404,
        'body': json.dumps('No models found')
    }
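
If the table grows, a single scan call may return only part of the items. The sketch below shows one possible way to follow pagination and pick the newest tag. `scan_all` and `latest_tag` are hypothetical helpers, and `FakeTable` is an in-memory stand-in so the example runs without AWS; with boto3 you would pass the real `Table` object, whose `scan` uses the same `LastEvaluatedKey` / `ExclusiveStartKey` pair.

```python
def scan_all(table):
    """Collect every item from a DynamoDB-style table, following pagination."""
    items, kwargs = [], {}
    while True:
        response = table.scan(**kwargs)
        items.extend(response.get("Items", []))
        last_key = response.get("LastEvaluatedKey")
        if not last_key:
            return items
        # Ask for the next page, starting where the previous one stopped.
        kwargs["ExclusiveStartKey"] = last_key


def latest_tag(items):
    """Return the 'tag' of the item with the highest 'id', or None if empty."""
    if not items:
        return None
    return max(items, key=lambda item: item["id"])["tag"]


class FakeTable:
    """In-memory stand-in for a boto3 Table that serves results page by page."""

    def __init__(self, pages):
        self._pages = pages

    def scan(self, ExclusiveStartKey=None):
        index = 0 if ExclusiveStartKey is None else ExclusiveStartKey["page"]
        response = {"Items": self._pages[index]}
        if index + 1 < len(self._pages):
            response["LastEvaluatedKey"] = {"page": index + 1}
        return response


table = FakeTable([[{"id": 1, "tag": "a"}], [{"id": 3, "tag": "c"}, {"id": 2, "tag": "b"}]])
newest = latest_tag(scan_all(table))
```

Using `max` avoids sorting the whole list when only the newest entry is needed.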

Getting the latest model object from S3: Once we are in possession of the latest model tag, we can simply pull it from S3. The code to do it is the following:

s3 = boto3.client('s3')
model = s3.get_object(Bucket='registry-bucket-simple-ct', Key=f'model_{tag_value}.pkl')
model = pickle.loads(model['Body'].read())
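
Downloading and unpickling the model on every request adds latency. Since Lambda can reuse module-level state between warm invocations, one option is a small cache keyed by the model tag, sketched below. `get_model` and `fake_fetch` are hypothetical names; in the real function `fetch_bytes` would wrap the `s3.get_object(...)['Body'].read()` call.

```python
import pickle

# Module-level cache: survives between invocations of a warm Lambda container.
_MODEL_CACHE = {}


def get_model(tag, fetch_bytes):
    """Return the unpickled model for `tag`, fetching it only on a cache miss.

    fetch_bytes: any callable taking an object key and returning bytes.
    """
    if tag not in _MODEL_CACHE:
        _MODEL_CACHE.clear()  # keep only the newest model in memory
        _MODEL_CACHE[tag] = pickle.loads(fetch_bytes(f"model_{tag}.pkl"))
    return _MODEL_CACHE[tag]


calls = []


def fake_fetch(key):
    # Stand-in for the S3 download; records which keys were requested.
    calls.append(key)
    return pickle.dumps({"weights": [1, 2, 3]})


first = get_model("abc123", fake_fetch)
second = get_model("abc123", fake_fetch)
```

As with any use of pickle, only load objects that come from a bucket you control, because unpickling can execute arbitrary code.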

Finally we call the predict method on the data. Notice the importance of using pipelines here. As we saved all the pre-processing transformations we need to apply to the data directly on the model, we don’t need to worry about anything, the model will take care of the pre-processing as well.

# read the JSON payload sent through the API
payload = json.loads(event['body'])
# the pipeline applies the pre-processing before predicting
preds = model.predict(pd.DataFrame([payload]))[0]
return {
    'statusCode': 200,
    'body': json.dumps(str(preds))
}
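
The snippet above assumes the request body is always valid. A possible hardening step, sketched here under the assumption that the model expects exactly the six fields of the insurance payload used in this series, is to validate the body first and answer with a 400 instead of letting the function crash. `parse_payload` and `REQUIRED_FIELDS` are hypothetical names.

```python
import json

# Assumed feature names, taken from the example payload of this article.
REQUIRED_FIELDS = ("age", "sex", "bmi", "children", "smoker", "region")


def parse_payload(event):
    """Return (payload, None) on success or (None, error_response) on failure."""
    try:
        payload = json.loads(event.get("body") or "")
    except json.JSONDecodeError:
        return None, {"statusCode": 400, "body": json.dumps("Body must be valid JSON")}
    if not isinstance(payload, dict):
        return None, {"statusCode": 400, "body": json.dumps("Body must be a JSON object")}
    missing = [field for field in REQUIRED_FIELDS if field not in payload]
    if missing:
        return None, {"statusCode": 400, "body": json.dumps({"missing_fields": missing})}
    return payload, None
```

Inside the handler you would call `payload, error = parse_payload(event)` and return `error` right away when it is not `None`.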

Packaging code and dependencies using Docker

Docker is used to package our code and the required dependencies. We’ll build a docker image that will install our requirements and package the code. Similarly to what we did on the training pipeline, we’ll build, tag and push the docker image to the inference image repository.

FROM public.ecr.aws/lambda/python:3.8

# Install the function's dependencies using file requirements.txt
# from your project folder.

COPY requirements.txt  .
RUN  pip3 install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"
# Copy function code to /var/task
COPY lambda_handler.py ${LAMBDA_TASK_ROOT}

# Set the CMD to your handler (could also be done as a parameter override outside of the Dockerfile)
CMD [ "lambda_handler.lambda_handler" ]

Placing the Docker image in the cloud

If you’ve followed my advice and installed Just, you can go ahead and just execute the following:

just build-inf-image
just tag-inf-image
just push-inf-image

Building the Infrastructure

We’ll now use Terraform to build the necessary infrastructure.

Building the function

Similarly to what we did on the training pipeline, we’ll declare a lambda function resource, along with a role to which we’ll add permissions later on.

# Declare the ECR repository you've created previously
data "aws_ecr_repository" "inference_image_repo" {
  name = "inference-image-repo"
}

# create a new role for the lambda function
resource "aws_iam_role" "simple_inference_role" {
  name = "simple-inference-role"

  assume_role_policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Action = "sts:AssumeRole",
        Effect = "Allow",
        Principal = {
          Service = "lambda.amazonaws.com"
        }
      }
    ]
  })
}

# declare a lambda function resource
resource "aws_lambda_function" "simple-inference" {
  function_name    = "inference-function"
  role             = aws_iam_role.simple_inference_role.arn
# notice we use the image from the repository for the lambda function
  image_uri     = "${data.aws_ecr_repository.inference_image_repo.repository_url}:latest"
  package_type  = "Image"
  timeout          = 900
  memory_size      = 128
  depends_on = [
    aws_iam_role_policy_attachment.cloudwatch_logs_attachment,
    aws_cloudwatch_log_group.simple_inference_log_group,
  ]
}

Defining the interactions

The inference function will need read permissions both for S3 and DynamoDB in order to get model objects and versions. The way to give those permissions in AWS is to create a policy, and then attach the policy to the role of the resource, in our case, the role of the inference function we have just declared.


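# declare the registry table and bucket created in the previous articles
data "aws_dynamodb_table" "simple_registry" {
  name = "simple-registry"
}

data "aws_s3_bucket" "model_registry_bucket" {
  bucket = "registry-bucket-simple-ct"
}

# create a policy to read from the dynamodb table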
resource "aws_iam_policy" "dynamodb_access_policy_inference" {
  name        = "dynamodb-access-policy-inference"
  description = "IAM policy for DynamoDB access"

  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Action = [
          "dynamodb:GetItem",
          "dynamodb:Scan",
          "dynamodb:Query"
        ],
        Effect   = "Allow",
        Resource = data.aws_dynamodb_table.simple_registry.arn,
      },
    ],
  })
}

# attach the policy to the role
resource "aws_iam_role_policy_attachment" "dynamodb_access_attachment" {
  policy_arn = aws_iam_policy.dynamodb_access_policy_inference.arn
  role       = aws_iam_role.simple_inference_role.name
}

# create a policy to read from the s3 bucket
resource "aws_iam_policy" "s3_access_policy_inf" {
  name        = "s3-access-policy-inf"
  description = "IAM policy for S3 access for inference"

  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Action   = ["s3:GetObject"],
        Effect   = "Allow",
        Resource = [
          "${data.aws_s3_bucket.model_registry_bucket.arn}/*",
        ],
      },
    ],
  })
}

# attach the policy to the role
resource "aws_iam_role_policy_attachment" "s3_access_attachment" {
  policy_arn = aws_iam_policy.s3_access_policy_inf.arn
  role       = aws_iam_role.simple_inference_role.name
}

Creating the API

Now we need to set up the API and the route used to call our function. The API therefore needs permission to invoke Lambda on our behalf. The following code does exactly that.

# create an HTTP API gateway for the lambda function
resource "aws_apigatewayv2_api" "simple_inference_api" {
  name          = "simple-inference-api"
  protocol_type = "HTTP"
}

resource "aws_lambda_permission" "apigw_lambda_permission" {
  statement_id  = "AllowAPIGatewayInvoke"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.simple-inference.function_name
  principal     = "apigateway.amazonaws.com"
  source_arn    = "${aws_apigatewayv2_api.simple_inference_api.execution_arn}/*/*"
}

resource "aws_apigatewayv2_integration" "simple_inference_integration" {
  api_id            = aws_apigatewayv2_api.simple_inference_api.id
  integration_type  = "AWS_PROXY"
  integration_uri   = aws_lambda_function.simple-inference.invoke_arn
  integration_method = "POST"
}

resource "aws_apigatewayv2_stage" "simple_inference_stage" {
  api_id      = aws_apigatewayv2_api.simple_inference_api.id
  name        = "simple-inference-stage"
  auto_deploy = true
}

We’ll now add a new route to the API that will be attached to the function. Notice that we add it on POST, so that we can send a payload along with it.

resource "aws_apigatewayv2_route" "simple_inference_route" {
  api_id    = aws_apigatewayv2_api.simple_inference_api.id
  route_key = "POST /inference"
  target    = "integrations/${aws_apigatewayv2_integration.simple_inference_integration.id}"
}

Logging

We’ll now add CloudWatch logging to our function, which is useful both for debugging and for more advanced monitoring, something we’ll cover in the next article of the series.

# create a cloudwatch log group for the lambda function
resource "aws_cloudwatch_log_group" "simple_inference_log_group" {
  name              = "/aws/lambda/inference-function"
  retention_in_days = 7
}


# create a policy that allows writing to CloudWatch Logs
resource "aws_iam_policy" "cloudwatch_logs_policy" {
  name        = "cloudwatch-logs-policy"
  description = "IAM policy for CloudWatch Logs access"

  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Action   = ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
        Effect   = "Allow",
        Resource = ["arn:aws:logs:*:*:*"]
      },
    ],
  })
}

# attach the policy to the role
resource "aws_iam_role_policy_attachment" "cloudwatch_logs_attachment" {
  policy_arn = aws_iam_policy.cloudwatch_logs_policy.arn
  role       = aws_iam_role.simple_inference_role.name
}

Deploying the infrastructure

The final step consists in planning and applying the infrastructure changes. Just use the command below, and you’ll deploy everything we have set up to AWS!

terraform plan && terraform apply

Or, just deploy your infra with:

just deploy-inference

Testing the prediction service

Everything looks good! Now let’s test the prediction service we have created! Go to AWS, grab the endpoint of your API, and call it using cURL or Postman!

You can use this payload:

{
    "age": 19,
    "sex": "female",
    "bmi": 27.9,
    "children": 0,
    "smoker": "yes",
    "region": "southwest"
}
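
If you prefer Python over cURL or Postman, the sketch below builds the same POST request with the standard library. The base URL is a placeholder that you need to replace with the invoke URL of your own stage, and `build_inference_request` is a hypothetical helper; the request is only built here, not sent.

```python
import json
import urllib.request


def build_inference_request(base_url, payload):
    """Build (but do not send) a POST request for the /inference route."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/inference",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


payload = {
    "age": 19,
    "sex": "female",
    "bmi": 27.9,
    "children": 0,
    "smoker": "yes",
    "region": "southwest",
}
# Placeholder URL: replace it with the invoke URL of your own API stage.
request = build_inference_request(
    "https://example.execute-api.eu-west-3.amazonaws.com/simple-inference-stage",
    payload,
)
# To actually call the API: urllib.request.urlopen(request).read()
```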

Conclusion

And BAM! You’ve got yourself a prediction pipeline, coupled with a continuous training pipeline and a model registry! This is already a great project to showcase in your portfolio. And we are going to make it even better!

The code for this tutorial is in my GitHub.


Simple MLOps #2: Model Registry

Tales Marra — Sun, 15 Oct 2023 18:11:28 GMT

Continuing our Simple MLOps series, we’ll now explore the model registry.

But what is a model registry?

The model registry is the place where the models live once they’re trained, and can be retrieved at inference time to perform predictions. The main functionalities a model registry must have are:

Capacity to retrieve the latest model: Once the continuous training pipeline has deployed a new model into production, the inference pipeline must be able to automatically recognize and retrieve this new version.
Capacity to roll back: If a version is judged flawed but is already in production, the engineering team must be able to remove that version from production and return to a previous state as quickly as possible.
Capacity to package metadata with the model, including details about training and evaluation metrics, compatible code versions, etc.
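
These three capacities can be illustrated with a tiny in-memory registry. This is a conceptual sketch only: the `InMemoryRegistry` class, the tags and the metric values are made up.

```python
class InMemoryRegistry:
    """Toy registry: keeps versions in publication order, newest last."""

    def __init__(self):
        self._versions = []

    def publish(self, tag, metadata=None):
        # Store the version together with its metadata.
        self._versions.append({"tag": tag, "metadata": metadata or {}})

    def latest(self):
        # The version that inference should load, or None if nothing was published.
        return self._versions[-1] if self._versions else None

    def rollback(self):
        """Drop the newest version and return the one that is now in production."""
        if len(self._versions) < 2:
            raise ValueError("No previous version to roll back to")
        self._versions.pop()
        return self.latest()


registry = InMemoryRegistry()
registry.publish("v1", {"f1_score": 0.81})
registry.publish("v2", {"f1_score": 0.62})  # hypothetical flawed version
current = registry.rollback()
```

After the rollback, `registry.latest()` points at `v1` again, which is exactly what the inference side needs to see.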

And again, this doesn’t have to be complicated! Starting simple and building up from there always gets the best results. And that’s what we’ll be doing today: implementing a simple model registry that already has those capacities.

Understanding the registry components

From the functionalities we saw earlier, we can deduce that we are going to need mainly two things:

Object Storage: the place where the model objects are going to be stored. For this we are going to use AWS S3, a managed cloud object storage;
Version and metadata table: this is where the inference pipeline will search for the latest trained model, and where you can also store relevant information about the training_date, the code version compatible with that model, metrics and many more. For this, we’ll use a DynamoDB table.

Our architecture will therefore look like this:

Creating the object storage

Starting on the repository, you can create a new main.tf file.

If you already deployed the continuous-training pipeline we created in the first article of the series, your bucket to store models already exists so no need to re-create one.

Note: if you’re implementing only the registry, you can create your object storage by doing:

resource "aws_s3_bucket" "model_registry_bucket" {
   bucket = "registry-bucket-simple-ct"
}

Creating the versioning table

In our design, the versioning table will also include metadata; however, there are other design choices available:

creating a dedicated table to store metadata;
packaging metadata along with model objects and storing them on object storage;
storing the metadata on the object storage, separate from the model.

The choice will mostly depend on how easily your monitoring/dashboard solutions can query and analyze that data later on.

We’ll use AWS DynamoDB, a NoSQL managed database service to do it.

To create the table, we’ll declare a table resource along with its key attributes and their types. We also create a secondary index on the tag and the evaluation metrics.

resource "aws_dynamodb_table" "simple-registry" {
  name     = "simple-registry"
  hash_key = "id"
  range_key = "published_at"
  billing_mode = "PROVISIONED"
  read_capacity = 1
  write_capacity = 1
  attribute {
    name = "id"
    type = "N"
  }
  attribute {
    name = "published_at"
    type = "S"
  }
  attribute {
    name = "tag"
    type = "S"
  }
  attribute {
    name = "evaluation_metrics"
    type = "S"
  }
  global_secondary_index {
    name            = "tag-index"
    hash_key        = "tag"
    range_key       = "evaluation_metrics"
    projection_type = "ALL"
    read_capacity   = 1
    write_capacity  = 1
  }
}

Plugging into our stack

The registry needs to interact both with the training and the inference pipeline.

At training time, the training pipeline pushes the model object to the object storage, and also writes a new entry to the versioning table pointing to the updated version.

In turn, the inference pipeline (which we will develop in the next article) will read the versioning table to find the latest version of the model and then load it from the object storage to perform inference on newly received data.

Setting training pipeline permissions

In the first article of the series, we already gave permissions for our training pipeline to write to the object storage, so no action needed!

If you didn’t, you can give permissions by attaching a policy to the role of your training pipeline:

data "aws_iam_role" "ct_role" {
  name = "ct-role"
} 

# define a policy for the Lambda function to write to the model registry bucket
resource "aws_iam_policy" "s3_access_policy" {
  name        = "s3-access-policy"
  description = "IAM policy for S3 access"
  
  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Action   = ["s3:PutObject"],
        Effect   = "Allow",
        Resource = [
          "${aws_s3_bucket.model_registry_bucket.arn}/*",
        ],
      },
    ],
  })
}

# Attach the S3 access policy to the IAM role
resource "aws_iam_role_policy_attachment" "s3_access_attachment" {
  policy_arn = aws_iam_policy.s3_access_policy.arn
  role       = data.aws_iam_role.ct_role.name
}

All we need to do now is set the permissions for the training pipeline to write to the versioning table, like this (if you already declared the ct_role data source above, don’t declare it a second time):

data "aws_iam_role" "ct_role" {
  name = "ct-role"
} 

# add policy to dynamodb to allow the lambda function to write to it
resource "aws_iam_policy" "dynamodb_access_policy" {
  name        = "dynamodb-access-policy"
  description = "IAM policy for DynamoDB access"

  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Action = [
          "dynamodb:PutItem",
          "dynamodb:Scan",
        ],
        Effect   = "Allow",
        Resource = aws_dynamodb_table.simple-registry.arn,
      },
    ],
  })
}

# attach the policy to the lambda execution role
resource "aws_iam_role_policy_attachment" "dynamodb_access_attachment" {
  policy_arn = aws_iam_policy.dynamodb_access_policy.arn
  role       = data.aws_iam_role.ct_role.name
}

Deploying the infrastructure

The final step consists in planning and applying the infrastructure changes. Just use the command below, and you’ll deploy everything we have set up to AWS!

terraform plan && terraform apply

Or if you have become a Just fan and would like to use it more, insert this recipe into your justfile, located outside of the training and registry folders.

# ---------------------------
# Registry Recipes
# ---------------------------
deploy-registry:
    cd simple-registry && terraform plan && terraform apply -auto-approve

Then execute:

just deploy-registry

Updating the training function code to write to the versioning table

Now, all that’s left is to modify the Python code of our Lambda training function to write to the versioning table. Much like the S3 case, we’ll use boto3 but with dynamodb instead of S3.

Just insert the following code right after the uploading of the model to S3:

    # after deploying the registry you can write to dynamodb table
    dynamodb = boto3.resource('dynamodb')
    table = dynamodb.Table('simple-registry')
    # take a single timestamp so that id and published_at agree
    now = pd.Timestamp.now()
    # insert the item with the fields id, published_at, tag, and evaluation metrics
    table.put_item(Item={
        'id': int(now.timestamp()),
        'published_at': now.isoformat(),
        'tag': model_sha,
        # same attribute name as the range key of the tag-index
        'evaluation_metrics': json.dumps({
            'f1_score': f1
        })})
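
Building the item in a small pure function makes it easy to check before anything is written to DynamoDB. The sketch below is one way to do it with the standard library. `build_registry_item` is a hypothetical helper, and the `evaluation_metrics` key is chosen to match the attribute declared in the table definition. The metrics are stored as a JSON string, which also sidesteps the fact that the boto3 resource API does not accept Python floats directly.

```python
import json
from datetime import datetime, timezone


def build_registry_item(model_sha, f1, now=None):
    """Return the item to store for a newly trained model."""
    # Take one timestamp so that id and published_at describe the same instant.
    now = now or datetime.now(timezone.utc)
    return {
        "id": int(now.timestamp()),
        "published_at": now.isoformat(),
        "tag": model_sha,
        "evaluation_metrics": json.dumps({"f1_score": float(f1)}),
    }


fixed = datetime(2023, 10, 15, tzinfo=timezone.utc)
item = build_registry_item("abc123", 0.9, now=fixed)
```

The resulting dictionary can then be passed as is to `table.put_item(Item=item)`.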

⚠️ Don’t forget to re-build, tag and push the new image of your continuous training pipeline to ECR, and to update the Lambda with the new image using the Just commands we saw previously.

Checking on AWS

Now you can go to AWS and re-run your lambda function. Once it’s completed, check out the DynamoDB table to check the new entries!

Conclusion

Congratulations! You have just set up your model registry and made a big new step in your MLOps journey!

The code for this tutorial is in my GitHub.


Simple MLOps #1: Continuous training pipeline

Tales Marra — Sat, 07 Oct 2023 16:47:30 GMT

We all know the importance of re-training a model. It ensures that your model stays up to date with your data and keeps performing well throughout its life cycle.

However, it doesn’t have to be a complex process! In this first article of the Simple MLOps Series, you’ll learn how to simply implement a continuous training pipeline!

Understanding Continuous Training Components

The basic components you need are a trigger, a pipeline for data processing and training, evaluation and deployment, and a registry.

Trigger: The trigger is your MLOps ignition switch. Depending on your strategy, opt for a scheduler or a database check to determine if it's time to launch training. This decision hinges on the volume of new data received, ensuring efficient resource allocation.
Data Processing and Training Pipeline: The core of continuous training resides here. You can choose between a container-based or serverless compute resource for the training process. This step yields your model and triggers the subsequent phase.
Evaluation and Deployment Pipeline: This pipeline can share the compute resource chosen for the previous step. It's imperative to employ the same environment as your inference. Here, evaluation spans infrastructure, business metrics, and classic model performance metrics.
The Registry: The safe haven for your models, the registry is where you store the latest iterations and maintain older versions as backups.
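
As an illustration of the trigger logic, the sketch below combines a data-volume check with a maximum model age. The function name and both thresholds are hypothetical; pick values that match how quickly your own data changes.

```python
def should_retrain(new_rows, days_since_last_training, min_new_rows=1000, max_days=30):
    """Decide whether to launch training.

    Retrain when enough new data has arrived, or when the model is simply too old.
    """
    enough_new_data = new_rows >= min_new_rows
    model_is_stale = days_since_last_training >= max_days
    return enough_new_data or model_is_stale


print(should_retrain(new_rows=1500, days_since_last_training=2))
print(should_retrain(new_rows=10, days_since_last_training=3))
```

A scheduler would call this check periodically and only start the training pipeline when it returns `True`.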

In this article, we will guide you on how to implement these components using popular frameworks such as Terraform, aws-cli, and Docker.

Setting up credentials and tools

Before diving into the implementation, it’s essential to install aws-cli and set up the necessary credentials. Additionally, you need to install Terraform, a tool that provides a consistent CLI workflow for managing and provisioning infrastructure.

Creating an ECR repository to store Lambda image

Once you have everything installed and permissions set up, you can create an ECR repository where the image for the lambda function will be stored. This can be done with the aws-cli command below (replace region with your own AWS region, for example eu-west-3).

aws ecr create-repository \
    --repository-name ct-image-repo \
    --image-scanning-configuration scanOnPush=true \
    --region region

Setting up .env file

To ensure the security of your credentials, we’ll use .env to load environment variables. This is where you’ll store your AWS region, ECR repo, and function name.

AWS_REGION=(YOUR AWS REGION)
AWS_CT_ECR_REPO=(YOUR ECR REPO)
FUNCTION_NAME=ct-function
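
If you ever need these values from a Python script, a .env file is simple enough to read by hand. The sketch below is a minimal parser under stated assumptions: `parse_env` is a hypothetical helper that only handles plain `KEY=VALUE` lines, comments and optional quotes, which is less than dedicated .env loaders support.

```python
def parse_env(text):
    """Parse KEY=VALUE lines into a dict, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


example = """
# training settings
AWS_REGION=eu-west-3
FUNCTION_NAME=ct-function
"""
settings = parse_env(example)
```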

The training function

The Lambda function will handle both the preprocessing of the data and the training of the model. For our example, we’ll train a model to predict if the insurance charges of someone will exceed 10k. The code for this can be found on GitHub.

Packaging code and dependencies using Docker

Docker is used to package our code and the required dependencies. We’ll build a docker image that will install our requirements and package the code.

FROM public.ecr.aws/lambda/python:3.8

# Install the function's dependencies using file requirements.txt
# from your project folder.

COPY requirements.txt  .
RUN  pip3 install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"
# Copy function code to /var/task
COPY lambda_handler.py ${LAMBDA_TASK_ROOT}

# Set the CMD to your handler (could also be done as a parameter override outside of the Dockerfile)
CMD [ "lambda_handler.lambda_handler" ]

Placing the Docker image in the cloud

To push your image to the ECR repository, you’ll have to build, tag and push it. You can use your Just file recipes to make your life easier if you’re already familiar with Docker. If not, I invite you to go to the Just file and familiarize yourself with them. Once you’ve done it, just run:

just build-ct-image
just tag-ct-image
just push-ct-image

Deploying Infrastructure

The rest of the infrastructure will be deployed using Terraform. main.tf is the conventional name of the file that defines what Terraform will do. Let’s break it down into parts!

1. AWS Provider Configuration:

The Terraform script begins by configuring the AWS provider. Here, we specify the AWS region as eu-west-3, indicating that all resources will be created in the European Union (Paris) region.

provider "aws" {
 region = "eu-west-3"
}

2. ECR Repository Declaration:

Next, we declare an Amazon Elastic Container Registry (ECR) repository named ct-image-repo using the aws_ecr_repository data block. This repository will store our container images.

data "aws_ecr_repository" "ct_image_repo" {
 name = "ct-image-repo"
}

3. IAM Role and Policy for Lambda Execution:

To grant the Lambda function execution permissions, we define an IAM role named ct-role with a trust policy that allows AWS Lambda to assume this role. We also create a policy named lambda-execution-policy that allows invoking ct-function.

resource "aws_iam_role" "ct_role" {
  name = "ct-role"
  assume_role_policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Action = "sts:AssumeRole",
        Effect = "Allow",
        Principal = {
          Service = "lambda.amazonaws.com"
        }
      }
    ]
  })
}

resource "aws_iam_policy" "lambda_execution_policy" {
  name        = "lambda-execution-policy"
  description = "IAM policy for Lambda execution"

  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Action   = "lambda:InvokeFunction",
        Effect   = "Allow",
        Resource = aws_lambda_function.ct_function.arn
      }
    ]
  })
}

4. Lambda Function Configuration:

We configure an AWS Lambda function named ct-function with specific attributes, including a container image from the ECR repository, timeout, memory size, and the IAM role created earlier.

resource "aws_lambda_function" "ct_function" {
 function_name = "ct-function"
 timeout = 100 # seconds
 image_uri = "${data.aws_ecr_repository.ct_image_repo.repository_url}:latest"
 package_type = "Image"
 memory_size = 200 # MB
 role = aws_iam_role.ct_role.arn
}

5. S3 Bucket Declarations and IAM Policy for S3 Access:

We declare two Amazon S3 buckets: data_bucket and model_registry_bucket. Additionally, we create an IAM policy named s3-access-policy that grants our Lambda function read and write access to these buckets.

resource "aws_s3_bucket" "data_bucket" {
  bucket = "data-bucket-simple-ct"
}

resource "aws_s3_bucket" "model_registry_bucket" {
  bucket = "registry-bucket-simple-ct"
}

resource "aws_iam_policy" "s3_access_policy" {
  name        = "s3-access-policy"
  description = "IAM policy for S3 access"

  policy = jsonencode({
    Version = "2012-10-17",
    Statement = [
      {
        Action   = ["s3:GetObject", "s3:PutObject"],
        Effect   = "Allow",
        Resource = [
          "${aws_s3_bucket.data_bucket.arn}/*",
          "${aws_s3_bucket.model_registry_bucket.arn}/*",
        ],
      },
    ],
  })
}

6. IAM Role Policy Attachments:

We attach both the Lambda execution policy and the S3 access policy to the IAM role we defined earlier.

resource "aws_iam_role_policy_attachment" "lambda_execution_attachment" {
 policy_arn = aws_iam_policy.lambda_execution_policy.arn
 role = aws_iam_role.ct_role.name
}
resource "aws_iam_role_policy_attachment" "s3_access_attachment" {
 policy_arn = aws_iam_policy.s3_access_policy.arn
 role = aws_iam_role.ct_role.name
}

7. CloudWatch Event and Permission:

We set up a CloudWatch event rule to trigger our Lambda function on a schedule (in this case, every Sunday at midnight UTC). We also grant permission to CloudWatch Events to invoke the Lambda function.

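resource "aws_cloudwatch_event_rule" "lambda_schedule" {
 name = "lambda-schedule-rule"
 description = "Scheduled rule to trigger Lambda function"
 schedule_expression = "cron(0 0 ? * SUN *)" # Adjust the cron expression for your desired schedule.
}
# The rule needs a target: without one it fires but invokes nothing.
# (target_id is an arbitrary label chosen here.)
resource "aws_cloudwatch_event_target" "lambda_target" {
 rule = aws_cloudwatch_event_rule.lambda_schedule.name
 target_id = "ct-function-target"
 arn = aws_lambda_function.ct_function.arn
}
resource "aws_lambda_permission" "lambda_cloudwatch_permission" {
 statement_id = "AllowExecutionFromCloudWatch"
 action = "lambda:InvokeFunction"
 function_name = aws_lambda_function.ct_function.function_name
 principal = "events.amazonaws.com"
 source_arn = aws_cloudwatch_event_rule.lambda_schedule.arn
}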
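
If you want to sanity-check what that schedule expression means before deploying, here is a small Python sketch. It is not part of the project code, and the function name next_sunday_midnight_utc is made up for illustration; it simply computes the next Sunday at 00:00 UTC after a given moment, which is when cron(0 0 ? * SUN *) would fire.

```python
from datetime import datetime, timedelta, timezone


def next_sunday_midnight_utc(now: datetime) -> datetime:
    """Return the first Sunday 00:00 UTC strictly after `now`."""
    now = now.astimezone(timezone.utc)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    # datetime.weekday(): Monday is 0, Sunday is 6
    days_ahead = (6 - midnight.weekday()) % 7
    candidate = midnight + timedelta(days=days_ahead)
    if candidate <= now:
        # We are already on (or exactly at) this Sunday's midnight: take next week
        candidate += timedelta(days=7)
    return candidate


# Saturday 4 Nov 2023, 18:02 UTC: the next run is Sunday 5 Nov at 00:00 UTC
example = next_sunday_midnight_utc(datetime(2023, 11, 4, 18, 2, tzinfo=timezone.utc))
print(example.isoformat())
```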

8. Plan and Deploy:

The final step consists of planning and applying the infrastructure changes. If you haven’t initialized the working directory yet, run terraform init first. Then just use the command below, and you’ll see the magic happening!

terraform plan && terraform apply

Checking on AWS

Now you can go to AWS and see your Lambda function up and running! Don’t forget to put some data on the bucket and adapt the Python code to your use case!

Conclusion

Congratulations! You have just set up your continuous training pipeline and made a new step in your MLOps journey!

The code for this tutorial is in my GitHub.

Hit that follow button to stay updated on future explorations! 🔥🤖 Let’s continue this journey of coding, machine learning, and MLOps.

Create Your Own Data Analysis Assistant with ChatGPT

Tales Marra — Sun, 03 Sep 2023 17:01:20 GMT

Photo by bady abbas on Unsplash

Large Language Models (LLMs) are currently taking center stage in the AI realm. The versatility they offer in application development is unparalleled.

One intriguing use-case? Crafting a personal assistant tailored for data analysis. In this guide, we’ll explore how to build an application that lets you upload a CSV file and inquire about its content, using everyday language. Let’s dive in!

Getting Started with Dependencies

Setting up Poetry
Poetry is our go-to for managing project dependencies. It allows you to install, remove, and package your dependencies quite easily! If you haven’t got it on your system yet, simply refer to the official installation documentation.

Integrating Required Libraries
Once Poetry is installed, you’ll first need to initialize it in the repository. This will create a pyproject.toml file for your project, where your dependencies will later appear.

You can do it with:

poetry init

Then incorporate the necessary libraries with:

poetry add openai pandas streamlit langchain python-dotenv tabulate

API Key Configuration

To interact with OpenAI’s API, first sign up for an account on their platform. Once registered, navigate to your user profile and generate a new API key within the account settings.
For security, store your OPENAI_API_KEY in a .env file and retrieve it within your application.
⚠️ Important: If you plan to commit your project to a repository, make sure to add your `.env` file to the `.gitignore` to keep your API key confidential.

from dotenv import load_dotenv, find_dotenv
import os
_ = load_dotenv(find_dotenv())
if not os.environ.get('OPENAI_API_KEY'):
 raise ValueError("Ensure your API key is set!")

An Overview of LangChain

LangChain Library logo. Source: LangChain

What makes LangChain stand out? It furnishes developers with the flexibility to tap into the capabilities of LLMs.

LangChain is a powerful library designed for interacting with LLMs. Through prompt chaining, it allows you to build applications that are:

data-aware: connecting the LLM with other sources of data such as APIs, datasets and many more;
agentic: allowing a language model to interact with its environment, for instance by manipulating data to achieve some result.

For our project, we’ll exploit the Agent abstraction. Agents use language models as a reasoning engine that decides which actions to take and in what order. They cycle through the following phases until they reach a solution:

Thought: Articulate a goal.
Action/Input: Execute an action aligned with the goal, in our scenario, it involves manipulating the data through Pandas to retrieve an answer.
Observation: Evaluate the outcome and go through the loop again if unsatisfactory.
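
To make the cycle concrete, here is a toy sketch in plain Python. It is not how LangChain implements agents: the "policy" below is a hard-coded stand-in for the LLM, and the names rows, TOOLS, scripted_policy and run_agent are all made up for illustration. It only shows the shape of the Thought, Action/Input, Observation loop.

```python
# Made-up sample data standing in for the uploaded CSV
rows = [
    {"city": "Paris", "sales": 120},
    {"city": "Lyon", "sales": 80},
    {"city": "Paris", "sales": 50},
]

# The actions the "agent" is allowed to take on the data
TOOLS = {
    "count_rows": lambda data, _arg: len(data),
    "sum_column": lambda data, column: sum(r[column] for r in data),
}


def scripted_policy(question, observations):
    """Stand-in for the language model: pick the next step from what was observed so far."""
    if not observations:
        return "I need the total of the sales column.", "sum_column", "sales"
    return f"I now know the answer: {observations[-1]}", None, None


def run_agent(question, data, max_steps=5):
    observations = []
    for _ in range(max_steps):
        thought, action, action_input = scripted_policy(question, observations)
        print("Thought:", thought)
        if action is None:  # the policy decided it has an answer
            return observations[-1]
        print(f"Action: {action}({action_input!r})")
        observation = TOOLS[action](data, action_input)
        print("Observation:", observation)
        observations.append(observation)
    raise RuntimeError("no answer within max_steps")


answer = run_agent("What are the total sales?", rows)
```

In the real agent, the language model plays the role of scripted_policy and the action is Pandas code executed against your DataFrame.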

You can engage with your agent by:

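import pandas as pd
from langchain.agents import create_pandas_dataframe_agent
from langchain.llms import OpenAI
# "your_data.csv" is a placeholder; point it at the CSV you want to query
df = pd.read_csv("your_data.csv")
pd_agent = create_pandas_dataframe_agent(OpenAI(temperature=0), df, verbose=True)
pd_agent.run("YOUR DATA QUERY")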

Here we leverage LangChain capabilities to import an external data source (the .csv file) into the LLM environment, making it available as the source of information.

If you execute that code, you’ll notice the cyclic phases we described earlier.

Crafting the User Interface with Streamlit

A seamless interface is a crucial factor in enhancing the user experience of any software application. Streamlit is a Python library specifically designed for creating interactive and data-driven web applications with minimal effort.

We’ll be mainly using the following components:

file_uploader: allows the user to input their CSV for analysis;
text_input: allows the user to input the question;
write: allows the system to show the output to the user;

Here’s a snapshot of what our app.py will look like:

import os

import openai
import pandas as pd
import streamlit as st
from dotenv import load_dotenv, find_dotenv
from langchain.agents import create_pandas_dataframe_agent
from langchain.llms import OpenAI

# Load the API key from the .env file and stop early if it is missing
_ = load_dotenv(find_dotenv())
if not os.environ.get('OPENAI_API_KEY'):
    st.error("Add OPENAI_API_KEY to your .env file")
    st.stop()
openai.api_key = os.environ['OPENAI_API_KEY']

# Launching the application
st.title('Your AI-Powered Data Analyst')
uploaded_file = st.file_uploader("Upload your CSV")
if uploaded_file:
    df = pd.read_csv(uploaded_file)
    st.write("Preview of your dataset:")
    st.write(df.head(2))
    st.write("Pose your query:")
    question = st.text_input("Query:")
    if question:
        # Build the agent on the uploaded data and answer the question
        pd_agent = create_pandas_dataframe_agent(OpenAI(temperature=0), df, verbose=True)
        with st.spinner('Crunching the numbers…'):
            response = pd_agent.run(question)
        st.write(response)

Automate with Makefile

What’s a Makefile? It’s a blueprint that details the sequence of instructions for compiling and constructing software projects.

A Makefile is a text file used in software development to define a set of instructions, called rules, that describe how a program or a set of files should be built, compiled, and linked. The purpose of a Makefile is to automate the compilation and building process of software projects by specifying the dependencies and commands necessary to produce the final executable or output.

It really helps to wrap long commands in short targets, so you don’t have to remember and retype them over and over again.

Here’s a basic Makefile for our application:

run:
	poetry install && poetry run streamlit run app.py

To kick things off, simply enter make run in your terminal and see the magic happening!

Here’s what the app looks like

Kudos! 🚀 You’ve now developed an application harnessing the might of large language models. The whole code for this app is available on Github.


So, what is actually MLOps?

Tales Marra — Thu, 22 Sep 2022 20:06:59 GMT

According to Introducing MLOps:

“MLOps is the standardization and streamlining of machine learning life cycle management.”

This is a particularly interesting definition, because breaking it apart allows us to find some key concepts in understanding what it implies in practice:

1. Standardization: Machine learning code should be tested automatically and frequently! And automated tests mean, of course, CI!

2. Streamlining: Model deployment, monitoring and replacement should also be automated! This is where we find the famous CD!

Those will be developed in further posts, so stay tuned!

Now let’s discuss the machine learning lifecycle.

The Machine Learning Lifecycle is composed mainly of the following steps:

1) Business 🧠:
The machine learning lifecycle starts with a business problem that we’ll try to solve using machine learning. This step usually has to convert the business problem into some metric, KPI, etc. that we’re able to optimize using an ML algorithm.

2) Data Preparation 📝:
Data is usually stored somewhere in the cloud (BigQuery, S3, Cloud Storage, …) but is often not structured in a way that suits the problem in question. We should therefore be able to produce a pipeline (a standard set of operations) that is capable of processing this “scrambled” data and outputting something containing the information needed to answer the business problem.

3) Research 🔍:
At the end of data preparation, we can start both analyzing the dataset and experimenting with models to solve the business problem. The goal here is to produce a model/prototype, not something ready for production.

4) Deployment 🚀:
If we’re satisfied with our model metrics, we can move on to the deployment. Here we can include clean code, automated testing and also automated deployment to production.

5) Monitoring 🚨:
Once the model is in the serving phase, we need to monitor it to ensure good response times and performance quality for the users. It’s important to keep in mind that once the model is in production, the job is NOT done. Ensuring proper functioning and performance quality is a continual job.

How is it for you?
What does MLOps mean to you and to the company you are at?
Let me know in the comments! Don’t forget to like and share if you liked it!

Let’s talk about tests!

Tales Marra — Thu, 22 Sep 2022 20:02:37 GMT

Testing workflow

Testing is one of the most important parts in software engineering. It ensures the general functioning of components, as well as the quality of the integration of new components and code to the existing parts.

Different types of testing

The main different types of testing are:

Unit tests: these constitute the first level of testing, the closest one to the code itself. They ensure that a function does what it’s supposed to, and they cover the different cases where bugs can happen (for instance, wrong input types).

Integration tests: those test the integration between functions, modules etc. Here is where the interface between components is going to be first tested.

System testing / End to End testing: Aims to test the system as a whole, mocking inputs and checking that the overall result is correct. This is usually done by deploying the application to a staging environment and checking that everything works across the application.

User Acceptance Testing: Those are the “user” performed tests. A user, or something emulating user behavior will validate many business use cases over the platform (still in a protected environment), and if everything is ok, the code is then deployed to production.
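
To make the first two levels concrete, here is a minimal sketch in plain Python. The functions parse_amount and total are hypothetical examples invented for this illustration; in a real project you would typically run such tests with a framework like pytest rather than calling them by hand.

```python
def parse_amount(text: str) -> float:
    """Turn a string such as '12.50' into a float, rejecting non-string input."""
    if not isinstance(text, str):
        raise TypeError("amount must be a string")
    return float(text.strip())


def total(amounts: list[str]) -> float:
    """Sum a list of amount strings by parsing each one."""
    return sum(parse_amount(a) for a in amounts)


def test_parse_amount_unit():
    # Unit test: a single function, including a wrong-input-type case
    assert parse_amount(" 12.50 ") == 12.5
    try:
        parse_amount(12.5)
    except TypeError:
        pass
    else:
        raise AssertionError("expected TypeError for non-string input")


def test_total_integration():
    # Integration test: total() and parse_amount() working together
    assert total(["1.5", "2.5", " 4 "]) == 8.0


test_parse_amount_unit()
test_total_integration()
print("all tests passed")
```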

There are many different types of testing that serve different purposes; while not mandatory, I advise you to implement and automate all of them. They can (and will) be your inspectors, catching mistakes that can cost you, your team and your company a lot of time and money if shipped to production.

If you liked this reading, please like, comment and share!

Diving into CI

Tales Marra — Thu, 22 Sep 2022 19:54:43 GMT

Before we can start developing data processing pipelines, models and applications, we need the proper base to do it, otherwise it’s going to get VERY complicated VERY soon!

CI/CD are important software engineering concepts, and they are really at the base and core of any software development project, even more if scaling is an important factor! That’s why it’s important to understand and apply those concepts in everyday life as a software engineer.

Let’s begin with CI.

CI stands for continuous integration.

Integration? To what?

That’s a good question. If you are just starting to program, you probably have not done a lot of collaboration. Your local environment or Jupyter notebook is enough for you to go on.

However, when you start being one of the many engineers collaborating on a single project, things start to get tricky and many questions can appear like:

How do I integrate my contribution to the common project?
How do I know if my contribution conflicts with someone else’s? (For example, I modified a function that someone else deleted.)

and many more.

That’s where the common code base comes in handy.

Common code base where devs can contribute together

It’s a place where the source of truth of the code is stored and developers can contribute by branching off the main line. Contributions can be reviewed by peers and everyone can stay up-to-date. Some examples of this are GitHub, GitLab etc.

If you are not yet familiar with those, I recommend you start using them as soon as possible.

And how do we integrate?

Using tests of course, and lots of them!

I’ll write a dedicated post later only about tests, since it’s such an important subject that it deserves special treatment. But the idea is that the various types of tests (unit tests, integration tests, …) should be run at this common code base for each contribution and each merge to the main branch in all environments, to make sure that there are no errors or unexpected problems not caught by the team during the development process.

And of course all tests should run automatically!!

If you are wondering how, stay tuned! A practical tutorial is coming soon!

The SQL Cheat Sheet

Tales Marra — Wed, 27 May 2020 12:09:45 GMT

Nowadays, when larger and larger databases are being produced and used by companies, SQL is more needed than ever. And not all of us have the time to take course after course to learn it, even though that’s the recommended path. So, to help you out with that, I have decided to launch this little guide so that you can learn the basics and start using it right away.

This is a quick and brief summary of the main aspects of SQL, and how to use them in simple examples.

Note: Here I am using SQLite, so the keywords may differ a little from other databases, but the ideas are the same.

Part 1: Main Keywords

SELECT … FROM : The keywords for selections

The main idea that you have to have in mind for this one is that you are literally selecting something from a table.

For instance, on this table named customer:

Customer table

SELECT name FROM customer;

Will give:

Selected names of customers

Additional comments: the keyword DISTINCT allows you to see only unique results.

WHERE : How to specify conditions

To specify a condition that a record must match in order to be selected, you can use the WHERE keyword followed by the condition to match.

For example:

SELECT * FROM customer 
WHERE name LIKE 'Bill%';

This statement will select from my table only rows where the name starts with Bill.

Additional comments: The keyword LIKE searches for patterns in your condition, and the % sign matches any sequence of characters, so 'Bill%' accepts anything as long as it begins with Bill. If you want to select, for example, only some last name, you could use '% last_name'.
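
Here is a small runnable sketch of both patterns, using Python’s built-in sqlite3 module with an in-memory database. The sample names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO customer (name) VALUES (?)",
    [("Bill Smith",), ("Billy Jones",), ("Anna Smith",), ("Carl Brown",)],
)

# 'Bill%': the name must begin with Bill, anything may follow
starts_with_bill = [row[0] for row in conn.execute(
    "SELECT name FROM customer WHERE name LIKE 'Bill%' ORDER BY id")]

# '% Smith': anything may come first, the name must end with " Smith"
last_name_smith = [row[0] for row in conn.execute(
    "SELECT name FROM customer WHERE name LIKE '% Smith' ORDER BY id")]

print(starts_with_bill)
print(last_name_smith)
```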

GROUP BY: How to aggregate data

Let’s say that you have some sale table, like this one:

Sale table

But you, as a manager, would like to know how many units were sold in total and how much was the revenue of that day. This is where the GROUP BY keyword comes in, as it allows you to aggregate your data.

By doing,

SELECT date,sum(quantity) AS amount ,sum(price) AS total_amount 
FROM sale 
GROUP BY date;

You’ll do exactly that: group the rows by date, sum the quantity and price columns, give the sums new names with AS, and return a result that looks like this.

Result for group by statement
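
If you want to try it yourself, here is a minimal sketch with Python’s built-in sqlite3 module. The rows below are invented for illustration and are not the ones from the table shown above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (date TEXT, quantity INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?)",
    [
        ("2020-05-01", 2, 1000.0),
        ("2020-05-01", 1, 1500.0),
        ("2020-05-02", 3, 4000.0),
        ("2020-05-02", 2, 2500.0),
    ],
)

# One output row per date, with the sums computed inside each group
grouped = conn.execute(
    "SELECT date, sum(quantity) AS amount, sum(price) AS total_amount "
    "FROM sale GROUP BY date ORDER BY date"
).fetchall()
print(grouped)
```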

HAVING : The WHERE for aggregated data

The HAVING keyword will work in a similar way as the WHERE one, but for the aggregated data. It will filter the table based on some condition.

For example:

SELECT date,sum(quantity) AS amount , sum(price) AS total_amount FROM sale 
GROUP BY date 
HAVING total_amount>5000;

The same statement as before but adding just a condition over the summed prices will return as result only the second record (row), where the condition is met.
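
The difference between WHERE and HAVING is easiest to see side by side. The sketch below uses Python’s built-in sqlite3 module with invented rows (not the ones from the table above): WHERE filters individual rows before they are grouped, while HAVING filters the groups after the sums are computed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sale (date TEXT, quantity INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO sale VALUES (?, ?, ?)",
    [
        ("2020-05-01", 2, 1000.0),
        ("2020-05-01", 1, 1500.0),
        ("2020-05-02", 3, 4000.0),
        ("2020-05-02", 2, 2500.0),
    ],
)

# HAVING: keep only the dates whose summed price exceeds 5000
having_result = conn.execute(
    "SELECT date, sum(price) AS total_amount FROM sale "
    "GROUP BY date HAVING total_amount > 5000 ORDER BY date"
).fetchall()

# WHERE: drop individual rows priced at 1200 or less, then group what is left
where_result = conn.execute(
    "SELECT date, sum(price) AS total_amount FROM sale "
    "WHERE price > 1200 GROUP BY date ORDER BY date"
).fetchall()

print(having_result)
print(where_result)
```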

There is also the concept of JOIN, an important one, but this one requires more explanation and if you are interested, let me know and I can write a cheat sheet only for that.

ORDER BY: Ordering the results you get

This keyword is a very simple one and allows you to basically order the results you got by some field, and in ASC (ascending) or DESC (descending) order.

For example, if I had a table of people, I could just run the following statement:

SELECT * from people 
ORDER BY first_name ASC ,last_name ASC;

And as a result I would have the names of this table in alphabetical order.
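
As a quick sketch (Python’s built-in sqlite3 module, made-up names), here is how the two sort columns interact, and how DESC applies only to the column it follows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first_name TEXT, last_name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Bob", "Stone"), ("Aaron", "Davis"), ("Aaron", "Adams")],
)

# Sort by first name, and break ties using the last name
ascending = conn.execute(
    "SELECT * FROM people ORDER BY first_name ASC, last_name ASC"
).fetchall()

# DESC reverses the direction for first_name only
mixed = conn.execute(
    "SELECT * FROM people ORDER BY first_name DESC, last_name ASC"
).fetchall()

print(ascending)
print(mixed)
```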

Part 2 : Creation and Destruction

Creating and Destroying a table

The commands to do that are very simple and intuitive. They are basically:

CREATE TABLE name ( id INTEGER, description TEXT );

DROP TABLE IF EXISTS name;

Additional notes: In the CREATE TABLE, you can specify that a column (field) is a PRIMARY KEY. In SQLite, a column declared as INTEGER PRIMARY KEY is automatically filled with unique, increasing values when you don’t supply one. There are also constraints such as UNIQUE, which you can put on your columns to reject repeated values, or NOT NULL, to reject missing data. In the DROP TABLE one, I recommend you always add the IF EXISTS keyword, otherwise you will get an error if there is no such table.
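
Here is a minimal sketch of these constraints in action, using Python’s built-in sqlite3 module. The item table and its columns are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE item ("
    " id INTEGER PRIMARY KEY,"      # filled automatically when omitted
    " code TEXT UNIQUE,"            # no repeated values
    " description TEXT NOT NULL)"   # no missing data
)
conn.execute("INSERT INTO item (code, description) VALUES ('A1', 'first')")
conn.execute("INSERT INTO item (code, description) VALUES ('B2', 'second')")
ids = [row[0] for row in conn.execute("SELECT id FROM item ORDER BY id")]

# Both of these inserts violate a constraint and are rejected
errors = []
for statement in [
    "INSERT INTO item (code, description) VALUES ('A1', 'duplicate code')",
    "INSERT INTO item (code, description) VALUES ('C3', NULL)",
]:
    try:
        conn.execute(statement)
    except sqlite3.IntegrityError as exc:
        errors.append(str(exc))

print(ids)
print(errors)

# With IF EXISTS, dropping twice is harmless
conn.execute("DROP TABLE IF EXISTS item")
conn.execute("DROP TABLE IF EXISTS item")
```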

Inserting and Deleting data

This is also very simple and intuitive. To do so:

INSERT INTO table_name (column1, column2)
VALUES (value1, value2);

The main trick here is that you are not obliged to fill in all data to insert a new row.

To delete,

DELETE FROM customer
WHERE condition;

Additional Note: You can also update your table using the UPDATE and SET keywords.

For instance,

UPDATE people 
SET first_name='Bob' 
WHERE first_name='Aaron' AND last_name= 'Davis';

This will change the first name of Aaron Davis to Bob.
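
Putting the three statements together, here is a small sketch with Python’s built-in sqlite3 module. The people table layout and the rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people ("
    " id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT, age INTEGER)"
)

# age is left out here, so it is stored as NULL
conn.execute("INSERT INTO people (first_name, last_name) VALUES ('Aaron', 'Davis')")
conn.execute("INSERT INTO people (first_name, last_name, age) VALUES ('Carla', 'Reed', 41)")

# Rename Aaron Davis to Bob, then remove Carla Reed
conn.execute(
    "UPDATE people SET first_name='Bob' "
    "WHERE first_name='Aaron' AND last_name='Davis'"
)
conn.execute("DELETE FROM people WHERE last_name='Reed'")

remaining = conn.execute("SELECT first_name, last_name, age FROM people").fetchall()
print(remaining)
```

Be careful with DELETE and UPDATE: without a WHERE clause they apply to every row in the table.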

Part 3 : Interesting Stuff

Transactions

You should always try to execute your sequences of statements (commands) in transactions, which are basically an encapsulation of those statements. This can really improve performance and allows better management of potential conflicts.

BEGIN TRANSACTION;
SQL STATEMENTS
END TRANSACTION;
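
The real benefit shows up when something goes wrong halfway through. In this sketch (Python’s built-in sqlite3 module; the account table and the transfer function are made up for illustration), a failed transfer is rolled back so the database never ends up in a half-updated state.

```python
import sqlite3

# isolation_level=None: we issue BEGIN / END / ROLLBACK ourselves
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE account ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO account VALUES ('alice', 100), ('bob', 0)")


def transfer(amount):
    """Move `amount` from alice to bob; either both updates happen or neither."""
    conn.execute("BEGIN TRANSACTION")
    try:
        conn.execute("UPDATE account SET balance = balance + ? WHERE name = 'bob'", (amount,))
        conn.execute("UPDATE account SET balance = balance - ? WHERE name = 'alice'", (amount,))
        conn.execute("END TRANSACTION")  # same as COMMIT in SQLite
        return True
    except sqlite3.IntegrityError:
        conn.execute("ROLLBACK")  # undo the half-finished transfer
        return False


first = transfer(60)   # fine: alice goes from 100 to 40
second = transfer(60)  # alice would go negative, so everything is rolled back
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(first, second, balances)
```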

Triggers : Automatic Updates and Verification

Triggers are a way to automatically run some statements before or after something is done on a table. Two interesting examples of that are updating related tables after some insertion has been done, or verifying that the operation can be executed and if not, ROLLBACK (return the database to its previous state).

CREATE TRIGGER trigger_name
AFTER INSERT ON table_name
BEGIN
UPDATES IN TABLES
END;

This is an example of a trigger being set after some insertion on a table happens, in order to update related tables.
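
Here is a runnable sketch of both use cases with Python’s built-in sqlite3 module. The stock and sale tables and the trigger names are invented for the example, and the verification trigger uses RAISE(ABORT, ...), which cancels only the offending statement; RAISE(ROLLBACK, ...) would additionally roll back the surrounding transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stock (product TEXT PRIMARY KEY, units INTEGER);
CREATE TABLE sale (id INTEGER PRIMARY KEY, product TEXT, quantity INTEGER);
INSERT INTO stock VALUES ('widget', 10);

-- Verification: refuse a sale larger than the available stock
CREATE TRIGGER check_stock
BEFORE INSERT ON sale
WHEN (SELECT units FROM stock WHERE product = NEW.product) < NEW.quantity
BEGIN
    SELECT RAISE(ABORT, 'not enough stock');
END;

-- Automatic update: keep the stock table in sync with the sales
CREATE TRIGGER decrease_stock
AFTER INSERT ON sale
BEGIN
    UPDATE stock SET units = units - NEW.quantity WHERE product = NEW.product;
END;
""")

conn.execute("INSERT INTO sale (product, quantity) VALUES ('widget', 3)")
try:
    conn.execute("INSERT INTO sale (product, quantity) VALUES ('widget', 50)")
    rejected = False
except sqlite3.DatabaseError:
    rejected = True

units_left = conn.execute("SELECT units FROM stock WHERE product = 'widget'").fetchone()[0]
sales_count = conn.execute("SELECT count(*) FROM sale").fetchone()[0]
print(units_left, sales_count, rejected)
```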

Views : Saving and reusing results of queries

Views are a way to store manipulations that you did on your database so you can reuse them in other statements without altering the original tables, which is quite useful for maintaining the integrity of your database. To do so, you just put the following before your query and the query will be saved as a view.

CREATE VIEW view_name
AS select_statement;

This is possible as the result of every SQL query is a table.
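
One detail worth seeing in practice: a view saves the query, not a snapshot of its result, so it reflects later changes to the underlying table. A minimal sketch with Python’s built-in sqlite3 module (the daily_revenue view and the rows are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale (date TEXT, quantity INTEGER, price REAL);
INSERT INTO sale VALUES ('2020-05-01', 2, 1000.0), ('2020-05-02', 3, 4000.0);

CREATE VIEW daily_revenue AS
    SELECT date, sum(price) AS total_amount FROM sale GROUP BY date;
""")

# The view can be queried like any other table
before = conn.execute("SELECT * FROM daily_revenue ORDER BY date").fetchall()

# New rows in the base table show up in the view automatically
conn.execute("INSERT INTO sale VALUES ('2020-05-02', 1, 500.0)")
after = conn.execute("SELECT * FROM daily_revenue ORDER BY date").fetchall()

print(before)
print(after)
```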

For now that is it, hope you all enjoy and make use of this when you need a quick reminder or as an initial step.

If you have any comments or suggestions, I am happy to hear them!

About Success

Tales Marra — Sat, 25 Apr 2020 09:50:35 GMT


“Would you tell me, please, which way I ought to go from here?”
“That depends a good deal on where you want to get to,” said the Cat.
“I don’t much care where — ” said Alice.
“Then it doesn’t matter which way you go,” said the Cat.
“ — so long as I get somewhere,” Alice added as an explanation.
“Oh, you’re sure to do that,” said the Cat, “if you only walk long enough.”

Today’s world probably offers the greatest number of career possibilities a human being has ever had. From scientist to youtuber to digital influencer, the Internet opens up whole new paths to follow.

As much as this is exciting, it is also an issue. To my generation, the two main problems are summarized in that little extract above.

The number of paths is too great, the number of roads is too high, and the details and the potential of each choice are given and updated to us in seconds. We are a Google search away from discovering the richest and most influential people in whatever path we decide to follow. We can have the finest quality measurements about any path we are interested in. When you add to that the opinions of family, friends and loved ones… I think you can picture the situation.

And that makes us all in a sense like Alice, with no certain path to go. The famous paradox of choice is right in front of us, and we are like Buridan’s ass.

Buridan’s ass

“A paradoxical situation wherein an ass, placed exactly in the middle between two stacks of hay of equal size and quality, will starve to death since it cannot make any rational decision to start eating one rather than the other. The paradox is named after the 14th century French philosopher Jean Buridan.”

So when we ask from life which is the best way to go, it answers us exactly like the cat:

“That depends a good deal on where you want to get to”

Therefore, as a young Alice making her path through the world, the first thing we should ask ourselves is what is success to me.

And the answer may surprise you.

You might discover that success to you is being the CEO of a great multinational, or coming home at night and having your children waiting for you at the door. Or even traveling the world with nothing but a backpack on your back, deciding the next country the day before going.

And they can all be Wonderland to you as long as you are being true to yourself (another great lesson of the story of Alice, as she can only rescue Wonderland from the red queen once she figures out who she is).

I once met an old man in Paris. He had quit a great job as a doctor for an early retirement, sold everything he had and started traveling the world with nothing except the clothes on his back. Today he is already on his third world tour and his only regret is not having started sooner.

So don’t get stuck with the definition of success of others.

Define who you are and what that word means to you. And then the path will open up clearly.

The second problem is seen in the second part of that extract. The anxious Alice rushes to say to the cat:

so long as I get somewhere.

She wants to get somewhere and she wants it now. And that is us, most of the time wanting the destination, especially when we can see people who have already arrived there waving at us by videos, or by Instagram posts.

But as we look to the destination, we forget about the path. All the long nights, all the failure and frustration that those who are there now had to go through in order to arrive. So when we start to walk the road, and things seem not to go as well as expected, we begin to doubt ourselves completely and the thought of quitting comes to mind.

The precious lesson of this second part is summarized in the answer she gets from the cat:

“Oh, you’re sure to do that,” said the Cat, “if you only walk long enough.”

If you only walk long enough. Don’t forget, great things take time. You can look at any great invention, anything that someone built and that you admire. The great composers, CEOs, athletes and people who are at the top of their respective fields had to go through a long path of studies and apprenticeship in order to become who they are. Also, they were all a lost Alice once.

So don’t let the struggles of the path discourage you. Keep walking.

The path

Once you have decided your path, have accepted the fact that great things take time and start walking the path, Wonderland will appear to you.