The Data Scientists’ Guide to AWS SageMaker: Key Concepts and Best Practices

Understand what AWS SageMaker is, key concepts to know as a data scientist, and best practices on how to use AWS SageMaker

Karun Thankachan
CodeX
20 min read · May 10, 2023



Are you tired of spending hours wrestling with your data to build and deploy machine learning models? Well, have no fear, AWS Sagemaker is here! In a nutshell, AWS Sagemaker is a cloud-based service that allows you to build, train, and deploy machine learning models at scale. You don’t have to worry about managing servers, configuring environments, or even writing boilerplate code. Sagemaker automates all that for you, so you can focus on the fun stuff — like analyzing data and building cool models.

But that’s not all — Sagemaker also provides a variety of built-in algorithms, frameworks, and tools that make it easy to get started with machine learning.

Overall, AWS Sagemaker is a powerful and user-friendly tool that can help data scientists of all levels build and deploy machine learning models with ease. So let's dive in and see how to use AWS SageMaker as a Data Scientist.

Setting Up the Environment

First, let’s set up our AWS environment to try out Sagemaker —

Step 1: Create an IAM role
To use AWS Sagemaker, you need to create an IAM role that grants the necessary permissions to access AWS resources, such as S3 buckets and EC2 instances. Here’s an example code snippet to create an IAM role:

import json
import boto3

iam_client = boto3.client('iam')

# Trust policy that lets SageMaker assume this role
assume_role_policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "sagemaker.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

create_role_response = iam_client.create_role(
    RoleName='Sagemaker-Role',
    AssumeRolePolicyDocument=json.dumps(assume_role_policy_document)
)

# Grant the role full access to SageMaker resources
iam_client.attach_role_policy(
    PolicyArn='arn:aws:iam::aws:policy/AmazonSageMakerFullAccess',
    RoleName='Sagemaker-Role'
)

sagemaker_role_arn = create_role_response['Role']['Arn']

In this example, we’re creating a new IAM role named “Sagemaker-Role” and attaching the “AmazonSageMakerFullAccess” policy to it, which grants full access to Sagemaker resources.

Step 2: Set up S3 buckets
AWS Sagemaker uses S3 buckets to store data and model artifacts. You’ll need to create an S3 bucket for storing your data and another one for storing your model artifacts. Here’s an example code snippet to create an S3 bucket:

import boto3

s3_client = boto3.client('s3')
s3_bucket_name = 'my-s3-bucket'

# Note: outside us-east-1 you must also pass
# CreateBucketConfiguration={'LocationConstraint': '<your-region>'}
s3_client.create_bucket(Bucket=s3_bucket_name)

In this example, we’re creating a new S3 bucket named “my-s3-bucket”. You can customize the bucket name to suit your needs.

Step 3: Configure the Sagemaker notebook instance
Finally, you’ll need to configure the Sagemaker notebook instance itself. You can do this using the Sagemaker API or the AWS console. Here’s an example code snippet to create a new notebook instance using the Python boto3 API:

import boto3

sagemaker_client = boto3.client('sagemaker')
sagemaker_instance_type = 'ml.m5.xlarge'

# Replace these with a security group and subnet from your own VPC
sg_id = 'sg-0123456789abcdef0'
subnet_id = 'subnet-0123456789abcdef0'

sagemaker_response = sagemaker_client.create_notebook_instance(
    NotebookInstanceName='Sagemaker-Instance',
    InstanceType=sagemaker_instance_type,
    RoleArn=sagemaker_role_arn,
    VolumeSizeInGB=5,
    DirectInternetAccess='Enabled',
    SecurityGroupIds=[sg_id],
    SubnetId=subnet_id,
    Tags=[{'Key': 'Name', 'Value': 'Sagemaker-Instance'}]
)
sagemaker_instance_arn = sagemaker_response['NotebookInstanceArn']

# The notebook URL becomes available once the instance is in service
sagemaker_instance_url = sagemaker_client.describe_notebook_instance(
    NotebookInstanceName='Sagemaker-Instance'
)['Url']

In this example, we’re creating a new Sagemaker notebook instance with the instance type “ml.m5.xlarge”. You can customize the instance type and volume size to suit your needs. We’re also specifying the IAM role we created earlier, along with the VPC security group and subnet the notebook should run in.

And that’s it! With these steps, you should now have a fully configured AWS Sagemaker environment ready to use for building and deploying your machine learning models.

Data preparation

Once the environment is ready, the next step in a typical data science workflow is to prepare the data and ingest it into S3.

Step 1: Load data into S3 buckets
The first step in preparing your data for Sagemaker is to load it into S3 buckets. You can do this using the Python boto3 API or the AWS console. Here’s an example code snippet to upload a CSV file to an S3 bucket using Python boto3:

import boto3

s3_client = boto3.client('s3')
bucket_name = 'my-s3-bucket'
file_name = 'my-data.csv'
object_key = 'data/my-data.csv'
s3_client.upload_file(file_name, bucket_name, object_key)

In this example, we’re uploading a CSV file named “my-data.csv” to an S3 bucket named “my-s3-bucket” and storing it in the “data” folder with the object key “data/my-data.csv”. You can customize the bucket name, file name, and object key to suit your needs.

Step 2: Clean and format the data
Next, you’ll need to clean and format the data to make it suitable for use in Sagemaker. This might involve removing missing values, encoding categorical variables, scaling numerical variables, and so on.

Here’s an example code snippet to load a CSV file from an S3 bucket into a Pandas DataFrame and perform some basic cleaning:

import boto3
import pandas as pd

s3_client = boto3.client('s3')
bucket_name = 'my-s3-bucket'
object_key = 'data/my-data.csv'
response = s3_client.get_object(Bucket=bucket_name, Key=object_key)
data = pd.read_csv(response['Body'])
data = data.dropna()
data = pd.get_dummies(data, columns=['category'])

In this example, we’re using the Python pandas library to load the CSV file from the S3 bucket into a DataFrame. We’re then dropping any rows that contain missing values using the dropna method and encoding the categorical variable "category" using the get_dummies method.
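
If your features also need scaling, as mentioned above, here is a minimal sketch using scikit-learn; the column names are placeholders for your own numeric features:

from sklearn.preprocessing import StandardScaler

# Standardize the numeric columns in place (column names are hypothetical)
numeric_cols = ['age', 'income']
scaler = StandardScaler()
data[numeric_cols] = scaler.fit_transform(data[numeric_cols])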

Step 3: Split the data into training and validation sets
Finally, you’ll need to split the data into training and validation sets. This is important to ensure that your machine learning model doesn’t overfit the training data and performs well on new, unseen data. Here’s an example code snippet to split the data into an 80/20 training/validation split:

from sklearn.model_selection import train_test_split

X = data.drop(columns=['target'])
y = data['target']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

In this example, we’re using the Python scikit-learn library to split the data into an 80/20 training/validation split. We’re first separating the features (X) from the target variable (y), and then using the train_test_split function to split the data into training and validation sets.

And that’s it! With these steps, you should now have cleaned, formatted, and split your data into training and validation sets ready for use in Sagemaker.
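
Before training, the split data is typically written back to S3 so SageMaker can read it as separate training and validation channels. Here is a minimal sketch, assuming the bucket and the data/train and data/val prefixes referenced later in this post:

import boto3
import pandas as pd

s3_client = boto3.client('s3')
bucket_name = 'my-s3-bucket'

# Built-in algorithms such as XGBoost expect the target as the first column and no header row
pd.concat([y_train, X_train], axis=1).to_csv('train.csv', index=False, header=False)
pd.concat([y_val, X_val], axis=1).to_csv('validation.csv', index=False, header=False)

s3_client.upload_file('train.csv', bucket_name, 'data/train/train.csv')
s3_client.upload_file('validation.csv', bucket_name, 'data/val/validation.csv')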

Building Machine Learning Models

Once the data is prepared, and assuming we have already done our EDA and know what model we want to build, we can proceed with the following steps —

Step 1: Select an algorithm
The first step in building a machine learning model with Sagemaker is to select an algorithm that suits your problem. Sagemaker provides a wide range of built-in algorithms for common machine learning tasks, such as regression, classification, clustering, and recommendation systems. You can also bring your own algorithm by creating a custom Docker container that implements the Sagemaker API.

Here’s an example code snippet to create an estimator object for the built-in Linear Learner algorithm using the SageMaker Python SDK:

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
region = session.boto_region_name

s3_input_train = 's3://my-s3-bucket/data/train/'
s3_input_val = 's3://my-s3-bucket/data/val/'
s3_output = 's3://my-s3-bucket/output/'

linear_learner = Estimator(
    image_uri=sagemaker.image_uris.retrieve('linear-learner', region),
    role='my-sagemaker-role',  # replace with the role ARN created earlier
    instance_count=1,
    instance_type='ml.c4.xlarge',
    output_path=s3_output,
    sagemaker_session=session
)
linear_learner.set_hyperparameters(
    feature_dim=10,
    predictor_type='binary_classifier',
    mini_batch_size=100
)
linear_learner.fit({
    'train': TrainingInput(s3_input_train, content_type='text/csv'),
    'validation': TrainingInput(s3_input_val, content_type='text/csv')
})

In this example, we’re creating an estimator object for the Linear Learner algorithm with some basic hyperparameters, such as the feature dimension, predictor type, and mini-batch size. We’re also specifying the S3 inputs for the training and validation data, as well as the output path for the model artifacts.

Step 2: Configure hyperparameters
The next step is to configure the hyperparameters for your algorithm. Hyperparameters are adjustable settings that control the training process and can significantly affect the performance of your model. Sagemaker provides a convenient way to specify hyperparameters as key-value pairs in the set_hyperparameters method of the estimator object.

Here’s an example code snippet to set hyperparameters for the built-in XGBoost algorithm using the SageMaker Python SDK:

import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
region = session.boto_region_name

s3_input_train = 's3://my-s3-bucket/data/train/'
s3_input_val = 's3://my-s3-bucket/data/val/'
s3_output = 's3://my-s3-bucket/output/'

xgboost = Estimator(
    image_uri=sagemaker.image_uris.retrieve('xgboost', region, version='1.5-1'),
    role='my-sagemaker-role',  # replace with the role ARN created earlier
    instance_count=1,
    instance_type='ml.m4.xlarge',
    output_path=s3_output,
    sagemaker_session=session
)

# configure hyperparameters
xgboost.set_hyperparameters(
    objective='binary:logistic',
    max_depth=5,
    eta=0.2,
    gamma=4,
    subsample=0.8,
    num_round=100
)

In this example, we’re setting hyperparameters for the XGBoost algorithm, such as the objective function, maximum tree depth, learning rate, and so on. These hyperparameters can have a significant impact on the performance of the model and may need to be tuned through experimentation.

Step 3: Train the model
The final step is to train the model using the configured hyperparameters and input data. You can do this by calling the fit method of the estimator object and passing in the input data as a dictionary of S3 input channels.

Note: Training data can also be streamed to the algorithm (Pipe input mode) rather than downloaded to the training instance, however that is beyond the scope of this post.

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
region = session.boto_region_name

s3_input_train = 's3://my-s3-bucket/data/train/'
s3_input_val = 's3://my-s3-bucket/data/val/'
s3_output = 's3://my-s3-bucket/output/'

xgboost = Estimator(
    image_uri=sagemaker.image_uris.retrieve('xgboost', region, version='1.5-1'),
    role='my-sagemaker-role',  # replace with the role ARN created earlier
    instance_count=1,
    instance_type='ml.m4.xlarge',
    output_path=s3_output,
    sagemaker_session=session
)
xgboost.set_hyperparameters(
    objective='binary:logistic',
    max_depth=5,
    eta=0.2,
    gamma=4,
    subsample=0.8,
    num_round=100
)

# fit the model on the CSV data uploaded earlier
xgboost.fit({
    'train': TrainingInput(s3_input_train, content_type='text/csv'),
    'validation': TrainingInput(s3_input_val, content_type='text/csv')
})

Deploying ML Models

Once the model is built, we can move on to deploying it. Deployment can be done in batch mode or online (real-time). We will cover online deployment, as batch is typically the less complex/more straightforward design.
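
For completeness, here is a minimal sketch of the batch option using a transformer created from the trained estimator; the S3 paths are placeholders:

# Batch inference sketch: create a transformer from the trained estimator
transformer = xgboost.transformer(
    instance_count=1,
    instance_type='ml.m5.large',
    output_path='s3://my-s3-bucket/batch-output/'  # hypothetical output location
)

# Run a batch transform job over CSV data stored in S3 (hypothetical input path)
transformer.transform(
    data='s3://my-s3-bucket/data/batch/',
    content_type='text/csv',
    split_type='Line'
)
transformer.wait()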

Step 1: Create an endpoint
The first step in deploying a trained model with Sagemaker is to create an endpoint that exposes the model as a web service. You can do this by calling the deploy method of the trained estimator (or of a Model object created from the saved model artifacts) and specifying the instance type and number of instances for the endpoint.

Here’s an example code snippet to deploy an endpoint for the previously trained XGBoost model using the SageMaker Python SDK:

import sagemaker
from sagemaker.model import Model

session = sagemaker.Session()
region = session.boto_region_name

s3_model_artifacts = 's3://my-s3-bucket/output/model.tar.gz'
xgboost_model = Model(
    model_data=s3_model_artifacts,
    image_uri=sagemaker.image_uris.retrieve('xgboost', region, version='1.5-1'),
    role='my-sagemaker-role',  # replace with the role ARN created earlier
    sagemaker_session=session
)

endpoint_name = 'my-endpoint'
xgboost_predictor = xgboost_model.deploy(
    initial_instance_count=1,
    instance_type='ml.t2.medium',
    endpoint_name=endpoint_name
)

In this example, we’re creating a model object from the S3 location of the previously trained model artifacts and the XGBoost Docker image. We’re also specifying the IAM role, Sagemaker session, and endpoint name. Finally, we’re deploying the endpoint with one instance of the ml.t2.medium instance type.

Step 2: Make predictions
Once the endpoint is created, you can make predictions by sending input data to the endpoint. The input data must match the format the model container expects; for the built-in XGBoost container that is CSV (or libsvm), which you control through the serializer attached to the predictor.

Here’s an example code snippet to make predictions against the deployed endpoint using the SageMaker Python SDK:

import sagemaker
from sagemaker.predictor import Predictor
from sagemaker.serializers import CSVSerializer

endpoint_name = 'my-endpoint'
xgboost_predictor = Predictor(
    endpoint_name=endpoint_name,
    sagemaker_session=sagemaker.Session(),
    serializer=CSVSerializer()  # the built-in XGBoost container expects CSV rows for inference
)

# Two example rows with 10 features each
input_data = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
]
predictions = xgboost_predictor.predict(input_data)
print(predictions)

In this example, we’re creating a Predictor object for the deployed endpoint and attaching a CSV serializer, so each row of input data is sent to the endpoint as CSV. The predict method returns the model’s prediction for each row.

Step 3: Update the model
If you want to update the deployed model with a new version, say when new data arrives and you have retrained the model, you can do this by creating a new model object from the updated model artifacts and pointing the existing endpoint at it with the predictor’s update_endpoint method.

Here’s an example code snippet to update the deployed endpoint with a new version of the XGBoost model using the SageMaker Python SDK:

import sagemaker
from sagemaker.model import Model
from sagemaker.predictor import Predictor

session = sagemaker.Session()
region = session.boto_region_name

s3_updated_model_artifacts = 's3://my-s3-bucket/output/updated-model.tar.gz'
updated_xgboost_model = Model(
    model_data=s3_updated_model_artifacts,
    image_uri=sagemaker.image_uris.retrieve('xgboost', region, version='1.5-1'),
    role='my-sagemaker-role',  # replace with the role ARN created earlier
    sagemaker_session=session
)

# Register the new model version with SageMaker
updated_xgboost_model.create(instance_type='ml.m5.large')

# Point the existing endpoint at the new model (and scale it to 3 instances)
endpoint_name = 'my-endpoint'
predictor = Predictor(endpoint_name=endpoint_name, sagemaker_session=session)
predictor.update_endpoint(
    model_name=updated_xgboost_model.name,
    initial_instance_count=3,
    instance_type='ml.m5.large'
)

A full pipeline with training, retraining, and deployment is discussed in the next section, utilizing other AWS offerings.

Integration with Other AWS Services

SageMaker, being part of the AWS ecosystem, can easily be integrated with other services. A few commonly used integrations are shown below.

AWS Lambda
AWS Lambda is a serverless compute service that allows you to run code without provisioning or managing servers. You can use Lambda to trigger Sagemaker jobs based on events in other AWS services, such as S3 or DynamoDB.

Here’s an example code snippet to trigger a Sagemaker training job from a Lambda function using Python boto3 — a typical workflow when, say, you want to re-train the model whenever new data is received.

import boto3

sagemaker = boto3.client('sagemaker')

def lambda_handler(event, context):
    # Kick off a SageMaker training job in response to the triggering event
    response = sagemaker.create_training_job(
        TrainingJobName='my-training-job',
        AlgorithmSpecification={
            'TrainingImage': 'my-algorithm-image',
            'TrainingInputMode': 'File'
        },
        RoleArn='my-sagemaker-role',  # replace with your SageMaker execution role ARN
        InputDataConfig=[
            {
                'ChannelName': 'training',
                'DataSource': {
                    'S3DataSource': {
                        'S3DataType': 'S3Prefix',
                        'S3Uri': 's3://my-s3-bucket/training-data',
                        'S3DataDistributionType': 'FullyReplicated'
                    }
                },
                'ContentType': 'text/csv',
                'CompressionType': 'None'
            }
        ],
        OutputDataConfig={
            'S3OutputPath': 's3://my-s3-bucket/output'
        },
        ResourceConfig={
            'InstanceCount': 1,
            'InstanceType': 'ml.m4.xlarge',
            'VolumeSizeInGB': 10
        },
        StoppingCondition={
            'MaxRuntimeInSeconds': 3600
        },
        Tags=[
            {
                'Key': 'my-tag',
                'Value': 'my-value'
            }
        ]
    )
    return response

In this example, we’re using the boto3 Sagemaker client to create a training job in response to a Lambda event. We're specifying the training algorithm, input data configuration, output data configuration, resource configuration, and stopping condition.

AWS Step Functions
AWS Step Functions is a serverless workflow service that allows you to coordinate the components of your application as a series of steps in a visual workflow. You can use Step Functions to create complex workflows that include Sagemaker jobs.

Here’s an example code snippet to create a Step Functions workflow that includes a Sagemaker training job using Python boto3:

import boto3

sfn = boto3.client('stepfunctions')

def create_step_function_workflow():
    response = sfn.create_state_machine(
        name='my-step-function',
        roleArn='arn:aws:iam::123456789012:role/my-step-functions-role',  # IAM role assumed by Step Functions
        definition='''
{
"Comment": "A simple AWS Step Functions state machine that includes a Sagemaker training job",
"StartAt": "CreateTrainingJob",
"States": {
"CreateTrainingJob": {
"Type": "Task",
"Resource": "arn:aws:states:::sagemaker:createTrainingJob.sync",
"Parameters": {
"AlgorithmSpecification": {
"TrainingImage": "my-algorithm-image",
"TrainingInputMode": "File"
},
"RoleArn": "my-sagemaker-role",
"InputDataConfig": [
{
"ChannelName": "training",
"DataSource": {
"S3DataSource": {
"S3DataType": "S3Prefix",
"S3Uri": "s3://my-s3-bucket/training-data",
"S3DataDistributionType": "FullyReplicated"
}
},
"ContentType": "text/csv",
"CompressionType": "None"
}
],
"OutputDataConfig": {
"S3OutputPath": "s3://my-s3-bucket/output"
},
"ResourceConfig": {
"InstanceCount": 1,
"InstanceType": "ml.m4.xlarge",
"VolumeSizeInGB": 10
},
"StoppingCondition": {
"MaxRuntimeInSeconds": 3600
},
"TrainingJobName": "my-training-job",
"Tags": [
{
"Key": "my-tag",
"Value": "my-value"
}
]
},
"Next": "CheckTrainingStatus"
},
"CheckTrainingStatus": {
"Type": "Task",
"Resource": "arn:aws:states:::sagemaker:describeTrainingJob",
"Parameters": {
"TrainingJobName.$": "$.training_job_name"
},
"Retry": [
{
"ErrorEquals": [
"States.TaskFailed"
],
"IntervalSeconds": 60,
"MaxAttempts": 5,
"BackoffRate": 2.0
}
],
"Catch": [
{
"ErrorEquals": [
"States.ALL"
],
"Next": "HandleError"
}
],
"End": true
},
"HandleError": {
"Type": "Task",
"Resource": "arn:aws:states:::sns:publish",
"Parameters": {
"TopicArn": "arn:aws:sns:us-west-2:123456789012:my-sns-topic",
"Message": "Training job failed: $.error"
},
"End": true
}
}
}
'''
)

In this example, we’re using the boto3 Step Functions client to create a state machine that includes two tasks: CreateTrainingJob and CheckTrainingStatus. The CreateTrainingJob task is a Sagemaker training job, while the CheckTrainingStatus task checks the status of the training job. This is a more typical workflow for training/re-training compared to AWS Lambda.
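
Once the state machine has been created, you can kick off a run with start_execution; the state machine ARN below is a placeholder for the one returned by create_state_machine:

# Start an execution of the state machine created above (the ARN is a placeholder)
execution = sfn.start_execution(
    stateMachineArn='arn:aws:states:us-west-2:123456789012:stateMachine:my-step-function',
    name='my-training-run-001',
    input='{"training_job_name": "my-training-job"}'
)
print(execution['executionArn'])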

Amazon SageMaker Ground Truth
Amazon SageMaker Ground Truth is a fully-managed data labeling service that makes it easy to build highly accurate training datasets for machine learning models. With SageMaker Ground Truth, you can easily create and manage custom labeling workflows, access a global pool of skilled human labelers, and use machine learning to pre-label data and reduce the time and cost of labeling.

To integrate SageMaker Ground Truth with your data science workflows, you can use the SageMaker Ground Truth API, available in the Python boto3 SDK. The API allows you to create labeling jobs, define labeling workflows, and manage the labeling process. Here’s an example code snippet for creating a labeling job using boto3 (the resource ARNs shown are placeholders):

import boto3

sagemaker = boto3.client('sagemaker')

# Note: the ARNs below are placeholders; replace them with your own resources.
response = sagemaker.create_labeling_job(
    LabelingJobName='my-labeling-job',
    LabelAttributeName='label',
    InputConfig={
        'DataSource': {
            'S3DataSource': {
                'ManifestS3Uri': 's3://my-s3-bucket/input-manifest.manifest'
            }
        }
    },
    OutputConfig={
        'S3OutputPath': 's3://my-s3-bucket/output'
    },
    RoleArn='arn:aws:iam::123456789012:role/AmazonSageMaker-ExecutionRole-20220501T152624',
    HumanTaskConfig={
        'WorkteamArn': 'arn:aws:sagemaker:us-west-2:123456789012:workteam/private-crowd/my-workteam',
        'UiConfig': {
            'UiTemplateS3Uri': 's3://my-s3-bucket/ui-template.html'
        },
        'PreHumanTaskLambdaArn': 'arn:aws:lambda:us-west-2:123456789012:function:my-pre-task-lambda',
        'TaskTitle': 'Classify the Type of Item',
        'TaskDescription': 'Classify the type of item',
        'TaskKeywords': ['item classification'],
        'NumberOfHumanWorkersPerDataObject': 3,
        'TaskTimeLimitInSeconds': 300,
        'AnnotationConsolidationConfig': {
            'AnnotationConsolidationLambdaArn': 'arn:aws:lambda:us-west-2:123456789012:function:my-consolidation-lambda'
        }
    }
)

In this example, we’re using the boto3 SageMaker client to create a labeling job with the following parameters:

  • LabelingJobName: The name of the labeling job
  • LabelAttributeName: The attribute under which the labels are stored in the output manifest
  • InputConfig: The input data for the labeling job, including the S3 location of the input manifest file
  • OutputConfig: The output location for the labeling job results
  • RoleArn: The ARN of the IAM role used to run the labeling job
  • HumanTaskConfig: The configuration for the human labeling task, including the work team, UI template, pre-task and annotation-consolidation Lambda functions, number of workers per data object, and task time limit

By using SageMaker Ground Truth in your data science workflows, you can ensure that your machine learning models are trained on accurate and high-quality datasets, leading to more accurate predictions and better business outcomes.
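
Once the job has been submitted, you can poll its progress with describe_labeling_job; a minimal sketch:

# Check the status of the labeling job created above
status = sagemaker.describe_labeling_job(LabelingJobName='my-labeling-job')
print(status['LabelingJobStatus'])  # e.g. InProgress, Completed, Failed
print(status['LabelCounters'])      # counts of labeled / unlabeled objects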

In practice, AWS SageMaker is commonly integrated into workflows such as packaging models for batch transform jobs that enrich services offered on the product, automatically re-training models based on how much drift/change occurs in feature values over time, and serving models in streaming scenarios.

Optimizing SageMaker Model Performance

Optimizing model performance is a critical task in machine learning, and Amazon SageMaker provides several techniques to help you achieve this, such as the following.

Using automatic model tuning
Amazon SageMaker also provides automatic model tuning, which is a feature that helps you find the best model configuration by automating the process of hyperparameter tuning. With automatic model tuning, you can specify the range of values for each hyperparameter, and SageMaker will run multiple training jobs with different hyperparameters to find the best model. You can use the `boto3` SageMaker client to create an automatic model tuning job, and Amazon SageMaker will automatically launch and monitor the job.

import boto3

# Create a SageMaker client
sagemaker = boto3.client('sagemaker')

# Define the hyperparameters to tune
hyperparameter_ranges = {
    'learning_rate': {
        'min_value': '0.01',
        'max_value': '0.2'
    },
    'batch_size': {
        'min_value': '16',
        'max_value': '64'
    }
}

# Define the tuning job configuration
tuning_job = {
    'HyperParameterTuningJobName': 'my-tuning-job',
    'HyperParameterTuningJobConfig': {
        'Strategy': 'Bayesian',
        'HyperParameterTuningJobObjective': {
            'Type': 'Minimize',
            'MetricName': 'validation:loss'
        },
        'ResourceLimits': {
            'MaxNumberOfTrainingJobs': 100,
            'MaxParallelTrainingJobs': 10
        },
        'ParameterRanges': {
            'IntegerParameterRanges': [
                {
                    'Name': 'batch_size',
                    'MinValue': hyperparameter_ranges['batch_size']['min_value'],
                    'MaxValue': hyperparameter_ranges['batch_size']['max_value']
                }
            ],
            'ContinuousParameterRanges': [
                {
                    'Name': 'learning_rate',
                    'MinValue': hyperparameter_ranges['learning_rate']['min_value'],
                    'MaxValue': hyperparameter_ranges['learning_rate']['max_value']
                }
            ]
        }
    },
    'TrainingJobDefinition': {
        'AlgorithmSpecification': {
            'TrainingImage': 'image-uri',
            'TrainingInputMode': 'File'
        },
        'RoleArn': 'arn:aws:iam::012345678901:role/SageMakerRole',
        'InputDataConfig': [
            {
                'ChannelName': 'train',
                'DataSource': {
                    'S3DataSource': {
                        'S3DataType': 'S3Prefix',
                        'S3Uri': 's3://my-bucket/my-data',
                        'S3DataDistributionType': 'FullyReplicated'
                    }
                },
                'ContentType': 'text/csv',
                'CompressionType': 'None'
            }
        ],
        'OutputDataConfig': {
            'S3OutputPath': 's3://my-bucket/my-output'
        },
        'ResourceConfig': {
            'InstanceType': 'ml.m5.large',
            'InstanceCount': 1,
            'VolumeSizeInGB': 10
        },
        'StoppingCondition': {
            'MaxRuntimeInSeconds': 3600
        }
    }
}

# Start the tuning job
sagemaker.create_hyper_parameter_tuning_job(**tuning_job)
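
When the tuning job completes, you can look up the best training job it found; a minimal sketch:

# Retrieve the best training job found by the tuner
result = sagemaker.describe_hyper_parameter_tuning_job(
    HyperParameterTuningJobName='my-tuning-job'
)
best = result['BestTrainingJob']
print(best['TrainingJobName'])
print(best['TunedHyperParameters'])
print(best['FinalHyperParameterTuningJobObjectiveMetric'])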

Using distributed training
Distributed training is a technique that involves splitting the training data into multiple smaller datasets and training multiple instances of a model in parallel. This technique can significantly reduce the time required to train a model, especially for large datasets. In Amazon SageMaker, you can use distributed training by specifying the number of instances and the instance type for the training job. SageMaker will automatically handle the distribution of the data and the training process across the instances.

import boto3
import sagemaker

# Set up SageMaker session and role
sagemaker_session = sagemaker.Session()
role = 'arn:aws:iam::123456789012:role/service-role/AmazonSageMaker-ExecutionRole-20220101T000001'

# Define training data location in S3
train_data = 's3://my-bucket/train/'

# Define hyperparameters for training job
hyperparameters = {'batch_size': 64, 'epochs': 10}

# Define estimator for TensorFlow training job
estimator = sagemaker.tensorflow.TensorFlow(
    entry_point='train.py',           # Name of the training script
    source_dir='./src',               # Directory containing the training script
    role=role,
    instance_count=2,                 # Number of instances for distributed training
    instance_type='ml.p3.2xlarge',    # Type of instance to use for training
    framework_version='2.4.1',        # TensorFlow version to use
    py_version='py37',                # Python version to use
    hyperparameters=hyperparameters,  # Hyperparameters to use for training
)

# Start training job on SageMaker
estimator.fit({'train': train_data})

In this example, we define the location of our training data in an S3 bucket, and specify the hyperparameters for our training job. We then create a TensorFlow estimator using the sagemaker.tensorflow.TensorFlow class, and specify the number and type of instances to use for distributed training. Finally, we start the training job on SageMaker using the fit method.

SageMaker will automatically provision and manage the instances for distributed training, and will handle data shuffling and synchronization across instances to optimize performance. With just a few lines of code, we can take advantage of the distributed computing power of SageMaker to train our machine learning models faster and more efficiently.
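
For frameworks that support the SageMaker distributed data parallel library, you can also pass an explicit distribution configuration to the estimator. A minimal sketch — note that data parallelism requires supported multi-GPU instance types such as ml.p3.16xlarge:

# Enable SageMaker distributed data parallel training (supported frameworks and instances only)
estimator = sagemaker.tensorflow.TensorFlow(
    entry_point='train.py',
    source_dir='./src',
    role=role,
    instance_count=2,
    instance_type='ml.p3.16xlarge',
    framework_version='2.4.1',
    py_version='py37',
    hyperparameters=hyperparameters,
    distribution={'smdistributed': {'dataparallel': {'enabled': True}}},
)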

By using these techniques, you can optimize your machine learning models in Amazon SageMaker to achieve better accuracy, faster training times, and lower costs.

Monitoring and Debugging Sagemaker

Monitoring and debugging machine learning models is an important part of the machine learning lifecycle, and SageMaker provides tools to help with this.

One such tool is Amazon SageMaker Debugger, which can be used to monitor and analyze the training process of machine learning models in real time, and detect issues such as overfitting and underfitting.

Here’s an overview of how to use Amazon SageMaker Debugger to monitor and debug your machine learning models:

Step 1. Set up the Debugger hook
To use Amazon SageMaker Debugger, you’ll need to add a hook to your training script that sends data to Debugger during training. You can do this using the smdebug library, which is built into SageMaker. Here's an example of how to add the hook to a TensorFlow script:

from smdebug.tensorflow import KerasHook

hook = KerasHook(out_dir='/opt/ml/output/tensorboard', save_all=True)

model.fit(x_train, y_train, validation_data=(x_test, y_test), epochs=10, callbacks=[hook])

In this example, we create a KerasHook object and pass it to the callbacks argument of the model.fit method. The out_dir argument specifies the directory where the Debugger data will be saved.

Step 2. Configure a Debugger rule
After you’ve added the Debugger hook to your training script, you can configure a Debugger rule to detect issues such as overfitting and underfitting. You can do this using the Rule class and the built-in rule configurations from the sagemaker.debugger module. Here’s an example of how to create a rule to detect overfitting:

from sagemaker.debugger import Rule, rule_configs

# Use the built-in overfit rule; its thresholds can be tuned via the rule_parameters argument
rule = Rule.sagemaker(rule_configs.overfit())

In this example, we create a rule from the built-in overfit configuration, which watches the gap between training and validation losses during training and flags the job when that gap grows too large for too long. If the defaults don’t suit your training setup, the detection thresholds can be adjusted through the rule_parameters argument.

Step 3. Start a SageMaker training job
Once you’ve added the Debugger hook to your training script and configured a Debugger rule, you can start a SageMaker training job with debugging enabled. Here’s an example of how to start a training job with debugging enabled:

from sagemaker.tensorflow import TensorFlow
from sagemaker.debugger import DebuggerHookConfig

# Tell Debugger where to store the tensors it collects during training
debugger_hook_config = DebuggerHookConfig(s3_output_path='s3://my-bucket/debugger-output')

estimator = TensorFlow(entry_point='train.py', role=role, instance_count=1,
                       instance_type='ml.p3.2xlarge', framework_version='2.4.1',
                       py_version='py37', hyperparameters={'epochs': 10},
                       debugger_hook_config=debugger_hook_config,
                       rules=[rule])  # the overfit rule configured in Step 2

estimator.fit({'train': train_data}, wait=True)

In this example, we create a DebuggerHookConfig object that tells Debugger where to store the collected data, and pass it, together with the rule from Step 2, to the TensorFlow estimator. We then start a training job with debugging enabled by calling fit.

During training, Debugger will collect data such as training and validation metrics, weights and biases, and other information about the training process. You can then use the SageMaker Debugger UI to analyze this data and detect issues such as overfitting and underfitting.
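
You can also inspect the collected data programmatically with the smdebug trials API; a minimal sketch, assuming the S3 output path configured above:

from smdebug.trials import create_trial

# Load the tensors that Debugger saved during training
trial = create_trial('s3://my-bucket/debugger-output')

# List the recorded tensors and read one of them across training steps
print(trial.tensor_names())
loss = trial.tensor('val_loss')  # the tensor name depends on your framework and metrics
for step in loss.steps():
    print(step, loss.value(step))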

Overall, using Amazon SageMaker Debugger is a powerful way to improve the quality and performance of your machine learning models. By detecting and addressing issues early in the training process, you can save time and resources and achieve better results.

Cost Optimizing with Sagemaker

Optimizing costs is an important aspect of using AWS Sagemaker. Here are some tips on how to do this:

Use Spot Instances
Managed Spot Training lets you run Sagemaker training jobs on spare EC2 capacity at a significantly lower cost than on-demand instances. You can use the Spot Instance Advisor in the AWS Management Console to help you choose the best instance type and region for your needs.

Here’s an example of how to use Managed Spot Training with the SageMaker Python SDK:

import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
region = session.boto_region_name

# Built-in XGBoost image used here purely as an example
image_uri = sagemaker.image_uris.retrieve('xgboost', region, version='1.5-1')

estimator = Estimator(
    image_uri=image_uri,
    role='my-sagemaker-role',   # replace with your SageMaker execution role ARN
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    output_path='s3://my-s3-bucket/output/',
    base_job_name='my-training-job',
    sagemaker_session=session,
    use_spot_instances=True,    # THIS IS THE IMPORTANT FLAG
    max_run=3600,               # maximum training time in seconds
    max_wait=7200,              # maximum time to wait for Spot capacity (must be >= max_run)
)

estimator.fit({'train': 's3://my-s3-bucket/data/train/'})

Use SageMaker Savings Plans
SageMaker usage is not covered by Amazon EC2 Reserved Instances. Instead, Amazon SageMaker Savings Plans let you save (up to roughly 64%) compared to on-demand pricing in exchange for a 1-year or 3-year commitment to a consistent amount of SageMaker usage. Savings Plans are purchased through the AWS Cost Management console (or the Savings Plans API), and once purchased they apply automatically to eligible SageMaker usage (notebooks, training, and inference), so no changes to your Sagemaker job code are required.

Set up a budget and cost allocation tags
You can set up a budget in AWS to monitor your Sagemaker costs and get alerts when you reach a certain threshold. You can also use cost allocation tags to track your Sagemaker costs by project or team and allocate costs accordingly.

Here’s an example of how to set up a budget in AWS using Python boto3:

import boto3

budgets = boto3.client('budgets')

# Create a budget with a monthly limit of $1,000 for Sagemaker costs
budget_name = 'my-sagemaker-budget'
limit_amount = 1000
limit_unit = 'USD'
time_unit = 'MONTHLY'
account_id = boto3.client('sts').get_caller_identity().get('Account')

response = budgets.create_budget(
    AccountId=account_id,
    Budget={
        'BudgetName': budget_name,
        'BudgetLimit': {
            'Amount': str(limit_amount),
            'Unit': limit_unit
        },
        # Restrict the budget to SageMaker usage only
        'CostFilters': {
            'Service': ['Amazon SageMaker']
        },
        'TimeUnit': time_unit,
        'BudgetType': 'COST'
    }
)

print(response)

This code creates an AWS budget with a monthly limit of $1,000 for Sagemaker costs. The create_budget method of the budgets client is called with the required parameters. The AccountId parameter is set to the account ID of the current user, which is obtained using the sts client. The Budget parameter is a dictionary that contains the name of the budget, the budget limit, cost filters that restrict the budget to Sagemaker costs, the TimeUnit (set to MONTHLY for a monthly budget), and the BudgetType (set to COST to indicate that the budget is based on cost). The response from the create_budget method is printed to the console.
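
For cost allocation tags, you can tag SageMaker resources so that their costs roll up by project or team in Cost Explorer; a minimal sketch using the add_tags API (the endpoint ARN is a placeholder):

import boto3

sagemaker = boto3.client('sagemaker')

# Attach cost allocation tags to an existing SageMaker resource (the ARN is a placeholder)
sagemaker.add_tags(
    ResourceArn='arn:aws:sagemaker:us-west-2:123456789012:endpoint/my-endpoint',
    Tags=[
        {'Key': 'Project', 'Value': 'churn-prediction'},
        {'Key': 'Team', 'Value': 'data-science'}
    ]
)

Note that the tag keys also need to be activated as cost allocation tags in the Billing console before they appear in cost reports.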

Other Best Practices for SageMaker Models

Using AWS Sagemaker effectively involves several best practices that can help you optimize performance and minimize costs. Here are some best practices to keep in mind:

  • Select the right instance types: Sagemaker provides a range of instance types optimized for different workloads, such as CPU-based or GPU-based instances. Choosing the right instance type can significantly impact performance and cost.
  • Monitor and log model performance: Use Amazon CloudWatch and Sagemaker logging to monitor your model’s performance and identify any issues, such as overfitting or underfitting (a minimal CloudWatch sketch follows this list).
  • Version your models: Versioning your models allows you to keep track of changes over time and rollback to a previous version if necessary.
  • Use automated machine learning (AutoML): AutoML can help automate the process of building and tuning machine learning models, reducing the time and effort required.
  • Validate your data: Ensure your data is of high quality and properly labeled to avoid issues with model performance.
  • Optimize hyperparameters: Use techniques like grid search or Bayesian optimization to find the best hyperparameters for your model.
  • Estimate costs: Be aware of the costs associated with running Sagemaker and use tools like AWS Cost Explorer to estimate and manage costs.
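
As referenced in the monitoring bullet above, here is a minimal sketch of pulling an endpoint metric from CloudWatch; the endpoint name is a placeholder:

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

# Invocation count for a hypothetical endpoint over the last hour
response = cloudwatch.get_metric_statistics(
    Namespace='AWS/SageMaker',
    MetricName='Invocations',
    Dimensions=[
        {'Name': 'EndpointName', 'Value': 'my-endpoint'},
        {'Name': 'VariantName', 'Value': 'AllTraffic'}
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=['Sum']
)
print(response['Datapoints'])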

Here are some common pitfalls to avoid when using Sagemaker:

  • Overfitting: Overfitting occurs when a model is too complex and fits the training data too closely, resulting in poor performance on new data. Regularization techniques like L1 and L2 regularization can help prevent overfitting.
  • Underfitting: Underfitting occurs when a model is too simple and cannot capture the complexity of the data, resulting in poor performance. Increasing the model’s complexity or using a more powerful algorithm can help prevent underfitting.
  • Poor data quality: Ensure your data is of high quality and properly labeled to avoid issues with model performance.
  • Underestimating costs: Sagemaker can be expensive, and underestimating costs can result in unexpected bills. Be sure to estimate and manage costs using tools like AWS Cost Explorer.

By following these best practices and avoiding common pitfalls, you can use Sagemaker effectively to build and deploy machine learning models in the cloud.

Conclusion

Amazon SageMaker is a powerful platform that can help data scientists build, train, and deploy machine learning models quickly and efficiently. With its flexible infrastructure, built-in algorithms, and easy integration with other AWS services, SageMaker is an excellent tool for both beginners and advanced users.

In this blog post, we’ve covered the key concepts of using SageMaker with the Python SDKs (boto3 and the SageMaker Python SDK), including setting up an environment, preparing data, building and training models, deploying models, optimizing performance, and monitoring and debugging models. By mastering these concepts, you can unlock the full potential of SageMaker and take your machine learning projects to the next level.

Credits

This post was written with help from ChatGPT. Some of the prompts used are:

Provide an overview of what AWS Sagemaker is, why it’s useful for data scientists, and how it can be used for building and deploying machine learning models. Your tone should be slightly friendly and humorous.

Explain how to set up an AWS Sagemaker environment, including creating an IAM role, setting up S3 buckets, and configuring the Sagemaker instance. Include python boto3 code

Walk through how to prepare your data for use in Sagemaker, including loading data into S3 buckets, cleaning and formatting the data, and splitting it into training and validation sets.


Karun Thankachan
CodeX

Simplifying data science concepts and domains. Get free 1-on-1 coaching @ https://topmate.io/karun