Create a Custom Training Job With Your Own Algorithm in Sagemaker

Mert Atli
11 min read · Dec 26, 2022

AWS Sagemaker is one of the most advanced machine learning services in the cloud world. If you want to learn and use Sagemaker, you need to be familiar with Docker, since Sagemaker runs training and inference workloads inside Docker containers. In this post, I will share how to dockerize your custom machine learning code for model training with Sagemaker. In the Sagemaker world, this approach of bringing your own code and training logic into Sagemaker is called bring your own container (BYOC).

We will go over:

  • creating a Python virtual environment with pipenv so we can ship the exact same packages from our local machine into the Docker container,
  • organising our training code the way the Sagemaker Docker container requires,
  • building our image and pushing it to AWS ECR,
  • triggering a Sagemaker training job,
  • saving the model artefacts to an S3 path.

1-Creating Python Virtual Environment using pipenv

I assume that you are already familiar with pipenv. If you want to learn more about it (how to install and use for instance), you can visit https://pipenv.pypa.io/en/latest/.

Let’s assume that we are using Python 3.8.12. To create a virtual python environment, create a project folder, go into the folder, create an empty file in it called Pipfile, and run the following code in the root of the project folder:

pipenv --python 3.8.12 shell

If you do not have Python 3.8.12 installed locally, pipenv will prompt you to install it. You can accept by pressing ‘y’.

Congrats, you have created and activated a Python virtual environment. To install a desired package (such as scikit-learn), run:

pipenv install scikit-learn==1.2.0

It will install scikit-learn version 1.2.0 and update the Pipfile accordingly. It also creates another file called Pipfile.lock, which pins the exact versions of all packages (and their dependencies) for consistent, reproducible dependency management.

Now you should have a Pipfile and a Pipfile.lock that store all the metadata for your Python virtual environment.
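For reference, after the install above your Pipfile should look roughly like the following. This is only a sketch: pipenv generates the [[source]] section by default, and I have added pandas here because the training code in the next section also needs it; your versions may differ.

```toml
[[source]]
url = "https://pypi.org/simple"
verify_ssl = true
name = "pypi"

[packages]
scikit-learn = "==1.2.0"
pandas = "*"

[dev-packages]

[requires]
python_version = "3.8"
```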

2-Organising the Codes for Sagemaker Docker Container

A Docker container for Sagemaker must follow a certain folder structure and file naming convention. We will cover the details in the next sections, but in a nutshell: when you provide a container to Sagemaker for model training, Sagemaker looks for an executable file called train in the working directory and runs it. When that script finishes, the training job completes with a corresponding status message.

So this train file serves as the entry point to our training code. It must reference (via imports) all the other files we use during training, such as files for preprocessing, pipeline generation, model training, evaluation, and so on.

Let’s create a folder called src in the project root and move all the model code into it. In our example, we have only the train file, with all the code in it. It looks like the following:

#!/usr/bin/env python3

import json
import logging
import os
import pickle

import pandas as pd
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

logging.basicConfig(level=logging.INFO)


# READING TRAINING_FILE_NAME ENV VARIABLE
training_file_name = os.getenv("TRAINING_FILE_NAME", "master_data.csv")

# STANDARD SAGEMAKER PATH
prefix = "/opt/ml"

# TRAINING DATA PATH
data_path = os.path.join(prefix, "input/data/train/")
data_file = os.path.join(data_path, training_file_name)

# ARTEFACT PATH
model_path = os.path.join(prefix, "model")

# HYPERPARAMETERS PATH
hyperparameters_path = os.path.join(
    prefix, "input", "config", "hyperparameters.json"
)

try:
    with open(hyperparameters_path) as f:
        hyperparameters = json.load(f)
except (FileNotFoundError, json.JSONDecodeError):
    hyperparameters = {}

################
# MODEL TRAINING
################
logging.info(f"Model will run with the following hyperparameters: {hyperparameters}")

target_name = "target"

master_df = pd.read_csv(data_file, sep=",")

X = master_df.drop(target_name, axis=1)
y = master_df[target_name]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Sagemaker passes hyperparameter values as strings, so cast them explicitly
clf = RandomForestClassifier(n_estimators=int(hyperparameters.get("n_estimators", 100)))

clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)

##################
# SAVING ARTEFACTS
##################
accuracy = metrics.accuracy_score(y_test, y_pred)
logging.info(f"Model accuracy: {accuracy}")

feature_imp = pd.Series(clf.feature_importances_, index=X.columns).sort_values(ascending=False)
logging.info(f"Feature importance: \n {feature_imp}")

with open(os.path.join(model_path, "model.pckl"), "wb") as handle:
    pickle.dump(clf, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open(os.path.join(model_path, "feature_importance.pckl"), "wb") as handle:
    pickle.dump(feature_imp, handle, protocol=pickle.HIGHEST_PROTOCOL)

Note that Sagemaker doesn’t know how to execute this train file by default. By adding the shebang #!/usr/bin/env python3 on the first line, we tell the operating system that this is a Python script, so it gets executed with the Python interpreter.

The way we have defined the training_file_name, data_path, data_file, model_path, and hyperparameters_path variables might seem a little confusing at this point, but we will see why they must be defined this way in step 5.

So at the end of this step, you should have a src folder in the project root containing train (plus any other files it references, such as preprocessing.py, model_pipeline.py, and evaluation.py).
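Before moving on to Docker, you can smoke-test the train script locally by recreating the folder layout that Sagemaker builds inside the container (these /opt/ml paths are explained in detail in step 5). The following is a minimal sketch; make_sagemaker_layout is a hypothetical helper of our own, not part of any Sagemaker API:

```python
import json
import os
import tempfile

def make_sagemaker_layout(base_dir, hyperparameters, channel="train"):
    """Recreate the folder structure Sagemaker builds inside the container."""
    data_dir = os.path.join(base_dir, "input", "data", channel)
    config_dir = os.path.join(base_dir, "input", "config")
    model_dir = os.path.join(base_dir, "model")
    for d in (data_dir, config_dir, model_dir):
        os.makedirs(d, exist_ok=True)
    # Sagemaker serialises every hyperparameter value as a string
    with open(os.path.join(config_dir, "hyperparameters.json"), "w") as f:
        json.dump({k: str(v) for k, v in hyperparameters.items()}, f)
    return data_dir, config_dir, model_dir

# Build the layout under a temp dir instead of /opt/ml
base = tempfile.mkdtemp()
data_dir, config_dir, model_dir = make_sagemaker_layout(base, {"n_estimators": 100})
print(sorted(os.listdir(config_dir)))  # ['hyperparameters.json']
```

Dropping a sample CSV into data_dir and temporarily pointing the prefix variable in train at base_dir then lets you run the whole script without Docker.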

3-Creating Sagemaker Docker Container

Now that we have our Pipfile and Pipfile.lock and have structured our code as described in the previous section, let’s create a Dockerfile to build our own Docker image.

In the root of the project directory, create a file called Dockerfile and paste the following into it:

FROM python:3.8

RUN apt-get -y update && apt-get install -y --no-install-recommends \
    libusb-1.0-0-dev \
    libudev-dev \
    build-essential \
    ca-certificates && \
    rm -fr /var/lib/apt/lists/*

# Keep python from buffering stdout - so logs are flushed quickly
ENV PYTHONUNBUFFERED=TRUE

# Don't compile bytecode
ENV PYTHONDONTWRITEBYTECODE=TRUE

ENV PATH="/opt/app:${PATH}"

ENV PYTHONPATH=.

RUN pip3 install pipenv==2022.7.4

# Install packages
WORKDIR /opt/app
COPY Pipfile Pipfile.lock ./
RUN pipenv install --deploy --system --dev

# Add src code
COPY src ./
RUN chmod +x train

Let’s go over the Dockerfile instructions one by one:

First, we base our image on the official python:3.8 Docker image:


FROM python:3.8

Then we install some system dependencies the container needs:

RUN apt-get -y update && apt-get install -y --no-install-recommends \
    libusb-1.0-0-dev \
    libudev-dev \
    build-essential \
    ca-certificates && \
    rm -fr /var/lib/apt/lists/*

Then add the required environment variables:

# Keep python from buffering stdout - so logs are flushed quickly
ENV PYTHONUNBUFFERED=TRUE

# Don't compile bytecode
ENV PYTHONDONTWRITEBYTECODE=TRUE

ENV PATH="/opt/app:${PATH}"

ENV PYTHONPATH=.

Then we install pipenv at a pinned version so we can recreate the exact Python environment from our Pipfile and Pipfile.lock from step 1:

RUN pip3 install pipenv==2022.7.4

# Install packages
WORKDIR /opt/app
COPY Pipfile Pipfile.lock ./
RUN pipenv install --deploy --system --dev

Finally, we copy the src code and add execute permission to the train file so that it can be run inside the Docker container:

# Add src code
COPY src ./
RUN chmod +x train

That is all! You now have a Dockerfile that includes necessary instructions to create the desired Sagemaker image.

4-Registering the Docker Container to AWS ECR

Sagemaker pulls training images from the AWS Elastic Container Registry (ECR) service, so we need to push our image to ECR for Sagemaker to consume it.

To build and push the image to AWS ECR, we can use the script provided by AWS here. Download build_and_push.sh and put it in your project’s root directory, where the Dockerfile resides. After some small changes to the original file, my build_and_push.sh looks like the following:

#!/usr/bin/env bash

# This script shows how to build the Docker image and push it to ECR to be ready for use
# by SageMaker.

# The argument to this script is the image name. This will be used as the image on the local
# machine and combined with the account and region to form the repository name for ECR.
image=$1

if [ "$image" == "" ]
then
    echo "Usage: $0 <image-name>"
    exit 1
fi

chmod +x src/train

# Get the account number associated with the current IAM credentials
account=$(aws sts get-caller-identity --query Account --output text)

if [ $? -ne 0 ]
then
    exit 255
fi

fullname="${account}.dkr.ecr.${AWS_DEFAULT_REGION}.amazonaws.com/${image}:latest"

# If the repository doesn't exist in ECR, create it.
aws ecr describe-repositories --repository-names "${image}" > /dev/null 2>&1

if [ $? -ne 0 ]
then
    aws ecr create-repository --repository-name "${image}" > /dev/null
fi

# Get the login command from ECR and execute it directly
aws ecr get-login-password --region "${AWS_DEFAULT_REGION}" | docker login --username AWS --password-stdin "${account}".dkr.ecr."${AWS_DEFAULT_REGION}".amazonaws.com

# Build the docker image locally with the image name and then push it to ECR
# with the full name.
docker build -t "${image}" .
docker tag "${image}" "${fullname}"

docker push "${fullname}"

For this script to work, you need AWS credentials in your environment. If you don’t have them set, define the following environment variables (the values below are just examples) in a shell with valid credentials for your AWS account before running the script:

export AWS_ACCESS_KEY_ID=AKIAIOSEXAMPLE 
export AWS_SECRET_ACCESS_KEY=wJalrRfiCYEXAMPLEKEY
export AWS_DEFAULT_REGION=eu-west-1

Then, in the same shell, run the build_and_push.sh script, providing a name to be used when registering the image in ECR (don’t forget to navigate to the project root if you are in a different folder):

sh build_and_push.sh byoc_image

It will build the Docker image specified in our Dockerfile and register it in AWS ECR under the name byoc_image. If your credentials are valid but you are unable to push the image, the IAM user behind those credentials is probably not authorised to push images to ECR; you may need to grant it the necessary permissions.

After a successful build and push, you should see your image listed under the AWS ECR service.

5-Running Training Job

There are many ways to trigger a training job in Sagemaker; we will focus on triggering it via the Python boto3 package.

To connect with the Sagemaker service, we first need to get a sagemaker client using boto3:

import boto3

sagemaker_client = boto3.client(
    "sagemaker",
    region_name=REGION,
    aws_access_key_id=AWS_ACCESS_KEY_ID,
    aws_secret_access_key=AWS_SECRET_ACCESS_KEY,
)

Then, using that client, we create a training job with the create_training_job function:

import time

name = 'byoc-training-' + time.strftime('%Y-%m-%d-%H-%M-%S')
hyperparameters = {
    'n_estimators': '100'
}
environment_variables = {
    'TRAINING_FILE_NAME': 'master_df.csv',
    'FEATURES': "sepal length (cm),sepal width (cm),petal length (cm),petal width (cm)"
}
ecr_container_url = '${YOUR_ACCOUNT_ID}.dkr.ecr.eu-west-1.amazonaws.com/byoc_image:latest'
sagemaker_role = 'arn:aws:iam::${YOUR_ACCOUNT_ID}:role/sagemaker-full-access-role'
input_s3_uri = 's3://${YOUR_BUCKET}/master_df.csv'
output_bucket = 's3://${YOUR_BUCKET}/model_artefacts/'
instance_type = 'ml.m5.4xlarge'
instance_count = 1
memory_volume = 8

_ = sagemaker_client.create_training_job(

    # a custom name for the training job
    TrainingJobName=name,

    # hyperparameters as a python dictionary (values must be strings)
    HyperParameters=hyperparameters,

    # which container image to use for training
    AlgorithmSpecification={
        # our container url from step 4
        'TrainingImage': ecr_container_url,
        'TrainingInputMode': 'File'
    },

    # an IAM role with proper permissions so that sagemaker can run
    RoleArn=sagemaker_role,

    InputDataConfig=[{
        'ChannelName': 'train',
        'DataSource': {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": input_s3_uri,
                "S3DataDistributionType": "FullyReplicated"
            }
        },
        'ContentType': 'text/csv',
        'CompressionType': 'None'
    }],

    OutputDataConfig={"S3OutputPath": output_bucket},

    ResourceConfig={
        "InstanceType": instance_type,
        "InstanceCount": instance_count,
        "VolumeSizeInGB": memory_volume
    },

    Environment=environment_variables,

    StoppingCondition={'MaxRuntimeInSeconds': 43200}
)

When we run this code, AWS spins up a training instance from the provided Docker image and runs the model code inside it, using the information we pass in the arguments of the create_training_job function.

So to understand how training is done, let’s go over those parameters one by one:

  • TrainingJobName (str):
    The name of the training job. The name must be unique within an Amazon Web Services Region in an Amazon Web Services account.
  • HyperParameters (dict):
    Algorithm-specific parameters that influence the quality of the model. You set hyperparameters before you start the learning process. When you provide this argument, Sagemaker saves this dictionary as JSON at /opt/ml/input/config/hyperparameters.json inside the running container instance:
/opt/ml
|-- input
|   |-- config
|   |   |-- hyperparameters.json ------> HERE

So in our training code, to access those hyperparameters, we need to load them from the /opt/ml/input/config/hyperparameters.json path.

⚠️ That is the reason why we defined the variable hyperparameters_path in that way in step 2.
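One detail worth knowing: Sagemaker serialises every hyperparameter value in that JSON file as a string, so numeric values need to be cast back in the training code. The following is a small sketch of how the loading could be wrapped; load_hyperparameters is our own helper name, not a Sagemaker API:

```python
import json

def load_hyperparameters(path, defaults):
    """Read hyperparameters.json, fall back to defaults for missing keys,
    and cast the string values back to the types of the defaults."""
    try:
        with open(path) as f:
            raw = json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        raw = {}
    return {key: type(default)(raw.get(key, default))
            for key, default in defaults.items()}
```

With a file containing {"n_estimators": "150"}, calling load_hyperparameters(path, {"n_estimators": 100}) returns {"n_estimators": 150} as a proper int.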

  • AlgorithmSpecification:
    The definition of container Sagemaker is going to use for training job. It is the one we created in Step 4
  • RoleArn:
    The Amazon Resource Name (ARN) that SageMaker assumes to perform tasks on your behalf during model training. You must grant this role the necessary permissions so that SageMaker can successfully complete model training.
  • InputDataConfig:
    The definition of training data. In this example, we have only one data channel called train, and the data for that channel resides in s3 under input_s3_uri, its type is text/csv, and it is not compressed.
    What Sagemaker does before running the training code is fetch the text/csv data from input_s3_uri into /opt/ml/input/data/train/the_file_name inside the running container instance. The train folder under /opt/ml/input/data/ comes from the ChannelName value:
/opt/ml
|-- input
|   |-- config
|   |   |-- hyperparameters.json
|   |   `-- resourceConfig.json
|   `-- data
|       `-- <channel_name>
|           `-- <input data> ------> HERE

So in our training code, we must read the training data from /opt/ml/input/data/train/the_file_name.

⚠️ That is the reason why we defined the variable data_file in that way in step 2.
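The same pattern generalises to any number of channels: each ChannelName you declare in InputDataConfig becomes a folder under /opt/ml/input/data/. A tiny sketch; channel_file is a hypothetical helper of our own:

```python
import os

def channel_file(file_name, channel="train", prefix="/opt/ml"):
    """Build the path where Sagemaker places an input file for a channel."""
    return os.path.join(prefix, "input", "data", channel, file_name)

print(channel_file("master_df.csv"))
# /opt/ml/input/data/train/master_df.csv
```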

  • OutputDataConfig:
    The location where Sagemaker stores training artefacts after the job is done: the trained model file, performance metrics, and any other logs/outputs to keep. In our example code, we specify an S3 path to store all the training artefacts. But how does Sagemaker know what to store after the training job is done? It simply exports everything under the container’s /opt/ml/model/ folder. So whatever we want to keep after training, we must save under /opt/ml/model/; any other files produced during training are lost when the running container instance is halted.
/opt/ml
|-- input
|   |-- config
|   |   |-- hyperparameters.json
|   |   `-- resourceConfig.json
|   `-- data
|       `-- <channel_name>
|           `-- <input data>
|-- model
|   `-- <model files> ------> HERE

⚠️ That is the reason why we stored the trained model and feature_importance objects under /opt/ml/model/ folder in step 2.

  • ResourceConfig:
    This is where we define what kind of compute instance to use for the training job.
  • Environment:
    Environment variables as a Python dictionary. Sagemaker injects these variables into the running container instance while setting it up, and we can read them as environment variables in our training code.

⚠️ With this definition, we are able to read the value of environment variable TRAINING_FILE_NAME in step 2:
training_file_name = os.getenv("TRAINING_FILE_NAME", "master_data.csv")

  • StoppingCondition:
    Specifies a limit to how long a model training job can run. It also specifies how long a managed Spot training job has to complete. When the job reaches the time limit, SageMaker ends the training job.

After we run the create_training_job function, it returns a response containing the ARN of the new training job.

You can use the following code to wait until the training ends:

waiter = sagemaker_client.get_waiter('training_job_completed_or_stopped')

waiter.wait(
    TrainingJobName=name,
    WaiterConfig={
        'Delay': 123,
        'MaxAttempts': 123
    }
)

status = sagemaker_client.describe_training_job(
    TrainingJobName=name
)

print(status['TrainingJobStatus'])
# Completed

status['ModelArtifacts']['S3ModelArtifacts']
# 's3://<<YOUR_BUCKET>>/model_artefacts/byoc-training-2022-12-26-21-12-48/output/model.tar.gz'

After a successful training, Sagemaker will have stored all the model artefacts at s3://<<YOUR_BUCKET>>/model_artefacts/byoc-training-2022-12-26-21-12-48/output/model.tar.gz. When we extract this tar file, we will see the model.pckl and feature_importance.pckl files that we stored in the training code in step 2.
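Once that tarball is downloaded locally (for example with aws s3 cp), the artefacts can be unpacked and loaded back into Python. A minimal sketch, assuming the two pickle files from step 2 sit at the top level of the archive:

```python
import pickle
import tarfile

def load_artifacts(tar_path, extract_dir="."):
    """Extract the Sagemaker model tarball and unpickle its contents."""
    with tarfile.open(tar_path, "r:gz") as tar:
        tar.extractall(path=extract_dir)
    with open(f"{extract_dir}/model.pckl", "rb") as f:
        model = pickle.load(f)
    with open(f"{extract_dir}/feature_importance.pckl", "rb") as f:
        feature_imp = pickle.load(f)
    return model, feature_imp
```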

6-Monitoring Training Jobs in AWS Console

First we go to the Sagemaker console, then from the left panel navigate to Training/Training Jobs, where we can see our training job with its details.

In particular, we can see the parameters we provided to the create_training_job function.

7-Conclusion

In this post we have discussed how to create our own training job in Sagemaker. Specifically we have seen:

  • how to use pipenv to create an easily reproducible Python virtual environment
  • how to organise our code in the format the Sagemaker Docker container requires
  • how to use Pipfile and Pipfile.lock to dockerize our code with pinned package versions
  • how to build our Docker image and push it to AWS ECR
  • how to trigger a training job in Sagemaker

In this post, we didn’t cover how to serve models in Sagemaker. We will discuss it in the next post.
