ML training using AWS Sagemaker

Nokku Prudhvi
Thomson Reuters Labs
7 min read · Dec 29, 2022

Model training is an important step in the ML lifecycle: it produces a working model that can eventually be put into an application for end users. Training is how a model learns the patterns, rules, and features in the data.

Using a systematic and repeatable model training process is of paramount importance for any organization planning to build successful machine learning models at scale. In this blog, we are going to see how to easily train an ML model using AWS SageMaker training jobs, particularly in script mode.

AWS SageMaker Training Job:

Amazon SageMaker is a fully managed service for data science and machine learning (ML) workflows. You can use Amazon SageMaker to simplify building, training, and deploying ML models. AWS SageMaker Training Jobs are one of its components: they let you train on a wide range of instance types with different capabilities of your choice (GPU, CPU, network bandwidth, etc.), making it easy to run training efficiently in the AWS cloud.

After you create the training job, SageMaker launches the ML compute instances and uses the training code and the training dataset to train the model. It saves the resulting model artifacts and other outputs in the S3 bucket you specified for that purpose.
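For example, after a job completes, the artifact location can be read directly off the estimator object (a minimal sketch, using the huggingface_estimator we will create later in this post):

# After fit() completes, the S3 location of the trained model artifact
# is available on the estimator (the path in the comment is illustrative)
model_artifact_uri = huggingface_estimator.model_data
print(model_artifact_uri)  # e.g. s3://<bucket>/<job-name>/output/model.tar.gz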

When the job is triggered, training can take place locally or on any AWS compute instance of your choice. If you run training on non-local instances, be aware that it may take a couple of minutes for AWS to provision the requested training instance. AWS has recently been addressing this startup latency with the warm pools option for training jobs.
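As an illustration, here is a minimal sketch of enabling warm pools, assuming your SageMaker Python SDK version supports the keep_alive_period_in_seconds parameter and your account has an approved warm-pool quota for the instance type:

from sagemaker.huggingface import HuggingFace

# Warm pools: keep the provisioned instance alive for 30 minutes after the
# job ends, so the next job on the same instance type skips provisioning
warm_estimator = HuggingFace(
    entry_point='train.py',
    source_dir='./scripts',
    instance_type='ml.p3.2xlarge',
    instance_count=1,
    role=role,                            # execution role, defined further below
    transformers_version='4.4',
    pytorch_version='1.6',
    py_version='py36',
    keep_alive_period_in_seconds=1800,    # keep instances warm for 30 minutes
)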

With the advantages of the cloud, good compute options, and the pay-as-you-use model, let us see how we can make use of SageMaker in our ML training.

Different ways of training using AWS SageMaker:

Using AWS SageMaker, we can train a model in the following ways:

  • Using prebuilt algorithms
  • Using script mode
  • Using custom docker containers

Prebuilt Algorithms:

AWS SageMaker provides built-in training algorithms and pre-trained models. If one of these meets your needs, you can use it for quick model training. For a list of algorithms provided by AWS SageMaker, see Use Amazon SageMaker Built-in Algorithms or Pre-trained Models.
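As a quick, hedged sketch (the bucket names and version string are placeholders), training with a built-in algorithm such as XGBoost looks like this:

import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# look up the AWS-managed container image for the built-in XGBoost algorithm
image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost", region=session.boto_region_name, version="1.5-1"
)

xgb_estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    output_path="s3://<your-bucket>/xgboost/output",  # placeholder bucket
)
xgb_estimator.set_hyperparameters(objective="reg:squarederror", num_round=100)

# built-in XGBoost expects csv or libsvm input; the S3 path is a placeholder
train_input = TrainingInput("s3://<your-bucket>/xgboost/train", content_type="csv")
xgb_estimator.fit({"train": train_input})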

Script-mode:

In script mode, you pass your own training script and use AWS SageMaker's prebuilt containers for various frameworks like Scikit-learn, XGBoost, HuggingFace, etc.

You also have the option to include your own requirements file and your own dependencies, such as a custom Python library, in your training, which we will discuss in detail.

In this blog, I will use the HuggingFace framework as an example, but the procedure, syntax, and parameters remain valid for other frameworks like PyTorch, TensorFlow, XGBoost, Scikit-learn, etc. Just by changing HuggingFace to, say, PyTorch, you can train with the PyTorch framework, though there can be some additional framework-specific parameters, which are discussed below.

Custom Docker Containers:

This option is the most customizable one. You can bring your own Docker container or extend the containers provided by AWS.

When you are using a framework such as TensorFlow, MXNet, PyTorch, or Chainer that has direct support in AWS SageMaker, you can simply supply the Python code that implements your algorithm using the SDK entry points for that framework.

AWS recommends the following options when extending a framework container:

1. Install additional dependencies. (E.g., install a specific Python library that the current AWS SageMaker containers do not include.)
2. Configure your environment. (E.g., add an environment variable to your container.)
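For example, once you have built and pushed an extended image to Amazon ECR, you point an estimator at it via image_uri (a minimal sketch; the image URI and bucket below are placeholders):

from sagemaker.estimator import Estimator

custom_estimator = Estimator(
    # your own image in ECR, e.g. one extending an AWS Deep Learning Container
    image_uri="<account-id>.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    environment={"MY_ENV_VAR": "value"},  # option 2: set environment variables
)
custom_estimator.fit({"train": "s3://<your-bucket>/train"})  # placeholder bucket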

Script Mode in Detail:

Let us see the script mode in detail.

As a prerequisite, let's set up the environment for running jobs.

In this blog, we use an AWS SageMaker Studio notebook as our environment. You can also use your local machine or other compute instead, and you can link this training job with processing and evaluation jobs via AWS SageMaker Pipelines and trigger the pipeline.

Once you create a Studio app, you can launch it and create a notebook.

Once you have a notebook, you can run the code snippet below and watch your ML training start.

Code:

import sagemaker

role = sagemaker.get_execution_role()

train_s3_uri = 's3://sagemaker-us-east-1-558105141721/samples/datasets/imdb/train'
test_s3_uri = 's3://sagemaker-us-east-1-558105141721/samples/datasets/imdb/test'

enable_local_mode_training = False

if enable_local_mode_training:
    # train_dir and test_dir should point to local copies of the dataset
    train_instance_type = "local"
    inputs = {"train": f"file://{train_dir}", "test": f"file://{test_dir}"}
else:
    train_instance_type = "ml.p3.2xlarge"
    inputs = {"train": train_s3_uri, "test": test_s3_uri}

# hyperparameters which are passed to the training job
hyperparameters = {
    'epochs': 1,
    'per_device_train_batch_size': 32,
    'model_name_or_path': 'distilbert-base-uncased',
}

huggingface_estimator_parameters = {
    'entry_point': 'train.py',
    'source_dir': './scripts',
    'instance_type': train_instance_type,
    'instance_count': 1,
    'role': role,
    'hyperparameters': hyperparameters,
    'base_job_name': 'huggingface-training-job',
    'transformers_version': '4.4',
    'pytorch_version': '1.6',
    'py_version': 'py36',
}

from sagemaker.huggingface import HuggingFace

# create the Estimator
huggingface_estimator = HuggingFace(**huggingface_estimator_parameters)
huggingface_estimator.fit(inputs)

Implementation:

Let us see what is happening here, what the estimator parameters are, and how to set them.

What happens when we call estimator.fit:

Via AWS SageMaker training jobs, AWS creates an environment by spinning up Docker containers with the chosen deep learning framework, installing your requirements, and running the training script passed via entry_point.
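For instance, fit() blocks and streams the container logs by default; you can also name the job and inspect it afterwards (a small sketch reusing the estimator defined above; the job name is illustrative):

# fit() waits for completion and streams logs to the notebook by default
huggingface_estimator.fit(inputs, job_name="huggingface-imdb-demo", wait=True)

# the completed job can be inspected afterwards
print(huggingface_estimator.latest_training_job.name)
print(huggingface_estimator.model_data)  # S3 URI of the trained model artifact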

Estimator Parameters:

entry_point (str or PipelineVariable) — Path (absolute or relative) to the Python source file which should be executed as the entry point to training. If source_dir is specified, then entry_point must point to a file located at the root of source_dir.

source_dir (str or PipelineVariable) — Path (absolute, relative or an S3 URI) to a directory with any other training source code dependencies aside from the entry point file (default: None). You can have your requirements.txt file in this folder. If source_dir is an S3 URI, it must point to a tar.gz file. Structure within this directory is preserved when training on Amazon SageMaker.
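For example, a typical source_dir layout might look like this (file names other than train.py are illustrative):

scripts/
├── train.py          # the entry_point script
├── requirements.txt  # extra pip dependencies, installed before training starts
└── utils.py          # any additional modules your training script imports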

instance_type (str) — The EC2 instance type to train the model on. For example, ‘ml.p3.2xlarge’, or ‘local’ for local mode. You can check instance pricing on the Amazon SageMaker pricing page.

instance_count (int) — Number of instances to use for training. You can use more than 1 to do multi-node distributed training.

role (str) — The AWS IAM role used to execute training jobs; it grants SageMaker permission to access your data and create resources on your behalf.

hyperparameters (dict[str, str] or dict[str, PipelineVariable]) — Hyperparameters that will be used for training (default: None). The hyperparameters are made accessible as a dict[str, str] to the training code on AWS SageMaker. You can parse them with ArgumentParser() in your entry_point script, as shown below. This is how you pass parameters for AWS SageMaker training jobs to use. For convenience, this accepts other types of keys and values, but str() will be called to convert them before training.

transformers_version (str) — Transformers version you want to use for executing your model training code. Defaults to None. Required unless image_uri is provided. Check the SageMaker documentation for the currently supported versions (e.g., 4.4 and 4.6.1 at the time of writing).

pytorch_version (str) — PyTorch version you want to use for executing your model training code. Defaults to None. Required unless tensorflow_version is provided. The currently supported versions are 1.7.1 and 1.6.0.

py_version (str) — Python version you want to use for executing your model training code. Defaults to None. Required unless image_uri is provided. If using PyTorch, the currently supported version is py36. If using TensorFlow, the currently supported version is py37.

Other Estimator parameters:

These parameters are mostly the same across all estimators, but you can cross-verify them in the SageMaker Python SDK documentation (linked in the Resources section below).

You can visit the framework of interest there and select an estimator, for example, the HuggingFace estimator.

For your quick reference, here is the equivalent code for the SKLearn and PyTorch frameworks as well:

SKlearn:

hyperparameters = {"max_depth": 20, "n_jobs": 4, "n_estimators": 120}

sklearn_estimator_parameters = {
    'entry_point': 'train.py',
    'source_dir': './scripts',
    'instance_type': train_instance_type,
    'instance_count': 1,
    'role': role,
    'hyperparameters': hyperparameters,
    'base_job_name': 'sklearn-training-job',
    'framework_version': '1.0-1',
    'py_version': 'py3',
}

from sagemaker.sklearn.estimator import SKLearn

# create the Estimator
sklearn_estimator = SKLearn(**sklearn_estimator_parameters)
sklearn_estimator.fit(inputs)

Pytorch:

hyperparameters = {"epochs": 25, "batch_size": 128, "learning_rate": 0.01}

pytorch_estimator_parameters = {
    'entry_point': 'train.py',
    'source_dir': './scripts',
    'instance_type': train_instance_type,
    'instance_count': 1,
    'role': role,
    'hyperparameters': hyperparameters,
    'base_job_name': 'pytorch-training-job',
    'framework_version': '1.5',
    'py_version': 'py3',
}

from sagemaker.pytorch import PyTorch

# create the Estimator
pytorch_estimator = PyTorch(**pytorch_estimator_parameters)
pytorch_estimator.fit(inputs)

train.py — entry-point script:

Now, let us discuss the entry_point script that we pass to AWS SageMaker in more detail.

The beginning of the Python file, where we import libraries, parse the hyperparameters, and read the AWS SageMaker environment variables, is the only part that depends on the training job configuration we saw above. The rest of the logic is custom and specific to the model you want to train. A sample template is below.

Code:

import transformers
import datasets
import argparse
import os

if __name__ == "__main__":

    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--per_device_train_batch_size", type=int, default=32)
    parser.add_argument("--model_name_or_path", type=str)

    # data, model, and output directories
    parser.add_argument("--model_dir", type=str, default=os.environ["SM_MODEL_DIR"])
    parser.add_argument("--training_dir", type=str, default=os.environ["SM_CHANNEL_TRAIN"])
    parser.add_argument("--test_dir", type=str, default=os.environ["SM_CHANNEL_TEST"])

    args, _ = parser.parse_known_args()

Apart from this boilerplate, the rest of the code is your custom logic, such as model.fit(), or in this HuggingFace scenario, trainer.train(), etc.

The hyperparameters defined in the Hugging Face Estimator are passed as named arguments and processed by ArgumentParser().
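Conceptually, SageMaker expands the hyperparameters dict into command-line flags, so the container invokes the script roughly like this (an illustration, not an exact reproduction of the internal launch command):

python train.py --epochs 1 --per_device_train_batch_size 32 --model_name_or_path distilbert-base-uncased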

You can refer to the complete custom logic for this example in the Hugging Face SageMaker documentation (see Resources below).

Writing the custom logic for different scenarios and frameworks is beyond the scope of this blog.

Once training is done, if you want to test your trained model, you can do it as part of the custom logic in the train.py file for simplicity. For example, after training the model with model.fit(X_train, y_train), you can include model.score(X_test, y_test) or model.predict() to get results. Other options are below:

For testing, we can deploy a SageMaker endpoint, run a SageMaker processing job for evaluation, or run a batch transform job.
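As one example, deploying the trained model behind a real-time endpoint for a quick smoke test might look like this (a hedged sketch; the instance type and payload are illustrative):

# deploy the model trained by the estimator to a real-time endpoint
predictor = huggingface_estimator.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)

# HuggingFace inference containers accept an {"inputs": ...} payload
print(predictor.predict({"inputs": "I loved this movie!"}))

predictor.delete_endpoint()  # clean up to stop incurring charges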

Conclusion:

We saw different ways to train a model and looked at script mode in detail. We then created a developer environment, saw how to use training jobs with different parameters, examined what train.py looks like, and saw how to include custom logic.

Resources:

https://sagemaker.readthedocs.io/en/stable/frameworks/index.html

https://sagemaker.readthedocs.io/en/stable/frameworks/huggingface/sagemaker.huggingface.html#hugging-face-estimator

https://huggingface.co/docs/sagemaker/index
