Move from local Jupyter to Amazon SageMaker — Part 2


Authors: Vikesh Pandey, Othmane Hamzaoui

In Move from local Jupyter to Amazon SageMaker — Part 1, we covered how to take an off-the-shelf example from the PyTorch Tutorials and run it using the SageMaker training API. We saw how to run code on ephemeral managed training clusters and how to persist your models by saving them to S3 after training completes. In this part, we will explain how to point your training script to your own data and how to supply your own hyperparameters and configuration files.

We will break this blog down into the following sections:

  1. Change training data location to S3 and define hyperparameters
  2. Changes to the training script
  3. Changes to SageMaker training API

Assumptions

We assume that you have already gone through Part 1, set up the SageMaker Studio IDE, and run through that exercise to train the model successfully.

With that said, let’s move ahead.

1. Change training data location to S3 and define hyperparameters

In Part 1, we used the PyTorch Dataset APIs to read a dataset that was already hosted by PyTorch. That won’t be the case in any real-world use case.
In your project, the data will be sitting in some data lake, data warehouse, or even on your local machine. Let’s extend the notebook and training script we used in Part 1 and make the changes necessary to read data from a remote location. SageMaker has native integration with Amazon S3, so we will showcase the changes assuming your data is in S3.

NOTE: If you ran the notebook from Part 1, it will already have downloaded the data into your Studio environment, in a folder called data. So, we are going to do the following now:

  1. Upload the data from the local folder to an S3 location.
  2. Tell SageMaker to pick up the data from those S3 locations.
  3. Change the training script from Part 1 to receive the data locations and hyperparameters via SageMaker environment variables.
  4. Change the fit() call to supply the training data locations in S3.

Step 1: Upload data to Amazon S3

Below is the code for uploading the files to Amazon S3. Here we use the default bucket provided by SageMaker, but you can replace it with your own (the prefix values in the snippet are just examples):

import sagemaker
from sagemaker.s3 import S3Uploader

# Default SageMaker session and bucket; the prefixes below are example values
session = sagemaker.Session()
bucket = session.default_bucket()
train_prefix = "/fashion-mnist/train"
test_prefix = "/fashion-mnist/test"

# Upload training data
S3Uploader.upload(local_path="data/FashionMNIST/raw/train-images-idx3-ubyte.gz",
                  desired_s3_uri="s3://" + bucket + train_prefix,
                  kms_key=None,
                  sagemaker_session=session)
S3Uploader.upload(local_path="data/FashionMNIST/raw/train-labels-idx1-ubyte.gz",
                  desired_s3_uri="s3://" + bucket + train_prefix,
                  kms_key=None,
                  sagemaker_session=session)

# Upload test data
S3Uploader.upload(local_path="data/FashionMNIST/raw/t10k-images-idx3-ubyte.gz",
                  desired_s3_uri="s3://" + bucket + test_prefix,
                  kms_key=None,
                  sagemaker_session=session)
S3Uploader.upload(local_path="data/FashionMNIST/raw/t10k-labels-idx1-ubyte.gz",
                  desired_s3_uri="s3://" + bucket + test_prefix,
                  kms_key=None,
                  sagemaker_session=session)
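If you want a quick, optional check that the files landed where you expect, a minimal sketch (assuming the same bucket, prefix and session variables as above) is to list the uploaded objects with S3Downloader:

from sagemaker.s3 import S3Downloader

# List the objects uploaded under the train and test prefixes
print(S3Downloader.list("s3://" + bucket + train_prefix, sagemaker_session=session))
print(S3Downloader.list("s3://" + bucket + test_prefix, sagemaker_session=session))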

Step 2: Point to training data location from S3

Now we create TrainingInput objects, which hold the S3 locations where the actual data is present. We create a separate object for the training and test data respectively. The code looks like this:

from sagemaker.inputs import TrainingInput
train_input = TrainingInput(s3_data="s3://"+bucket+train_prefix)
test_input = TrainingInput(s3_data="s3://"+bucket+test_prefix)

Step 3: Define hyperparameters

The parameters you use in your training script can be split into two types:

  1. Parameters that you update rarely but are still worth exposing as parameters, like the name of the S3 bucket where you store your data or the instance type used for training.
  2. Parameters that you update frequently with each new training job (and are often hard-coded), like batch size, learning rate, number of epochs, etc.

Let’s see what changes need to be made on the training script.

2. Changes to the training script

We need to make a number of changes to the training script. The changes can be divided into two parts:

Data handling related changes:

  1. Remove the code that downloads the dataset from PyTorch.
  2. Parse the SM_CHANNEL_TRAIN and SM_CHANNEL_TEST environment variable values as command-line arguments. These are the variables that hold the train and test data locations inside the training container.
  3. Add a method to load and parse the dataset and convert it into PyTorch Tensor objects.
  4. Create our own custom class, inheriting from the PyTorch Dataset class, to hold the raw dataset and call the method above to convert it to PyTorch Tensors.
  5. Use the custom dataset class created above to create the PyTorch DataLoaders (see the sketch right after this list).
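To make points 2–5 concrete, here is a minimal sketch of what that part of the training script could look like. The helper names (load_idx_gz, FashionMNISTFromS3) are our own illustrative choices rather than anything from the SageMaker SDK, and the parsing logic assumes the raw FashionMNIST .gz files uploaded earlier:

import argparse
import gzip
import os

import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset


def load_idx_gz(images_path, labels_path):
    # Parse the raw FashionMNIST idx files: 16-byte header for images,
    # 8-byte header for labels, then one unsigned byte per pixel/label
    with gzip.open(images_path, "rb") as f:
        images = np.frombuffer(f.read(), dtype=np.uint8, offset=16).reshape(-1, 28, 28)
    with gzip.open(labels_path, "rb") as f:
        labels = np.frombuffer(f.read(), dtype=np.uint8, offset=8)
    images = torch.from_numpy(images.copy()).float().unsqueeze(1) / 255.0
    labels = torch.from_numpy(labels.copy()).long()
    return images, labels


class FashionMNISTFromS3(Dataset):
    # Custom Dataset wrapping the tensors loaded from a SageMaker data channel
    def __init__(self, data_dir, images_file, labels_file):
        self.images, self.labels = load_idx_gz(
            os.path.join(data_dir, images_file),
            os.path.join(data_dir, labels_file),
        )

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.images[idx], self.labels[idx]


parser = argparse.ArgumentParser()
# SageMaker exposes the channel locations inside the training container
# through the SM_CHANNEL_TRAIN and SM_CHANNEL_TEST environment variables
parser.add_argument("--train", type=str, default=os.environ.get("SM_CHANNEL_TRAIN"))
parser.add_argument("--test", type=str, default=os.environ.get("SM_CHANNEL_TEST"))
args = parser.parse_args()

train_data = FashionMNISTFromS3(args.train, "train-images-idx3-ubyte.gz",
                                "train-labels-idx1-ubyte.gz")
test_data = FashionMNISTFromS3(args.test, "t10k-images-idx3-ubyte.gz",
                               "t10k-labels-idx1-ubyte.gz")
train_dataloader = DataLoader(train_data, batch_size=64, shuffle=True)
test_dataloader = DataLoader(test_data, batch_size=64)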

Hyperparameters related changes:

We will focus on the parameters that change with each training job. Those that don’t change often can go in a JSON configuration file placed next to your training scripts.
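As a quick sketch of that configuration-file approach, the rarely-changed values could live in a small config.json shipped alongside train.py. The file name and keys here are purely illustrative:

import json

# config.json holds rarely-changed settings, e.g. {"num_classes": 10, "log_interval": 100}
with open("config.json") as f:
    config = json.load(f)
num_classes = config["num_classes"]
log_interval = config["log_interval"]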

The frequently-changing parameters, on the other hand, can be supplied as command-line arguments and read via ArgumentParser, along with the other environment variables coming from SageMaker (see point 2 in the section above).

The following is an example where we add batch_size as a hyperparameter:

parser = argparse.ArgumentParser()
# SageMaker passes hyperparameters to the script as command-line strings
parser.add_argument("--batch_size", type=str, default="32")
args = parser.parse_args()
batch_size = int(args.batch_size)

As we’ll see in the next section, this allows us to change the values of our hyperparameters across multiple training jobs without having to change the source code, package it, and deploy it each time: SageMaker passes each entry of the hyperparameters dictionary to the script as a command-line argument (for example, --batch_size 64).

The rest of the training code remains the same. Check out the complete training code here to follow along.

NOTE: The only change SageMaker introduces here is point 2; the rest of the changes are independent of SageMaker and would be needed anyway if you want the training script to point to your own dataset rather than the pre-baked ones provided by PyTorch. Adding the argument parser is also good practice independent of SageMaker, as it makes your scripts more modular and easier to execute.

3. Changes to SageMaker training API

The last step is to make the changes in the calling code for the training API to supply the S3 locations and hyperparameters. We create a dictionary of inputs containing the objects that point to the S3 locations of the dataset, and simply supply that dictionary to the fit() method. The dictionary keys ("train" and "test") become channel names, which SageMaker exposes inside the training container as the SM_CHANNEL_TRAIN and SM_CHANNEL_TEST environment variables we parsed in section 2. For the hyperparameters, we pass a dictionary containing our hyperparameter values to the estimator.

Have a look at the code below:

from sagemaker.pytorch import PyTorch

# Create the estimator object for PyTorch
estimator = PyTorch(
    entry_point = "train.py",        # training script
    framework_version = "1.12",      # PyTorch framework version, same as in the default example
    py_version = "py38",             # compatible Python version to use
    instance_count = 1,              # number of EC2 instances needed for training
    instance_type = "ml.c5.xlarge",  # type of EC2 instance/s needed for training
    disable_profiler = True,         # disable the profiler, as it is not needed
    role = execution_role,           # execution role used by the training job (defined in Part 1)
    hyperparameters = {'batch_size': 64}  # hyperparameters
)

inputs = {"train":train_input, "test": test_input}
#Start the training
estimator.fit(inputs)
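As mentioned in the previous section, changing a hyperparameter does not require touching the training script. A minimal sketch, reusing the same estimator and inputs from above, could look like this:

# Launch another training job with a different batch size, without changing train.py;
# set_hyperparameters updates the values used by subsequent jobs
estimator.set_hyperparameters(batch_size=128)
estimator.fit(inputs)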

And that’s it. You can now run your training on SageMaker using data from an S3 location, while controlling the values of the hyperparameters at run time. Check out the complete code for this blog in this GitHub repository.

Conclusion

In this blog, we saw how to use your own data by uploading it to S3 and pointing SageMaker to its location. We also saw how to define and supply custom hyperparameters, giving you flexibility and speed of iteration during the exploration phase.

As promised in Part 1, we have now covered everything apart from how to bring your own training container to SageMaker and use it instead of the SageMaker-provided PyTorch container. That’s what we’ll cover in the next blog.

Want to be among the first to be notified when Part 3 gets published? Please follow the authors of this blog. We also plan to write many more “How-to” blogs on SageMaker this year, so keep watching this space for more interesting content coming your way.

Twitter handles: @vikep0, @OHamzaoui1
