Why Bring Your Own Container to Amazon SageMaker and How to Do It Right!

Vikesh Pandey
6 min read · Jan 20, 2023


Authors: Vikesh Pandey, Othmane Hamzaoui

In part-1 and part-2, we showed how you can easily train your model on Amazon SageMaker while also providing your own data and configurations. Though Amazon SageMaker does provide containers for most popular deep learning frameworks like TensorFlow, PyTorch, MXNet, Hugging Face, XGBoost, etc., there might be cases where you would like to Bring Your Own Container (BYOC) to train your model.

In this blog we will cover the following areas:

  • When to use BYOC and when you don’t really need it.
  • Should you wrap the training code inside the container?
  • How to BYOC in SageMaker
  • What changes and what doesn’t when you BYOC (compared to using a SageMaker managed container)

NOTE: This blog focuses only on training, but you can BYOC for data processing and inference as well.

When to use BYOC:

  • The SageMaker managed framework containers are not the ones you want to use.
  • The SageMaker provided containers include one or more dependencies that you don’t want in your training/inference environment. You can check the Dockerfiles for all SageMaker managed containers here.
  • You have regulatory and compliance requirements to use only containers you maintain.
  • You are supplying a lot of dependencies (via requirements.txt) to a SageMaker provided container, and installing them takes a long time at the start of every training job.
  • Your environment has no internet connectivity, so packages cannot be downloaded at run time.
  • You need strong reproducibility guarantees on the containers you use.

When you don’t really need BYOC:

  • The SageMaker managed framework (PyTorch, TensorFlow, etc.) containers and their versions fulfill your requirements. You can browse the list of SageMaker managed containers here.
  • You want to include a few third-party libraries. In this case, just place a “requirements.txt” in a local directory and reference that directory using the source_dir parameter in the SageMaker estimator. SageMaker will install those libraries at run time. Warning: the requirements.txt is installed at the beginning of every training job, so if the installation takes long it can slow down your iteration speed.
  • You want to add additional scripts (besides your main training file). Similar to the previous point, you can place them in the local directory and reference them using the source_dir or dependencies parameter in the SageMaker estimator, as shown in the sketch after this list.
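To make the last two points concrete, here is a minimal sketch (the folder layout and file names are illustrative, not from the earlier posts) of pointing a SageMaker managed PyTorch estimator at a local source_dir that carries a requirements.txt and an extra helper script:

# Illustrative layout:
# src/
# ├── train.py            # main training script
# ├── requirements.txt    # extra third-party libraries, installed at run time
# └── utils.py            # additional helper script
from sagemaker.pytorch.estimator import PyTorch

estimator = PyTorch(
    entry_point="train.py",      # main training script inside source_dir
    source_dir="src",            # local directory shipped to the container
    framework_version="1.12",    # SageMaker managed PyTorch container version
    py_version="py38",
    instance_count=1,
    instance_type="ml.c5.xlarge",
    role=execution_role,         # execution role, as in the earlier posts
)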

Should you include the training code inside the container?

No, you don’t have to. If you package your code in the container, every code change forces you to re-build the image, which can hurt your productivity. SageMaker provides a way to supply the training code from outside the container at run time, so you can focus on iterating over your code while re-using the same container. If you still have a strong enough reason to put the training code inside the container, you certainly can.

With that covered, let’s continue on how to BYOC in SageMaker.

How to BYOC in SageMaker

In this section, we will see how to build your own custom PyTorch container and use it to train the model. With BYOC, you won’t need to make any changes at all to your training code. These are the steps we will follow:

  1. Build the Docker image, using the same PyTorch version (1.12) as in the earlier examples
  2. Push the Docker image to a private Amazon ECR repository
  3. Use the custom image for training, instead of the SageMaker managed PyTorch container. Here you also supply the training code from outside the container. More on this later.

1. Build the Docker image

Note: We are not going to cover Docker basics here, but if you’re unfamiliar with Docker, you can check out the Docker 101 Tutorial.

1.1 Write the Dockerfile

There are two ways to adapt your custom container to work on SageMaker.

  1. Using the SageMaker provided PyTorch training toolkit: this is the preferred approach, as the toolkit sets the following up for you:
    – The locations (inside the container) for storing the code, model and other resources (more on these locations in the sketch below).
    – The entry point that contains the code to run when the container is started. You can either copy your training code into the Docker image while building it, or supply your training script at run time from outside the container.
    – Other environment variables and configuration that SageMaker requires to use the container for training.
  2. Without the SageMaker provided PyTorch training toolkit: in this case, you’ll need to ensure that the container adheres to the guidelines in this documentation.

We are going to use (and recommend) the first approach here.
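As background on that first point, the toolkit exposes the reserved locations to your training script as environment variables. Here is a minimal sketch of reading them inside train.py (the variable names are the standard SageMaker ones; the fallback paths are the documented container defaults):

import os

# Where the input channels (e.g. the "train" and "test" channels passed
# to estimator.fit()) are downloaded inside the container.
train_dir = os.environ.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train")
test_dir = os.environ.get("SM_CHANNEL_TEST", "/opt/ml/input/data/test")

# Where the trained model must be written so SageMaker uploads it to S3.
model_dir = os.environ.get("SM_MODEL_DIR", "/opt/ml/model")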

Below is the sample Dockerfile:

FROM pytorch/pytorch:1.12.0-cuda11.3-cudnn8-runtime

RUN apt-get update && apt-get install gcc -y

RUN pip install sagemaker-pytorch-training

We picked the base image from the official PyTorch repository with version 1.12, installed gcc (a dependency needed by sagemaker-pytorch-training), and finally installed sagemaker-pytorch-training itself.

1.2 Build the image

To build the image, you can use either your local machine or a cloud instance/environment where Docker is installed. Just navigate to the directory containing the Dockerfile (created in 1.1) and run in your terminal:

docker build -t {Your_AWS_Account_ID}.dkr.ecr.{Your_AWS_Region}.amazonaws.com/{Custom_Image_Name}:{tag} .

2. Push the Docker image to Amazon ECR

The next step is to push the Docker image to a Docker registry. SageMaker is natively integrated with Amazon ECR, so we will push our image there. You can use your own private repository as well.

2.1 First authenticate to ECR

aws ecr get-login-password --region {Your_AWS_Region} | docker login --username AWS --password-stdin {Your_AWS_Account_ID}.dkr.ecr.{Your_AWS_Region}.amazonaws.com

2.2 Create the ECR repository

aws ecr create-repository --repository-name "custom-pytorch-1-12"

2.3 Push the Docker image to ECR

docker push {Your_AWS_Account_ID}.dkr.ecr.{Your_AWS_Region}.amazonaws.com/{Custom_Image_Name}:{tag}

That’s it! We are now ready to use this image for training in SageMaker. For detailed instructions on pushing the image to ECR, please check Pushing a Docker image.

3. Use custom image for training

To use this image instead of the SageMaker provided PyTorch image, we use the same PyTorch estimator as in the earlier blog posts. The only new argument here is image_uri; apart from that, all other arguments are the same.

There are no changes needed to the training code.
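If you don’t want to hard-code the account ID and region, one way to construct custom_image_uri is to look them up with boto3. This is a sketch; the repository name is the one created in step 2.2 and the tag is illustrative:

import boto3

# Look up the current AWS account ID and region to build the ECR image URI.
account_id = boto3.client("sts").get_caller_identity()["Account"]
region = boto3.session.Session().region_name

custom_image_uri = f"{account_id}.dkr.ecr.{region}.amazonaws.com/custom-pytorch-1-12:latest"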

The calling code changes are shown below:

# Create the estimator object for PyTorch
from sagemaker.pytorch.estimator import PyTorch  # import the PyTorch estimator class

estimator = PyTorch(
    image_uri=custom_image_uri,          # our custom PyTorch image URI
    entry_point="train.py",              # training script
    instance_count=1,                    # number of EC2 instances needed for training
    instance_type="ml.c5.xlarge",        # type of EC2 instance(s) needed for training
    disable_profiler=True,               # disable the profiler, as it's not needed
    role=execution_role,                 # execution role used by the training job
    hyperparameters={"batch_size": 64},
)

inputs = {"train": train_input, "test": test_input}

# Start the training on the ephemeral remote compute
estimator.fit(inputs, wait=True)
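This is what “supplying the training code external to the container” means in practice: as we understand the toolkit’s behavior, when estimator.fit() is called the SageMaker SDK uploads train.py to Amazon S3, and the sagemaker-pytorch-training toolkit we installed in the image downloads and runs it when the container starts. See Using the SageMaker Training and Inference Toolkits for the details.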

What changed and what didn’t:

So let’s summarize how this approach differs from what was shown in Part-1 and Part-2.

What changed

  • The execution container in which your training script runs. Now it’s your own container instead of a SageMaker managed one, which also means you are responsible for patching and maintaining it. So your total cost of ownership for managing and running the container is higher.
  • We supplied our custom image URI via image_uri argument in the PyTorch Estimator.

What didn’t change

  • SageMaker still manages the ephemeral training compute cluster, spinning it up and shutting it down after the training job is finished.
  • The container working directories and the data, code and model output paths reserved by SageMaker, thanks to the SageMaker provided PyTorch training toolkit. SageMaker runs your container the same way it runs its own managed ones. You can read more about it in Using the SageMaker Training and Inference Toolkits.
  • The training code. Your code is agnostic to whether you used a managed container or BYOC.
  • All the arguments supplied to the training code work the same way, again thanks to the SageMaker provided PyTorch training toolkit.
  • You still supply the training script from outside the container, which is a huge benefit for iterative development. You could wrap the training script inside your container, but then every code change would force you to rebuild it, which is time consuming. Remember, time is money! :)

Conclusion

In this blog, you learned how to bring your own Docker image to SageMaker and use it for training. The key highlight of this approach is that you can supply the training code from outside the container at run time. All the reference code for this blog is available in the GitHub repository. For further reading, please check out Using Docker containers with SageMaker.

Twitter handles: @vikep0, @OHamzaoui1
