Move from local Jupyter to Amazon SageMaker — Part 1

Vikesh Pandey
9 min read · Jan 6, 2023


Authors: Vikesh Pandey, Othmane Hamzaoui

Today, Jupyter notebooks are one of the go-to IDEs for most Data Scientists (DS) running their ML experiments. Generally, you install Jupyter Server on your local machine or on an IT server (in the cloud) and get started. While this gives you independence, you are often restricted by the compute power of the underlying environment (local or cloud) on which the Jupyter Server is running.

Whether it’s running a model that requires a GPU, training with datasets so large they fill up local memory, or simply running more experiments, you often hit the limit of what you can do with your local machine or IT servers.

To improve this experience, this blog breaks down the steps you need to perform to move from your local Jupyter environment to a scalable, managed machine learning environment in the cloud, such as Amazon SageMaker.

We will divide the blog into the following sections:

  1. Overview of Amazon SageMaker
  2. The challenges faced during migration
  3. Prescriptive guidance on how to solve the challenges

1. Overview of Amazon SageMaker

Amazon SageMaker provides fully managed solutions and APIs covering every step of the machine learning lifecycle, as well as the operational aspects (i.e. MLOps). The aim is to let you focus on accelerating machine learning iteration and experimentation without worrying about the underlying infrastructure and environment. SageMaker also comes with its own IDE: SageMaker Studio. It provides a single pane of glass for performing the various steps of the ML lifecycle. It’s built on top of the JupyterLab interface, customized with specific widgets that allow you to interact with the various APIs of SageMaker.

What is the difference between SageMaker and SageMaker Studio?

SageMaker is a set of APIs to perform various actions in the ML lifecycle, whereas SageMaker Studio is a UI tool/IDE that gives you access to most of those APIs while also providing a managed ML environment.

Simply put, you can use the SageMaker APIs without using the SageMaker Studio IDE. In this blog, we mention SageMaker and Studio separately wherever the distinction matters.
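For example, with valid AWS credentials you can call the SageMaker APIs from any Python environment, Studio or not. Here is a minimal sketch using the boto3 SageMaker client to list recent training jobs (purely illustrative, the rest of this blog does not depend on it):

# A minimal sketch: calling SageMaker APIs directly via boto3, no Studio IDE involved.
# Assumes AWS credentials and a default region are already configured.
import boto3

sm_client = boto3.client("sagemaker")

# List the five most recent training jobs in this account/region
response = sm_client.list_training_jobs(
    MaxResults=5,
    SortBy="CreationTime",
    SortOrder="Descending",
)

for job in response["TrainingJobSummaries"]:
    print(job["TrainingJobName"], job["TrainingJobStatus"])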

2. The challenges faced during migration

Disclaimer: The challenges DS face here stem from the paradigm shift between doing ML locally and doing it in the cloud. They are not specific to SageMaker, but to how one runs ML in the cloud versus locally. With that said, let’s proceed.

So now you are excited to try out SageMaker. You get into your AWS account, onboard to a SageMaker Studio domain and launch the SageMaker Studio IDE. But this is where the challenges begin. Let’s take a look at the “Training” part of the ML lifecycle and break down the challenges:

  1. The Mystery: When developing locally, the DS had full access to the training infrastructure (their laptop or cloud instance). Everything ran in one place. They could see where the training was happening, connect to the training environment, debug, and have full control and accessibility. In SageMaker, however, different parts of the ML lifecycle (training, tuning, deployment, monitoring, etc.) run in different execution environments. For example, training happens on a SageMaker managed training cluster which spins up on demand and tears itself down once the training job finishes, and the model, when deployed, is hosted on yet another compute. The various runtimes can sometimes be confusing.
  2. The Model: When working locally, the training code just saves the model to a local directory of your choice. Where does it go in the cloud?
  3. The Data: When DS run their experiments on a local machine, most of the time their data sits on the local machine or on some shared data lake server, and they just point the training code to those paths. How would that work in the cloud?
  4. The Configurations: Locally, the DS supplies the environment configurations and hyper-parameters right in the script itself, as command line arguments or as a JSON file. How would that work in the cloud?
  5. The DIY Dilemma: SageMaker provides fully managed containers for popular frameworks like TensorFlow, PyTorch, HuggingFace etc., but what if you want to use your own custom containers for training? How would that work in the cloud?

3. Prescriptive guidance on how to solve the challenges

As part of the solution, we will focus on the first two challenges in this blog and tackle the next two in part 2. The last challenge (point 5 above) deserves a blog of its own, and that will be part 3, the finale.

So let’s talk about how to solve these problems. We will discuss two different ways to address them:

  1. Run as-is: Often also called “the shortcut”. Easy to perform, but very inefficient, and it does not give you the benefits of running the code in the cloud.
  2. Cloud native: Slightly more work, but worth the effort, given the cost-efficiency and scalability it provides.

3.1. Run as-is

In this case, the DS spins up the SageMaker Studio IDE, launches a notebook, copy-pastes their local training code into the notebook cells and runs it. No other changes needed!

Let’s take this sample notebook from the PyTorch QuickStart tutorial. You can run this notebook as-is in SageMaker Studio if you choose the right image/kernel in Studio. We tested the notebook with the PyTorch 1.12 Python 3.8 CPU optimized image with the Python 3 kernel in Studio and it ran without errors. Have a look at the screenshot below to help choose the right image and kernel in Studio:

Image credits: Vikesh Pandey

And why it’s not the right way…

The issue in this case is that you just migrated your ML code to the cloud, but you are not using most of the benefits of the ephemeral execution environments provided by SageMaker and the cloud based IDE. You might also end up paying more by using bigger instances with more RAM, CPU and GPUs, while facing a high idle time on those compute resources. The illustration below explains this:

Image credits: Othmane Hamzaoui

One of the key benefits of using the cloud is the “pay as you go” pricing model, or in other words, “pay only for what you use, not more”. Assuming you’re working 8 hours a day, the CPU is used most of the time whereas the GPU is only used for 2 hours (25% of the instance uptime). In other words, 75% of the cost generated by the GPU went into thin air.
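The arithmetic is simple enough to sketch in a few lines, using a hypothetical hourly price for the GPU instance:

# Back-of-the-envelope cost of idle GPU time (the hourly price is hypothetical)
hours_per_day = 8        # instance kept running during the workday
gpu_hours_used = 2       # actual GPU training time per day
hourly_price = 4.00      # hypothetical $/hour for a GPU instance

utilization = gpu_hours_used / hours_per_day                   # 0.25 -> 25%
wasted_cost = (hours_per_day - gpu_hours_used) * hourly_price  # paid for, not used

print(f"GPU utilization: {utilization:.0%}")                   # 25%
print(f"Idle GPU cost per day: ${wasted_cost:.2f}")            # $24.00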

3.2 Cloud native

To take advantage of the cloud’s elasticity, you could use smaller compute when writing, debugging or testing code on small samples, and use the larger on-demand compute clusters provided by SageMaker when executing large data processing, training or inference jobs. This is also quite cost-effective, as you use the right compute for each job. However, to achieve this separation of runtimes, you need to introduce a bit of modularity and structure into your code base.

In the following example, we’ll showcase how you can move from a typical PyTorch training example that you can run locally as-is to cost-efficient and scalable training in the cloud.

We will make use of the SageMaker managed training API for PyTorch. The API allows you to launch training on a remote, ephemeral compute cluster which is fully managed by SageMaker, while paying only for the duration of the training. Let’s go through the changes required:

  1. First, simply copy all the code, cell by cell, into a Python script, or use nbconvert to do that automatically.

Note: make sure to remove the model prediction code from the example, as we are focusing on training only for the sake of this example.

  2. For now, let’s keep the data loading code as-is, reading from the PyTorch Dataset APIs. We will change this in part 2 of this blog.
  3. Since the training happens in an ephemeral compute job, we need a way to access the trained model even after the instance is terminated. When SageMaker spins up a training job, it pulls a PyTorch container (since this example uses PyTorch) and sets up a list of environment variables. One of these environment variables is SM_MODEL_DIR, a local path where every file that’s present will be packaged and pushed to an S3 bucket of your choice. We’ll save our model there:
# add a couple of imports to the beginning of the script
import os
import argparse

# immediately after the imports, read the model directory from the environment variable
parser = argparse.ArgumentParser()
parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"])
args = parser.parse_args()

And then, towards the end of the script, instead of:

torch.save(model.state_dict(), "model.pth")

Change it to:

path = os.path.join(args.model_dir, "model.pth")
torch.save(model.state_dict(), path)

You can also have a look at this reference training script, which contains the same changes.
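As a side note, if you want the same train.py to remain runnable on your laptop, where SM_MODEL_DIR is not set, one small variation is to fall back to a local folder. A minimal sketch of that idea (the fallback path ./model is an arbitrary choice):

# Sketch: keep the script runnable both locally and on SageMaker.
# SM_MODEL_DIR only exists inside a SageMaker training container,
# so fall back to a local folder when it is absent.
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument(
    "--model-dir",
    type=str,
    default=os.environ.get("SM_MODEL_DIR", "./model"),  # local fallback
)
args = parser.parse_args()

os.makedirs(args.model_dir, exist_ok=True)
# ... later: torch.save(model.state_dict(), os.path.join(args.model_dir, "model.pth"))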

  4. Now we need a way to run the training script. To do that, SageMaker provides a PyTorch Estimator, which contains information such as where the code is located, what infrastructure to run the script on, etc. SageMaker provides fully managed containers for popular frameworks like TensorFlow, PyTorch, MXNet, HuggingFace etc. Interested readers can read more here. Below is the code you need to put in the same notebook where we ran the example as-is. Just add a new cell and paste the following code:
import sagemaker  # import the SageMaker Python SDK
from sagemaker.pytorch.estimator import PyTorch  # import the PyTorch Estimator class
from sagemaker import get_execution_role  # import the helper to fetch the execution role

# Store the execution role.
# Here we use the same role that was used to create the SageMaker Studio user profile
execution_role = get_execution_role()

# Using the PyTorch estimator tells SageMaker to use an AWS provided PyTorch container
estimator = PyTorch(
    entry_point="train.py",        # training script
    framework_version="1.12",      # PyTorch framework version, same as in the default example
    py_version="py38",             # compatible Python version to use
    instance_count=1,              # number of EC2 instances needed for training
    instance_type="ml.c5.xlarge",  # type of EC2 instance(s) needed for training
    disable_profiler=True,         # disable the profiler, as it is not needed here
    role=execution_role,           # execution role used by the training job
)

# Start the training
estimator.fit()

The above code will run the training job on an ephemeral training compute instance created and managed by Amazon SageMaker. The instance is automatically terminated when the job finishes. The model produced is wrapped in a tar.gz archive and stored in a default S3 path created by the SageMaker training job, but you can override that path by providing your own Amazon S3 location as an argument to the PyTorch Estimator.
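For instance, the estimator accepts an output_path argument pointing to your own S3 location. A short sketch, reusing the imports and execution_role from the cell above (the bucket name is a placeholder):

# Sketch: send the training output to your own S3 location instead of the default bucket.
# "my-ml-artifacts-bucket" is a placeholder; replace it with a bucket you own.
estimator = PyTorch(
    entry_point="train.py",
    framework_version="1.12",
    py_version="py38",
    instance_count=1,
    instance_type="ml.c5.xlarge",
    disable_profiler=True,
    role=execution_role,
    output_path="s3://my-ml-artifacts-bucket/pytorch-quickstart/",  # custom S3 path
)
# then start the training as before
estimator.fit()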

Another thing to call out: the model created is SageMaker-agnostic. You can host/deploy this model anywhere, even outside SageMaker.
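As a sketch of what “anywhere” can look like: once the job completes, estimator.model_data holds the S3 URI of the model.tar.gz, which you can download, extract and load with plain PyTorch in any environment (the local filenames below are arbitrary):

# Sketch: retrieve the trained artifact and use it outside SageMaker.
import tarfile
import boto3
import torch

s3_uri = estimator.model_data                      # s3://<bucket>/<job-name>/output/model.tar.gz
bucket, _, key = s3_uri.replace("s3://", "").partition("/")

boto3.client("s3").download_file(bucket, key, "model.tar.gz")
with tarfile.open("model.tar.gz") as tar:
    tar.extractall(".")                            # extracts model.pth

# The artifact is a plain PyTorch state_dict; load it into the same model class
# (NeuralNetwork in the PyTorch QuickStart) wherever you want to serve it.
state_dict = torch.load("model.pth", map_location="cpu")
print(list(state_dict.keys()))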

Is there a middle way, short of leaving my local environment?

If you still want to keep working in your local Jupyter environment, you can. Just set up your AWS credentials in your local environment and use the SageMaker SDK there.
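A minimal sketch of what that looks like from a local notebook: since get_execution_role() generally only resolves a role when your credentials already assume one (as they do inside Studio), you pass the ARN of a SageMaker execution role explicitly (the ARN below is a placeholder):

# Sketch: launching the same SageMaker training job from a local Jupyter environment.
# Assumes AWS credentials are configured locally (e.g. via `aws configure`)
# and that the role ARN below is replaced with a real SageMaker execution role.
import sagemaker
from sagemaker.pytorch.estimator import PyTorch

session = sagemaker.Session()  # picks up local AWS credentials and region

estimator = PyTorch(
    entry_point="train.py",
    framework_version="1.12",
    py_version="py38",
    instance_count=1,
    instance_type="ml.c5.xlarge",
    role="arn:aws:iam::123456789012:role/MySageMakerExecutionRole",  # placeholder ARN
    sagemaker_session=session,
)
estimator.fit()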

So what did you gain by making these changes?

  1. Your training is now more modular, allowing you to run it in a different environment (compute instance) than the one where you write the code (i.e. the Jupyter IDE). You can now run multiple training jobs in parallel, each with different compute and memory requirements.
  2. It gives you the flexibility to run notebooks on a less powerful instance and run training on bigger instances, if needed.
  3. This keeps costs to a minimum by paying only for what you use, and makes scaling easier by running multiple trainings in parallel.
  4. No more idle resources: you no longer need to pay for GPU instances all day just to run a 2-hour training job.
  5. There are many more benefits of using SageMaker Studio notebooks. Interested readers can dive deep into the Amazon SageMaker Studio Notebooks architecture.

Summary and next steps

To summarize, in this blog you learned how you can move your training code from your self-hosted Jupyter environment to SageMaker and use the SageMaker Studio IDE as your Jupyter environment. All of the code related to this blog can be accessed in this git repository.

But we did not cover a few things in this blog:

  1. What if I provide my own data locations? How would that work with SageMaker?
  2. What about other environment-specific configurations and hyperparameters? How would those work?

That’s what we are going to cover in Part 2. Read on to continue your journey into SageMaker.


Vikesh Pandey

Sr. ML Specialist Solutions Architect@AWS. Opinions are my own.