Customizing SageMaker Notebook Instances

How SageMaker’s lifecycle configuration works, a collection of useful startup scripts and how to bundle them

Jonathan Merlevede
Nov 26, 2021

SageMaker Notebook Instances are reset to their original state every time they are started. “Persistent” configuration is possible through lifecycle configuration: a pair of scripts, one run when the instance is created and one run every time it is started.


If you already know how lifecycle configuration works, you can skip the first section. The second section lists some configuration scripts that I find useful. The third section describes how I bundle multiple configurations into a single script and keep them from getting messy; I believe this is novel content.

There already exist many good posts and articles about lifecycle configuration, but I hope that this fairly in-depth addition to the heap proves helpful. Let me know, e.g. by clapping, if it is :).

The SageMaker Notebook Instance lifecycle

SageMaker Notebook Instances are reset to their original state every time they are started. The only persistent state is an AWS-managed EBS volume mounted at /home/ec2-user/SageMaker during startup. This path is shown as the root folder of the Jupyter Notebook interface. You may configure and install applications on the instances as you please (e.g. using yum), but this all disappears when restarting the instance.

AWS maintains the base image. Although the resetting has disadvantages, it does mean that you get a reproducible environment, security patches and package upgrades automatically simply by restarting your instance.

Lifecycle configuration scripts

Making “persistent” changes to the instance is possible through lifecycle configuration scripts. There are two such scripts, one that runs only once at creation of the instance (on-create), and one that runs every time the instance is started (on-start). Usually, on-start is more useful than on-create.

  • Both scripts are limited in length: they cannot be longer than 16384 characters. The length of a script is measured after base64-encoding, which adds roughly 33% overhead, so there are really only about 12288 usable characters.


  • Both scripts are limited in execution time: they cannot run for longer than 5 minutes. If that does not suffice, or if you want instances to start faster, run long-running work in a non-blocking way, e.g. using nohup or a screen session (see the snippet after this list).
  • Your instance state changes from “starting” to “running” only after the scripts return successfully (that is, with exit code 0).
  • If the scripts do not exit successfully, the instance will not start and its state becomes “failed”. Startup does not “fail fast”: you have to wait for the 5-minute timeout before the state changes from “starting” to “failed”.
  • At creation of an instance, both scripts will run.
  • When your instance runs as part of your own VPC on a private subnet, it will not be able to connect to CloudWatch, so you will have to debug your scripts without any logging output.
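
Two of these points translate directly into shell. The first line below checks how close an on-start.sh on your machine is to the post-encoding limit; the last line is a sketch of pushing a slow step to the background so the script returns quickly (slow-setup.sh is a made-up name):

# How close is the script to the 16384-character limit? (measured after base64-encoding)
base64 -w 0 on-start.sh | wc -c

# slow-setup.sh is a hypothetical long-running step; run it with nohup so on-start can return
nohup bash /startup/slow-setup.sh > /var/log/slow-setup.log 2>&1 &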

Some good practices

  • Both on-create and on-start run as root. This allows for full control of the instance, but you may want to run commands as ec2-user or chown files you create to ec2-user to ensure appropriate ownership.
  • Prefer on-start over on-create. For example, rather than downloading a file from on-create, guard the download in on-start with a “file does not exist” check; that way, if someone deletes the file, it reappears when the instance restarts.
  • Depend as little as possible on external services (e.g. for hosting script files) in your lifecycle scripts; if you do, make sure the exit code of your scripts remains successful (e.g. by appending || : to your commands; see the example after this list).
  • The lack of logging combined with the fail-after-timeout behavior makes development of lifecycle scripts tedious. To get quicker feedback, run scripts on an already-running instance first as explained here.
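
As a toy illustration of the first and third points above (the URL is made up):

sudo -H -i -u ec2-user wget -q https://example.com/somefile -O /home/ec2-user/SageMaker/somefile || :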

Developing lifecycle configuration scripts is tedious: cycles are long, and you may only get binary “success” as feedback.

Useful configuration scripts

What follows is an overview of excerpts from on-start (Bash) scripts that I have found useful. I use only on-start configuration.

Some of the snippets below should be run as root, but most can (and should) be run as ec2-user through sudo (e.g. sudo -H -i -u ec2-user …); which is which should be clear from context.

Persisting conda environments

When creating a new conda environment, a reasonable expectation is for it to still be there after restarting your instance. By default, this is not the case.

To enable persistent conda environments, have conda look for environments in multiple directories by adding an envs_dirs entry to .condarc: a directory on the persisted volume (e.g. SageMaker/.conda/envs) in addition to the default location ($HOME/anaconda3/envs). List the persistent location first, so that conda creates new environments there by default.

# Run as ec2-user: create the persistent environment directory and point conda at it
mkdir -p "$HOME/SageMaker/.conda/envs"
cat << EOF >> "$HOME/.condarc"
envs_dirs:
  - $HOME/SageMaker/.conda/envs
  - $HOME/anaconda3/envs
EOF

You likely also want these custom environments to be registered with ipykernel when the instance starts, so that they show up in the notebook UI:

# Register every environment on the persistent volume as a Jupyter kernel
for env in $("$HOME/anaconda3/condabin/conda" env list | grep SageMaker/.conda | cut -f 1 -d " "); do
  "$HOME/SageMaker/.conda/envs/$env/bin/python" -m ipykernel install --user --name "$env" --display-name "$env" || :
done
  • The || : ensures that environments in which ipykernel is not installed do not cause the startup script to exit unsuccessfully.
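
With both snippets in place, creating a persistent environment from a notebook terminal works as usual; the environment name below is just an example:

# Created under SageMaker/.conda/envs because that directory comes first in envs_dirs
conda create --yes --name myenv python=3.9 ipykernel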

Stop the SSM agent

Notebook instances run in an AWS-managed account. The role you assign to an instance is assumed by the processes running on it through undocumented black magic. However, this black magic does not always work. Specifically, it does not work for the SSM agent, at least not when you connect the instance to your own VPC as we do.

  • “Black magic” here means iptables DNAT rules that redirect IAM credential requests aimed at the metadata server (at 169.254.169.254) to a custom, AWS-managed endpoint that masquerades as the metadata server but returns temporary credentials for the role you configured. Presumably after securely verifying your instance’s identity :). You can check for these rules yourself; see the snippet after this list.
  • As pre-configured, the NBI currently uses its own, uncredentialled role, which simply does not have access to your account’s SSM APIs. Changing the configuration so that the SSM agent uses the role you configured for your instance trips up SSM too: it denies access because of “the caller’s identity not matching the identity of the credentials”.
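
If you are curious, you can look for these rules from a terminal on the instance (this assumes the redirection is indeed implemented as iptables NAT rules, as described above):

sudo iptables -t nat -S | grep 169.254.169.254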

I don’t like the idea of the SSM agent, running on the instance by default, constantly reaching out only to be painfully rejected by AWS. Such a life is not worth living. So, I swiftly execute (stop and uninstall) the SSM agent on start:

# Runs as root; handles both Amazon Linux (Upstart) and Amazon Linux 2 (systemd)
ssm_status=$(status amazon-ssm-agent 2>&1 || systemctl status amazon-ssm-agent 2>&1)
if [[ "$ssm_status" =~ "running" ]]; then
  stop amazon-ssm-agent || systemctl stop amazon-ssm-agent
  yum erase amazon-ssm-agent --assumeyes
fi
  • Uninstalling takes some time; you can just leave ssm-agent installed too.
  • The snippet above “supports” both Upstart syntax (status, stop) and systemctl syntax (systemctl status, systemctl stop) for compatibility with instances running Amazon Linux and Amazon Linux 2.

Persisting configuration and SSH keys

Many applications store small configuration files in ~/.config. It is nice for such configuration to survive reboots. The same goes for SSH keys.

# Run as ec2-user: keep these directories on the persistent volume and symlink them into $HOME
PHOME="$HOME/SageMaker"
for dir in ".config" ".ssh"; do
  # first start only: move the existing directory to the persistent volume
  [ ! -d "$PHOME/$dir" ] && [ -d "$HOME/$dir" ] && mv "$HOME/$dir" "$PHOME/$dir"
  mkdir -p "$PHOME/$dir"
  # replace whatever is left in $HOME with a symlink to the persistent copy
  [ -d "$HOME/$dir" ] && rm -r "$HOME/$dir"
  ln -s "$PHOME/$dir" "$HOME/$dir"
done
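
With ~/.ssh symlinked onto the EBS volume, a key generated the usual way (e.g. for Git over SSH) now survives restarts:

ssh-keygen -t ed25519 -f "$HOME/.ssh/id_ed25519" -N ""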

Pre-installing binaries

It’s nice to have a place to install binaries that is pre-configured to be on the system’s $PATH:

# Make a persistent bin directory part of every login shell's PATH
mkdir -p "$HOME/SageMaker/.bin"
cat << "EOF" > /etc/profile.d/my-path.sh
if [ -d "$HOME/SageMaker/.bin" ]; then
  export PATH="$HOME/SageMaker/.bin:$PATH"
fi
EOF
chmod 644 "/etc/profile.d/my-path.sh"

You can pre-install binaries in here too:

# mybin and mydomain.mytld are placeholders; only download if the binary is not there yet
umask 002
if [ ! -f "$HOME/SageMaker/.bin/mybin" ]; then
  wget "https://mydomain.mytld/mybin" -O "$HOME/SageMaker/.bin/mybin"
  chmod +x "$HOME/SageMaker/.bin/mybin"
fi

Environment variables

There are a number of environment variables that I find useful to set globally. For example, I pre-install rclone and configure sensible defaults (for my environment) using environment variables as follows:

cat << EOF > /etc/profile.d/my-env.sh
export RCLONE_S3_ENV_AUTH="true"
export RCLONE_S3_REGION="eu-west-3"
export RCLONE_S3_SERVER_SIDE_ENCRYPTION="aws:kms"
export RCLONE_S3_NO_CHECK_BUCKET="true"
export AWS_DEFAULT_REGION="eu-west-3"
EOF
chmod 644 "/etc/profile.d/my-env.sh"
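
With those defaults in place, rclone works against S3 from any login shell without further configuration. For example (the bucket name is made up; :s3: is rclone’s on-the-fly remote syntax):

rclone copy "$HOME/SageMaker/data" :s3:my-bucket/data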

Autostop

Idling notebooks are a fact of life, and people will forget to shut their instances down. You’d think auto-shutdown would be a built-in SageMaker NBI feature, but it is not. Luckily, AWS does provide a script, autostop.py.

The autostop.py script queries connected kernels for the time of their last activity through the Jupyter API and shuts down the instance if no kernel was active for a configurable amount of time. The only thing you have to do is run it regularly:

(crontab -l 2>/dev/null; echo "*/5 * * * * /usr/bin/python /startup/autostop.py --time 3600 --ignore-connections") | crontab -

You do have to get the autostop.py script onto the instance (above, at /startup/autostop.py), either using curl or wget as suggested, or by bundling it with other startup scripts as outlined below. I prefer to embed the script in on-start.sh, as the dependency on GitHub being up is unnecessary.

  • The autostop.py script will only work when the role attached to the instance has permission to describe itself (sagemaker:DescribeNotebookInstance) and to shut itself down (sagemaker:StopNotebookInstance).
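
If the role lacks these permissions, an inline policy along the following lines should do; the role and policy names here are made up:

aws iam put-role-policy \
  --role-name my-notebook-execution-role \
  --policy-name allow-autostop \
  --policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Action": ["sagemaker:DescribeNotebookInstance", "sagemaker:StopNotebookInstance"],
      "Resource": "*"
    }]
  }'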

Bundling scripts

You can only submit a single on-start (and on-create) script to a lifecycle configuration on AWS, and you can only attach a single configuration to an instance. There is more than one script above, and you can find many more online, all called “on-start.sh” and seemingly all expecting to run in splendid isolation from other startup scripts. Some scripts you want to run as root, but others you don’t. Creating one big script is tedious.

Instead of creating a single humongous script, I put the different scripts into a local directory, say $script, and encode its contents as a base64-encoded, compressed tarball as follows:

enc=$(tar czf - -C "$script" . | base64 -w 0)

Then, I generate the on-start.sh script by inlining $enc, decompressing and decoding it:

cat << EOF > on-start.sh
#!/bin/bash
set -exuo pipefail
mkdir -p /startup
echo "$enc" | base64 -d | tar xzf - -C /startup
sudo chown -R root: /startup
/startup/stop-ssm.sh || :
sudo -H -i -u ec2-user /startup/persist-config.sh || :
EOF

Decoding and untarring the inlined tarball makes all the scripts in $script available inside the /startup directory on the NBI. In the example, I run stop-ssm.sh as root, while running persist-config.sh as ec2-user.
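
For reference, the $script directory in this example could look something like this, with autostop.py from the previous section bundled in rather than downloaded:

stop-ssm.sh          # run as root by on-start.sh
persist-config.sh    # run as ec2-user by on-start.sh
autostop.py          # invoked every five minutes by cron (see Autostop above)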

  • The || : is there to ensure a successful return status for the script, because an unsuccessful script execution prevents the instance from booting. Whether the exit code of the child process impacts that of the parent depends on your bash settings, but I tend to run mine with “robust” settings, set -exuo pipefail. Being explicit and adding the “or true” when I want to ensure success is, I think, a good habit.
  • Splitting up the script can help with development / debugging, as you can more easily run the different parts of the scripts in isolation.
  • Although base64-encoding adds overhead, this inefficiency is more than offset by gzip compression; encoding like this allows for longer startup scripts. You can probably use more efficient encodings too; let me know in the comments if you tried this!

Deploying scripts

Lifecycle configuration is easily updated through the AWS CLI; you do have to know to base64-encode your on-start.sh and on-create.sh scripts:

aws sagemaker update-notebook-instance-lifecycle-config \
  --notebook-instance-lifecycle-config-name "$LIFECYCLE_CONFIG_NAME" \
  --on-start Content="$(base64 -w 0 on-start.sh)" \
  --on-create Content="$(base64 -w 0 on-create.sh)"

The command for creating a new lifecycle configuration is similar.
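
For completeness, a sketch of that command, reusing the same names:

aws sagemaker create-notebook-instance-lifecycle-config \
  --notebook-instance-lifecycle-config-name "$LIFECYCLE_CONFIG_NAME" \
  --on-start Content="$(base64 -w 0 on-start.sh)" \
  --on-create Content="$(base64 -w 0 on-create.sh)"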

Of course, you can also set the configuration through the console, with the aws_sagemaker_notebook_instance_lifecycle_configuration resource in Terraform, and so on.

More references

AWS hosts a collection of lifecycle configuration scripts and useful instructions on their GitHub account, in the amazon-sagemaker-notebook-instance-lifecycle-config-samples repository under aws-samples.

I also found another page with a useful collection of SageMaker tips and tricks.

I work at Data Minded, an independent data engineering and data analytics consultancy based in Leuven. Contact us if you’d be interested in working together!
