Backing up and restoring SageMaker Notebook Instances

How to back up and restore the contents of SageMaker Notebook instances to and from S3, for example when migrating to Amazon Linux 2.

Jonathan Merlevede
datamindedbe
8 min read · Sep 20, 2021


This story lists three (four) options for storing and restoring the contents of the persisted volumes attached to SageMaker Notebook Instances using only preinstalled tools. Such a “backup and restore” operation might be necessary in the context of migrating your notebooks to Amazon Linux 2:

  • Option 1: aws s3 sync
  • Option 2: Block-level copy using dd (optional encryption/compression)
  • Option 3: Tarball (optional encryption/compression)
  • Option 4: Re-create your environment

All of these can be combined with startup scripts for automation.


For migration to Amazon Linux 2, I think option three, an encrypted tarball backup, is your best bet. Skip ahead to that section if you are in a hurry.

Amazon Linux 2

There are different reasons for wanting to backup and restore your SageMaker notebook instances. A particularly current one is the recently announced, long-awaited support for Amazon Linux 2. Even if you do not particularly care for its new features, you will likely want to upgrade instances over the coming months, as updates for instances running on Amazon Linux (1) will stop on April 18, 2022.

The upgrade path offered by Amazon is to stop, remove and re-create existing instances. Unfortunately, this means losing the contents of your persisted (EBS) volume, as the volume is managed by AWS and its life is tied to that of the instance it is attached to. Because you cannot simply detach and re-attach the volume, you need to perform a backup and restore.

Option 1: Backing up and restoring using aws s3 sync

One solution is to use aws s3 sync to copy the contents of your ~/SageMaker directory to S3 and back. To more-or-less automate the process, you can add lifecycle configuration scripts into the mix.

In a nutshell, this is the solution suggested by Amazon in the context of Amazon Linux 2 migration. AWS provides a pair of lifecycle scripts calling aws s3 sync that back up and restore the contents of your volume on start. The procedure is described in detail in a post on the AWS blog, so I will not repeat it here.
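
To give a rough idea of what those scripts revolve around, here is a minimal sketch of the two sync commands only; the bucket name and prefix are placeholders, and the official AWS lifecycle scripts add logging, background execution and more:

# Placeholder bucket and prefix; adapt to your own setup
BUCKET="mybucket"
PREFIX="notebook-backups/my-notebook"

# On the old instance: back up the persisted volume to S3
aws s3 sync /home/ec2-user/SageMaker "s3://$BUCKET/$PREFIX" --exclude "lost+found/*"

# On the new (Amazon Linux 2) instance: restore it
aws s3 sync "s3://$BUCKET/$PREFIX" /home/ec2-user/SageMaker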

Considerations

Although aws s3 sync is an awesome tool in general, it is not suitable for all environments. Possible disadvantages in the context of notebook backups include:

  • Notebooks may contain many small files, resulting in lots of S3 operations and slow(er) syncs.
  • As S3 is not a Linux filesystem, file ownership and file permissions will not survive this procedure. Other metadata like the last modified timestamp will also disappear.
  • Hard-linked files will no longer be hard-linked after backing up and restoring. This may be a problem when persisting Conda environments, as Conda uses hard links wherever possible.
  • Persisted volumes may contain data the notebook’s users consider sensitive, while access to data on S3 may not be appropriately scoped. The AWS S3 CLI does not support client-side encryption.

The last point may be fairly arcane, but in my environment, many notebook instances run using the same execution role while access to notebooks is scoped to specific users.

Option 2: Backing up and restoring at the block level

An alternative to aws s3 sync is to create a block-level copy of the EBS volume, stored on S3 as a single large (optionally encrypted) file, and to restore from that copy. Hard links remain in place, and backups can be written and restored at high throughput.

If you want to automate migration using lifecycle scripts, you can start from the AWS lifecycle scripts linked above and modify them slightly, replacing the aws s3 sync commands with the ones below.

To store an encrypted image on S3

dev=`mount | grep SageMaker | grep "type ext4" | cut -d" " -f1`
siz=$(($(sudo blockdev --getsize64 $dev) + 1000000))
pw="mysecurepassword"
dst="mybucket/mylayer/mybackup.img.enc"
sudo mount -o remount,ro $HOME/SageMaker
sudo dd if=$dev bs=5M status=progress \
| openssl enc -e -aes-256-cbc -pass "pass:$pw" -iter 100000 \
| aws s3 cp - s3://$dst --expected-size=$siz
sudo mount -o remount,rw $HOME/SageMaker
  • Set $dst to where you want to store the backup on S3.
  • Set $pw to the password you want to use to encrypt the image.
  • The variable $dev resolves to the name of the block device backing the SageMaker volume (e.g. /dev/nvme1n1).
  • The variable $siz resolves to the size of $dev in bytes, with a bit of buffer; it is passed to aws s3 cp through the --expected-size flag.
  • Make the volume read-only while creating the backup. Jupyter will report “errors while saving” open files while the backup is being created.
  • Leave out the openssl step if you are not interested in encryption; add a compression step if you are interested in compression (see the sketch and the note on compression below).
  • As per the S3 cp docs, size detection and the --expected-size flag are required only when the uploaded stream is larger than 50 GB.
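
For illustration, a compressed variant of the same backup pipeline could look as follows. This is only a sketch: it assumes the variables from the snippet above are already set and that pigz is available, and the corresponding restore pipeline then needs a matching pigz -d stage between openssl and dd:

# Same backup as above, with a pigz compression stage added before encryption
sudo dd if=$dev bs=5M status=progress \
| pigz \
| openssl enc -e -aes-256-cbc -pass "pass:$pw" -iter 100000 \
| aws s3 cp - s3://$dst --expected-size=$siz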

To restore the encrypted image from S3

src="mybucket/mylayer/mybackup.img.enc"
pw="mysecurepassword"
dev=`mount | grep SageMaker | grep "type ext4" | cut -d" " -f1`
sudo mount -o remount,ro $HOME/SageMaker
aws s3 cp s3://$src - \
| openssl enc -d -aes-256-cbc -pass "pass:$pw" -iter 100000 \
| sudo dd of=$dev bs=5M status=progress
sudo mount -o remount,rw $HOME/SageMaker
  • Set $src to where you stored the backup ($dst in the backup command)
  • Set $pw to the password you used to encrypt the image

If your source image was created on Amazon Linux (1) and you are restoring to Amazon Linux 2, you will want to change file ownership, as ec2-user has UID 500 on AL1 and UID 1000 on AL2:

sudo find $HOME/SageMaker -uid 500 -exec chown ec2-user: {} \;

Considerations

None of the disadvantages cited for aws s3 sync apply to block-level backups. However, these come with their own set of limitations:

  • You cannot do incremental backups.
  • The source and destination volumes have to be the same size.
  • The size of your backup will always be the size of your volume, not just the part of it that you actually use. This can make the procedure slow and inefficient if notebooks are attached to oversized volumes.
  • This disadvantage is exacerbated by the fact that reads from parts of the EBS volume that have never been read from (or written to) are quite slow (20–30 MB/s versus 140 MB/s for previously read parts). This is behavior I observed (September 2021, Paris region) but did not expect, as EBS volumes that are created empty should no longer have to be pre-warmed or initialized.

Option 3: Backing up and restoring using a tarball

A third option is to stream the contents of the EBS volume into a single, large, optionally encrypted and optionally compressed TAR file (tarball) stored on S3.

To store an encrypted, compressed tarball on S3

dst="mybucket/mylayer/mybackup.tar.gz.enc"
pw="mysecurepassword"
dev=`mount | grep SageMaker | grep "type ext4" | cut -d" " -f1`
siz=$(( $(df -B1 $dev | tail -1 | tr -s ' ' | cut -d ' ' -f3) * 115 / 100 ))
sudo mount -o remount,ro $HOME/SageMaker
cd $HOME/SageMaker
# List files to back up, skipping caches; --no-recursion makes tar archive only
# the listed paths (find already lists every file), avoiding duplicate entries
find -P . -mindepth 1 -xdev -type d \( -name "lost+found" -o -name ".vscode-server" -o -name ".cache" \) -prune -o -print0 \
| sudo tar --null --no-recursion --files-from - -cf - \
| dd status=progress \
| pigz \
| openssl enc -e -aes-256-cbc -pass "pass:$pw" -iter 100000 \
| aws s3 cp - s3://$dst --expected-size=$siz
sudo mount -o remount,rw $HOME/SageMaker
  • Adapt the find command to exclude the files you do not need (or include ones you do).
  • Set $dst to where you want to store the backup on S3.
  • Set $pw to the password you want to use to encrypt the tarball.
  • The -xdev flag prevents copying files that are not stored on the persisted volume (e.g. when you mount remote filesystems on sub-paths of ~/SageMaker).
  • We make the volume read-only while creating the backup. Jupyter will report “errors while saving” open files while the backup is being created.
  • Leave out the openssl step if you are not interested in encryption; leave out the pigz step if you are not interested in compression (see the note on compression below).
  • As per the S3 cp docs, size detection and the --expected-size flag are required only when the uploaded stream is larger than 50 GB.
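
Before removing the old instance, it can be reassuring to check that the tarball on S3 is actually readable. A minimal sketch, assuming $dst and $pw as set above (note that this streams through the entire backup):

# Stream the backup through the same decryption and decompression stages and
# list its contents; a complete listing without errors means the archive is readable
aws s3 cp s3://$dst - \
| openssl enc -d -aes-256-cbc -pass "pass:$pw" -iter 100000 \
| pigz -d \
| tar -tf - > /tmp/backup-contents.txt && wc -l /tmp/backup-contents.txt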

To restore an encrypted, compressed tarball from S3

src="mybucket/mylayer/mybackup.tar.enc"
pw="mysecurepassword"
cd $HOME/SageMaker
aws s3 cp s3://$src - \
| openssl enc -d -aes-256-cbc -pass "pass:$pw" -iter 100000 \
| pigz -d \
| dd status=progress \
| sudo tar -xf - .
  • Set $src to where you stored the backup ($dst in the backup command)
  • Set $pw to the password you used to encrypt the tarball

If your source image was created on Amazon Linux (1) and you are restoring to Amazon Linux 2, you will want to change file ownership, as ec2-user has UID 500 on AL1 and UID 1000 on AL2:

sudo find ~/SageMaker -uid 500 -exec chown ec2-user: {} \;

Alternatively, do not run the tar -xf command as root; then all extracted files will be owned by the user running the tar command (i.e. ec2-user). The approach above keeps root-owned files root-owned, so I consider it marginally better.

Considerations

Creating a backup in the form of an encrypted tarball does not have the disadvantages listed for aws s3 sync. Tar files preserve hard links, file permissions and even file ownership. Compared to the block-level copy, the tarball approach has the advantage that it can be used to reduce the size of your EBS volume and will be faster and cheaper in the case of oversized volumes. Remaining limitations:

  • Cannot be used to do incremental backups
  • Without compression, TAR backup speeds are IO-bound. With compression, they are CPU-bound. See below for a note on compression.

I think tarring your EBS volume is the best way to implement migration to Amazon Linux 2.

Note on compression

I quite arbitrarily chose to apply compression in option 3 and not in option 2. Both options can be implemented with and without compression, and with and without encryption.

Compression significantly reduces the size of a backup, but also slows down the speed at which backups are created. Use different compression algorithms to strike a different balance between speed and size (e.g. xz, lz4):

  • Choose xz for the smallest sizes, but slow compression
  • Use lz4 for the fastest compression speeds, but larger backups

Gzip compression with pigz happens at about 30 MB/s on a small ml.t3.medium instance and, in my opinion, strikes a good balance between speed and compression ratio. You can expect at least a 30% reduction in size.

Decompressing a backup is always much faster than compressing it.
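
Swapping compressors in the option 3 pipelines is just a matter of replacing pigz with, say, xz -T0 in the backup command and pigz -d with xz -d in the restore command. If you want to get a feel for the trade-off first, a quick-and-dirty comparison on a sample file might look like this (the file path is hypothetical, and lz4 may not be preinstalled):

f=$HOME/SageMaker/some-large-file      # hypothetical sample file
time pigz -c "$f" | wc -c              # balanced speed and compression ratio
time xz -T0 -c "$f" | wc -c            # smallest output, slowest (-T0 uses all cores)
time lz4 -c "$f" | wc -c               # fastest compression, largest output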

Option 4: Re-create the environment

The instance’s persisted contents should almost always also be saved outside of the EBS volume:

  • Data files should probably be on S3
  • Code (including notebooks) should probably be in version control.
  • A description of the environment required to run the code should be saved with the code (e.g. in the form of a requirements.txt file).
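
For the last point, a minimal sketch of what capturing and re-creating a pip-based environment could look like; the project path is illustrative, and conda users would rather use conda env export and conda env create:

# On the old instance: capture installed packages alongside the code
source activate python3                                # one of the preinstalled conda envs
pip freeze > ~/SageMaker/my-project/requirements.txt

# On the new (Amazon Linux 2) instance: re-create the environment
source activate python3
pip install -r ~/SageMaker/my-project/requirements.txt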

If this is the case, additional backups may be somewhat superfluous. Think of instances as transient, and don’t spend too much time on backing them up.

For migration as required for Amazon Linux 2, a backup and restore might be quicker and easier than setting up your environment again, but it should not be necessary.
