Data Engineering Learnings: Retaining Files in SageMaker Notebook Servers

Orion Delwaterman · Published in In the weeds · 4 min read · Jan 13, 2020

Last year at Greenhouse, we doubled the size of our Data Science team. With a larger team, we're looking to tackle bigger problems, provide more insights, and ship more machine learning-related products in the near future. As part of that expansion, the Tools and Operations team has taken a larger focus on Data Engineering work. We have been building out more tools and optimizing workflows to maximize the output of all that delicious data work. This is a quick little story about a small detour we took while working with AWS SageMaker Notebooks.

A large part of Data Science development is running experiments with our data: building simple charts, doing complicated Bayesian statistics, or even prototyping ML models for making various predictions. The tool of choice these days is the Jupyter Notebook, which provides a REPL and some nice graphical tools that make this experimentation process faster for our Data Scientists. I won't go into much detail about these, as this story isn't really about the notebooks themselves.

As good stewards of data, we have to make sure that these experiments are done responsibly. We're dealing with real data, so we need to build a secure environment for these Notebook servers to run in, one with the correct security and auditing tools in place to keep the data safe. Being an AWS shop, we chose to work with SageMaker Notebook servers, which got us 85% of the way there. Beyond the branding, a Notebook server is just an EC2 instance with a Jupyter Notebook server running on it.

We customized our Notebook servers with a GitHub integration, some standard aliases, and various checks and protections to make sure no data leaks out into the world. We built all of this customization using Lifecycle Hooks, tested it all, and turned it over to the Data Science team. And that's where they encountered the problem.
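
For anyone curious what that wiring looks like, here is a minimal sketch of attaching a setup script to a lifecycle hook with boto3. The config name, the script body, and the download URL are all stand-ins rather than our actual setup:

```python
import base64

import boto3

# Stand-in for our real setup script: install a pre-commit binary
# somewhere on the instance's PATH.
SETUP_SCRIPT = """#!/bin/bash
set -e
curl -sSL https://example.com/pre-commit -o /usr/local/bin/pre-commit
chmod +x /usr/local/bin/pre-commit
"""

sagemaker = boto3.client("sagemaker")

# Lifecycle hook scripts are passed to SageMaker as base64-encoded strings.
sagemaker.create_notebook_instance_lifecycle_config(
    NotebookInstanceLifecycleConfigName="ds-notebook-setup",
    OnCreate=[{"Content": base64.b64encode(SETUP_SCRIPT.encode()).decode()}],
)
```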

Because our Data Scientists don't work 24 hours a day, there are many stretches when the notebook servers are doing nothing. These aren't web servers, which need to be up for any customer at any time of night; they're dev machines, and only need to be on when crunching the numbers. To save the planet (and money), our team turns off the notebook servers when they're not needed. On Friday afternoons, we send out a Slack message to the team, who then shut down their notebooks and boot them back up on Monday morning.
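
In boto3 terms, that weekend routine is just a stop call and a start call; the instance name here is illustrative:

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Friday afternoon: stop the instance so it isn't billed over the weekend.
sagemaker.stop_notebook_instance(NotebookInstanceName="ds-experiments")

# Monday morning: boot it back up.
sagemaker.start_notebook_instance(NotebookInstanceName="ds-experiments")
```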

But that first Monday back, something funny happened. The Data Science team started getting errors from git when they tried to commit the changes they had made to their notebooks. Git complained that our pre-commit binary, which prevents data from being checked in, was missing. I immediately opened a terminal on the instance, ran the which command, and sure enough, the binary we had installed with our hooks was just gone. I was left scratching my head.
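
The check itself is trivial. In Python form it amounts to something like this, with "pre-commit" standing in for the binary's actual name:

```python
import shutil

# Equivalent of running `which pre-commit` in the instance's terminal.
path = shutil.which("pre-commit")
print(path if path else "pre-commit: not found on PATH")
```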

We began by examining our lifecycle hooks, which we used to set up the server. AWS SageMaker Notebook Servers support two types of Lifecycle Hooks: Create Notebook and Start Notebook. The Create Notebook hook runs only once, when the notebook server is created, while the Start Notebook hook runs every time the server is booted. We set our scripts to run on the Create Notebook hook, since we only needed to install the binaries once when the machine was built. After all, turning off the server shouldn't erase files, should it? In fact, I had tested this: I had created a sample notebook on a Notebook server, turned it off and on again, and found the notebook still there. None of this made sense.
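
One sanity check worth knowing about: you can ask SageMaker which hook your scripts are actually attached to. The describe call returns both lists, with each script's content base64-encoded; the config name is again a stand-in:

```python
import base64

import boto3

sagemaker = boto3.client("sagemaker")

config = sagemaker.describe_notebook_instance_lifecycle_config(
    NotebookInstanceLifecycleConfigName="ds-notebook-setup",
)

# Print the first line of every script attached to each hook type.
for hook in ("OnCreate", "OnStart"):
    for script in config.get(hook, []):
        body = base64.b64decode(script["Content"]).decode()
        print(f"{hook}: {body.splitlines()[0]}")
```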

But then it occurred to me that maybe turning off a Notebook Server wasn't analogous to just powering down a machine. I opened a terminal on the Notebook Server and did some investigating. On Notebook Servers, all of the notebook files are stored in the /home/ec2-user/SageMaker directory. I decided to try an experiment: I put three empty files into different directories (/, /home/ec2-user, and /home/ec2-user/SageMaker) and then rebooted the machine to see if any of them would disappear. Sure enough, the files in / and /home/ec2-user disappeared, while the file in /home/ec2-user/SageMaker remained.
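
If you want to reproduce the experiment, a sketch like this works: run plant() before shutting the instance down, then check() after it boots back up. The file names are arbitrary, and writing to / may require elevated permissions depending on the user:

```python
import os

# The three locations from the experiment.
MARKERS = [
    "/canary",
    "/home/ec2-user/canary",
    "/home/ec2-user/SageMaker/canary",
]

def plant():
    """Run before stopping the instance: drop an empty marker file in each spot."""
    for path in MARKERS:
        open(path, "w").close()

def check():
    """Run after the instance boots back up: report which markers survived."""
    for path in MARKERS:
        print(path, "->", "survived" if os.path.exists(path) else "gone")
```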

It turns out that the SageMaker service doesn't preserve the entire machine image when you turn off a Notebook; it only saves the notebook files, which all live under /home/ec2-user/SageMaker. This actually makes a lot of sense. If AWS wants to make a change to the machine image behind SageMaker, it doesn't have to modify each user's machine image. It just preserves the one directory (the one with the notebooks) and builds the whole machine fresh each time. If AWS wants to apply an update, it can simply reboot all the users' machines.

The solution was simple: move all of our hooks to the Start Notebook Lifecycle hook. Now we treat every boot of a notebook server like a fresh install. It just goes to show that no abstraction is perfect. We assumed the AWS machine was like any other server on the platform because that fit the mental model, but sometimes you have to dig down into the details to figure out what's actually going on.
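
The change itself, in the same illustrative boto3 sketch as before, is just moving the script's content from OnCreate to OnStart. Because the Start Notebook hook fires on every boot, the script needs to be safe to re-run, and it has to finish within SageMaker's time limit for lifecycle scripts:

```python
import base64

import boto3

# Same stand-in setup script as before; it must be safe to re-run,
# since the Start Notebook hook fires every time the server boots.
SETUP_SCRIPT = """#!/bin/bash
set -e
curl -sSL https://example.com/pre-commit -o /usr/local/bin/pre-commit
chmod +x /usr/local/bin/pre-commit
"""

sagemaker = boto3.client("sagemaker")

# Move the script from the create hook to the start hook.
sagemaker.update_notebook_instance_lifecycle_config(
    NotebookInstanceLifecycleConfigName="ds-notebook-setup",
    OnCreate=[],
    OnStart=[{"Content": base64.b64encode(SETUP_SCRIPT.encode()).decode()}],
)
```

Anything that lives under /home/ec2-user/SageMaker survives a stop/start cycle on its own; everything else gets rebuilt here.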
