Learning Machine Learning on the cheap: Persistent AWS Spot Instances

The bill came in on a cold, rainy November morning. “I could have bought a half-decent GPU with this,” I thought. Renting a top-notch GPU from Amazon is great for working on ML problems, but expensive. Here is what I learned on my quest to slash costs by 80%.

Table of Contents

1. Run a Spot Instance for ML
1.1 Tools Needed
1.2 Virtual Private Cloud (VPC)
1.3 Create the Instance
1.4 Login and test
2. Persistence for Spot Instances: Approach 1 — Attached volume
2.1 Create a volume
2.2 Attach the volume to the instance
2.3 Mount the volume to the instance
3. Persistence for Spot Instances: Approach 2 — Swap root volume
3.1 With a new instance
3.2 Using an existing instance
4. Stop a Spot instance

In general, Amazon Web Services (AWS) offers several different ways to use a virtual machine (instance) in the cloud:

  • On Demand Instances — Rent an instance with specific capacity (CPU, memory, etc.). When no longer needed, you can power it down (“stop” it). Later you can “Start” it again and pick up where you left off. Only running instances are billed.
  • Reserved Instances — You commit to using and paying for a number of instances for a certain period of time (usually 1–3 years). About 50% cheaper than On Demand Instances.
  • Spot instances — Use the spare computing capacity that Amazon has at any given moment. You bid on this capacity and usually get it for much cheaper than On Demand and even Reserved instances.

Looking at my AWS bill, it was clear that the greatest cost was actually running the instances. A distant second was storage cost. This is pretty fortunate, since we have much cheaper Spot instances available if we are willing to work around their quirks. How much cheaper Spot instances are varies, but I regularly use P2 Spot instances at about 70–80% savings (see table below).
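
The savings figure is simple arithmetic: (On Demand price minus Spot price) divided by the On Demand price. Here is a minimal sketch of that calculation. The two hourly prices are illustrative placeholders I made up for the example, not quotes from AWS, and savings_percent is a hypothetical helper name.

```shell
# savings_percent ON_DEMAND_PRICE SPOT_PRICE
# Prints the percentage saved by paying SPOT_PRICE instead of
# ON_DEMAND_PRICE, rounded to a whole number. Prices are USD per hour.
savings_percent() {
    awk -v od="$1" -v spot="$2" \
        'BEGIN { printf "%.0f\n", (od - spot) / od * 100 }'
}

# Illustrative prices only (assumed, not real quotes).
savings_percent 0.90 0.20   # prints 78
```

Plug in the current prices for your region and instance type to see what you would actually save.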

Spot Instances Caveats

Spot instances are much cheaper but come with a few caveats:

  • Spot instances cannot be stopped and started again; they can only be terminated. This is a big one. Terminating an instance destroys it, so when you start a new instance to resume work, you lose any changes you’ve made. You can keep the volume/disk that the instance was operating off, but you can’t easily start a new instance out of it.
  • You might be outbid at any time. When you are outbid, your Spot instance is terminated automatically. However, in several months of using Spot instances, I only had this happen once.

Once I decided to work with Spot instances, I tried a few different approaches to working around their weaknesses.

1. Run a Spot Instance for ML

First things first. Let’s learn how to create a spot instance where we will be able to develop and run ML models. We want to use P2 instances. They come with one or more powerful NVIDIA K80 GPUs with lots of memory (11 GB) to test and train your models on. P2 comes in 3 sizes:

Let’s see how we can actually get one ourselves.

1.1 Tools Needed

  • Install AWS cli. AWS cli is a command line utility that can be used instead of the web-based AWS Console to manage AWS services.
  • Then run aws configure to set your key, secret and region. Regions that AWS supports P2 instances in are N. Virginia (us-east-1), Oregon (us-west-2) and Ireland (eu-west-1). Usually you want to choose the region that’s geographically closest to you.
  • Finally, we download helper scripts that will assist us in the setup:
git clone --depth=1 https://github.com/slavivanov/ec2-spotter.git
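
Before moving on, it can save some head-scratching to confirm that the tools are actually on the PATH. A minimal sketch follows; missing_tools is a hypothetical helper name, and jq is included because section 3 needs it.

```shell
# missing_tools TOOL...
# Prints the name of every listed tool that is not found on the PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Tools this guide relies on: the AWS cli, git, and (for section 3) jq.
missing=$(missing_tools aws git jq)
if [ -n "$missing" ]; then
    echo "Please install: $missing"
else
    echo "All tools found."
fi
```

Once the AWS cli is installed, aws configure list shows which key and region are currently active.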

1.2 Virtual Private Cloud (VPC)

Before we can start any P2 instances, we need to set up a Virtual Private Cloud (VPC), which is just a fancy virtual network to launch your virtual machine in. Setting up a VPC can be a little intimidating. It certainly was for me when I first did it, and the details are still a bit fuzzy. The good news is that it has to be done only once. One way to approach this is to follow Amazon’s guide.
A better approach would be to use scripts adapted from Fast.ai’s course Deep Learning For Coders. If you got the helper scripts from “Tools Needed” above, simply run the following:

. ec2-spotter/fast_ai/create_vpc.sh

This will create a VPC, Internet Gateway, Subnet, Route Table, Security Group and, most importantly, a Key Pair. We will use the newly created key (located at ~/.ssh/aws-key-fast-ai.pem) to connect to the instance we are about to create. It will also print the IDs of our newly created Subnet and Security Group. We’ll need these for the next step.

1.3 Create the Instance

We can follow Amazon’s instructions for launching a spot instance. But because we are cooler than that, we could use a little helper script named start_spot_no_swap.sh to launch the instance.

We need to pass it the following arguments:

  • ami — Depending on which region we have picked and whether we want to use the Fast.ai image or the Amazon one, we need to select an image:
  • subnetId — Use the subnet ID that create_vpc.sh printed.
  • securityGroupId — Use the security group ID that create_vpc.sh printed.

For example:

. ec2-spotter/fast_ai/start_spot_no_swap.sh --ami ami-53b23433 --subnetId subnet-9f69c3d6 --securityGroupId sg-a62f2ede

The script will then print the IP of our new Spot instance.

If we want, we might also pass the following: volume_size (size of the root volume, in GB. Default 128), key_name (name of the key file we’ll use to log into the instance. Default: aws-key-fast-ai), ec2spotter_instance_type (type of instance to launch. Default p2.xlarge), bid_price (The maximum price we are willing to pay (USD). Default 0.5).
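
To pick a sensible bid_price, it helps to know what Spot capacity currently costs. The sketch below wraps the aws ec2 describe-spot-price-history command. The function name current_spot_prices is hypothetical, and the "Linux/UNIX" product description is an assumption on my part; depending on your account, a different description may apply.

```shell
# current_spot_prices INSTANCE_TYPE
# Lists the Spot price currently in effect per availability zone, for the
# region set up with `aws configure`.
current_spot_prices() {
    aws ec2 describe-spot-price-history \
        --instance-types "$1" \
        --product-descriptions "Linux/UNIX" \
        --start-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
        --query 'SpotPriceHistory[].[AvailabilityZone,SpotPrice]' \
        --output text
}

# Usage (needs configured AWS credentials):
#   current_spot_prices p2.xlarge
```

A bid a bit above the listed price leaves some headroom before you are outbid.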

1.4 Login and test

Using the IP of the Spot instance from the previous step, we can connect via ssh:

ssh -i ~/.ssh/aws-key-fast-ai.pem ubuntu@$instance_ip
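
If the IP printed by the launch script has scrolled away, it can be looked up again. This is a rough sketch, assuming the instance carries a Name tag of fast-ai-gpu-machine (the name the helper scripts refer to in section 3). The function name instance_ip_by_name is hypothetical.

```shell
# instance_ip_by_name NAME
# Prints the public IP of running instances whose Name tag equals NAME.
instance_ip_by_name() {
    aws ec2 describe-instances \
        --filters "Name=tag:Name,Values=$1" \
                  "Name=instance-state-name,Values=running" \
        --query 'Reservations[].Instances[].PublicIpAddress' \
        --output text
}

# Usage (needs configured AWS credentials):
#   instance_ip=$(instance_ip_by_name fast-ai-gpu-machine)
#   ssh -i ~/.ssh/aws-key-fast-ai.pem ubuntu@$instance_ip
```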

Now we can develop and test ML models to our heart’s delight. For example, let’s test it with TensorFlow’s tutorial on MNIST:

python src/tensorflow/tensorflow/models/image/mnist/convolutional.py


I tensorflow/core/common_runtime/gpu/gpu_device.cc:885] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:1e.0
Total memory: 11.17GiB
Free memory: 11.11GiB
Step 0 (epoch 0.00), 771.1 ms
Minibatch loss: 8.334, learning rate: 0.010000
Minibatch error: 85.9%
Validation error: 84.6%
Step 100 (epoch 0.12), 12.2 ms
Minibatch loss: 3.262, learning rate: 0.010000
Minibatch error: 6.2%
Validation error: 7.3%

All seems good!

We have a Virtual Private Cloud, a Spot instance running in it for a fraction of the price, and even a model training. But all is not roses!

What if, with hard work and wit, we manage to get the above MNIST script to beat state-of-the-art accuracy, and then we shut our instance down for the night? Our great model would be lost. We need to find a way to persist the data on our Spot instances. Luckily, we found two.

2. Persistence for Spot Instances: Approach 1 — Attached volume

We are going to look at the two main approaches I’ve used to persist changes to Spot instances. The first approach uses a separate volume, where your models and data are kept. This volume is attached to your Spot instance after you start it up. Instructions on how to do this follow:

First off, start a Spot instance. (see “Run a Spot Instance for ML” above).

2.1 Create a volume

(only do this once). We can do it via AWS cli or the web-based AWS Console.

Create a volume with AWS Console: Open the AWS Console, select EC2, then:

Step 1. Open “Volumes”

Step 2. Click the “Create Volume” button

Step 3. Make sure that the Volume Type is “General Purpose SSD (GP2)”. A fast SSD drive helps us when we constantly need to access the disk, e.g. when our data is too big to fit in memory.

Step 4. Choose an appropriate size of the volume. I usually use 100 GB.

Step 5. Make sure the Availability Zone is the same as the one the instance is in.

Create a volume with aws cli:
In the below bash command, change the size of the volume as needed and set the availability zone to the one the instance is in.

aws ec2 create-volume --size $volume_size --availability-zone $availability_zone --volume-type gp2 --output text --query 'VolumeId'
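
Getting the availability zone wrong is an easy mistake, so one option is to read it from the instance itself. A minimal sketch, with hypothetical helper names instance_zone and create_data_volume:

```shell
# instance_zone INSTANCE_ID
# Prints the availability zone the given instance runs in.
instance_zone() {
    aws ec2 describe-instances \
        --instance-ids "$1" \
        --query 'Reservations[0].Instances[0].Placement.AvailabilityZone' \
        --output text
}

# create_data_volume INSTANCE_ID SIZE_GB
# Creates a gp2 volume in the instance's zone and prints the new volume ID.
create_data_volume() {
    zone=$(instance_zone "$1") || return 1
    aws ec2 create-volume \
        --size "$2" \
        --availability-zone "$zone" \
        --volume-type gp2 \
        --output text --query 'VolumeId'
}

# Usage (needs configured AWS credentials):
#   volume_id=$(create_data_volume "$instance_id" 100)
```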

2.2 Attach the volume to the instance

Again, we can use the cli or the web Console.

Attach the volume with aws cli: Volume ID was printed by the create-volume command in the previous step. If we don’t know our instance ID yet, we can review our instances.

aws ec2 attach-volume --volume-id <value> --instance-id <value> --device /dev/sdh
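
When scripting this, the attach call can race with a volume that is still being created or detached. The sketch below uses the cli's built-in waiters around the same attach-volume command. The function name attach_data_volume is hypothetical, and the device defaults to /dev/sdh as in the command above.

```shell
# attach_data_volume VOLUME_ID INSTANCE_ID [DEVICE]
# Waits until the volume is free, attaches it, then waits until the
# volume is reported as in use. DEVICE defaults to /dev/sdh.
attach_data_volume() {
    device=${3:-/dev/sdh}
    aws ec2 wait volume-available --volume-ids "$1" &&
    aws ec2 attach-volume --volume-id "$1" --instance-id "$2" \
        --device "$device" &&
    aws ec2 wait volume-in-use --volume-ids "$1"
}

# Usage (needs configured AWS credentials):
#   attach_data_volume "$volume_id" "$instance_id"
```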

Attach the volume with the AWS Console: Open the AWS Console, select EC2, then:

Step 1. Select the Volume we just created (see the created date if unsure).

Step 2. Open “Actions”

Step 3. Click “Attach Volume”

Step 4. Click the Instance field. A list of instances will pop up.

Step 5. Select the instance to attach the volume to by clicking it. 
We’ll leave the Device field at the default value

Step 6. Confirm with the “Attach” button.

2.3 Mount the volume

The steps below are from Amazon’s own tutorial on “Making an Amazon EBS Volume Available for Use”:

Step 1. SSH into your instance. 
ssh -i ~/.ssh/aws-key-fast-ai.pem ubuntu@$instance_ip

Step 2. Run lsblk to see under what name the volume was attached. On Ubuntu it usually shows up as “xvdf” (or “xvdh” if it was attached as /dev/sdh, as in the cli command above) and is the last entry in the list.

Step 3. If we just created our new volume, we’ll need to format it with a file system. 
CAUTION: Only do this on a newly created volume, otherwise you will erase all data on it. 
Run: sudo mkfs -t ext4 device_name where device_name is “/dev/” plus the device name from step 2. For example: /dev/xvdh

Step 4. Create a directory to mount the volume: sudo mkdir mount_point

Step 5. Finally, mount the volume: sudo mount device_name mount_point

Now, anything you put in the mount_point dir will be stored on the attached volume. This means that you can terminate the Spot instance, then start a new one later, attach and mount the volume as described above and continue where you left off. You might want to automate attaching and mounting the volume with aws-cli and crontab.
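
As a starting point for that automation, steps 3 to 5 can be folded into one function that runs on the instance. This is only a sketch: mount_data_volume is a hypothetical name, and it relies on file -s reporting plain "data" for a device with no filesystem, which is the check Amazon's tutorial uses. Double-check the device name before running anything that may format a disk.

```shell
# mount_data_volume DEVICE MOUNT_POINT
# Formats DEVICE with ext4 only if it holds no filesystem yet, then
# creates MOUNT_POINT if needed and mounts the device there.
mount_data_volume() {
    device=$1
    mount_point=$2
    # `file -s` prints plain "data" for a device without a filesystem.
    if [ "$(sudo file -s -b "$device")" = "data" ]; then
        sudo mkfs -t ext4 "$device" || return 1
    fi
    sudo mkdir -p "$mount_point" &&
    sudo mount "$device" "$mount_point"
}

# Usage on the instance (device name taken from lsblk; ~/data is a
# hypothetical mount point):
#   mount_data_volume /dev/xvdh ~/data
```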

I used this approach for a while, and it worked fine, but it had a big drawback — anything you do outside of the persistent volume is lost when you terminate the instance. You can try to move your user folder to the attached volume, but this didn’t work out for me. 
Instead, the next approach worked beautifully.

3. Persistence for Spot Instances: Approach 2 — Swap root volume

I quickly grew tired of installing this and that every time a new Spot instance was started, so I looked for an alternative. Starting with this approach, I finally managed to make Spot instances behave similarly to On Demand ones. This works by swapping the root volume (where the operating system runs) for another volume right after booting up.
Since I spent quite some time on making this work, I hope this can be helpful to others as well. Use this script with a new instance, or an existing one.

Step 0.

  • Make sure you have jq installed.
  • Make sure you have downloaded the helper scripts. If not run:

git clone --depth=1 https://github.com/slavivanov/ec2-spotter.git

3.1 With a new instance

Step 1. Start a Spot instance (see “Run a Spot Instance for ML” above).

Step 2. Run sh ec2-spotter/fast_ai/config_from_instance.sh. It creates a config file for launching a Spot instance from an existing Spot or On Demand instance named fast-ai-gpu-machine (this is what our new instance is named in Step 1). Instead of finding the instance by name, you can pass the script an instance ID, like this: sh config_from_instance.sh --instance_id i-0fd47cabf6ce1d534
CAUTION: The script will also terminate the instance from Step 1. If you have other instances launched named fast-ai-gpu-machine, the script might terminate them instead, so rename them before running it.

sh ec2-spotter/fast_ai/config_from_instance.sh

Step 3. Then every time you need a P2 Spot instance, just run:

sh fast_ai/start_spot.sh

It will start a new Spot instance, then at boot time it will swap its root volume with the volume of the Step 1 instance. 
It might take a few (2 to 5) minutes to finish.

Now when you terminate the instance and start it later (using start_spot.sh), any changes to the filesystem will persist!

3.2 Using an existing instance

Step 1. Stop your existing instance and detach its root volume.

Step 2. Give the newly detached volume any name.
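
Steps 1 and 2 can also be done from the command line. The sketch below rests on two assumptions: that "name" here means the volume's Name tag, and that the existing instance is one that can be stopped (i.e. not a Spot instance). The function names are hypothetical.

```shell
# root_volume_id INSTANCE_ID
# Prints the ID of the EBS volume attached as the instance's root device.
root_volume_id() {
    root_device=$(aws ec2 describe-instances --instance-ids "$1" \
        --query 'Reservations[0].Instances[0].RootDeviceName' \
        --output text) || return 1
    aws ec2 describe-volumes \
        --filters "Name=attachment.instance-id,Values=$1" \
                  "Name=attachment.device,Values=$root_device" \
        --query 'Volumes[0].VolumeId' --output text
}

# detach_and_name_root INSTANCE_ID VOLUME_NAME
# Stops the instance, detaches its root volume, tags the volume with
# Name=VOLUME_NAME and prints the volume ID.
detach_and_name_root() {
    volume_id=$(root_volume_id "$1") || return 1
    aws ec2 stop-instances --instance-ids "$1" >/dev/null &&
    aws ec2 wait instance-stopped --instance-ids "$1" &&
    aws ec2 detach-volume --volume-id "$volume_id" >/dev/null &&
    aws ec2 wait volume-available --volume-ids "$volume_id" &&
    aws ec2 create-tags --resources "$volume_id" \
        --tags "Key=Name,Value=$2" &&
    echo "$volume_id"
}

# Usage (needs configured AWS credentials; the name is just an example):
#   detach_and_name_root "$instance_id" my-ml-root-volume
```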

Step 3. Create a copy of example.conf named my.conf:

cp ec2-spotter/example.conf ec2-spotter/my.conf

Step 4. Modify the settings inside my.conf. Especially:

ec2spotter_volume_name : the name you gave the volume in step 2.
ec2spotter_launch_zone : the availability zone where you want to launch your instance.
ec2spotter_subnet : ID of the subnet to use.
ec2spotter_security_group : ID of the security group to attach to the instance.
ec2spotter_preboot_image_id : The image to preboot the instance with. Both Amazon and Fast.ai use Ubuntu 16.04 as base for their ML images. So we need to supply the Ubuntu 16.04 ami here: ami-a58d0dc5 (Oregon), ami-405f7226 (Ireland) or ami-6edd3078 (Virginia).

If you don’t know subnet ID yet, you can get it from your subnets. Same for the security group.
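
The same lookups work from the command line. A small sketch with hypothetical wrapper names; each prints one line per resource in the region set up with aws configure.

```shell
# list_subnets
# Prints "subnet ID <tab> availability zone" for every subnet.
list_subnets() {
    aws ec2 describe-subnets \
        --query 'Subnets[].[SubnetId,AvailabilityZone]' --output text
}

# list_security_groups
# Prints "group ID <tab> group name" for every security group.
list_security_groups() {
    aws ec2 describe-security-groups \
        --query 'SecurityGroups[].[GroupId,GroupName]' --output text
}

# Usage (needs configured AWS credentials):
#   list_subnets
#   list_security_groups
```

The availability zone printed next to each subnet is also handy for filling in ec2spotter_launch_zone.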

Step 5. Then every time you need a P2 Spot instance just run

sh fast_ai/start_spot.sh

It will start a new Spot instance, then at boot time it will swap its root volume with the volume of the Step 1 instance. 
It might take a few (2 to 5) minutes to finish.

Now when you terminate the instance and start it later (using start_spot.sh), any changes to the filesystem will persist!

4. Stop a Spot instance

When I’m done for the day, if there are no models training on the box, I shut the instance down (for a Spot instance, that means terminating it) with this:

# Change $instance_id to your instance id obviously.
aws ec2 terminate-instances --instance-ids $instance_id
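
If you plan to start a new instance right away, it can be useful to block until the old one is really gone. A minimal sketch using the cli's instance-terminated waiter; terminate_and_wait is a hypothetical helper name.

```shell
# terminate_and_wait INSTANCE_ID
# Terminates the instance and returns once AWS reports it as terminated.
terminate_and_wait() {
    aws ec2 terminate-instances --instance-ids "$1" >/dev/null &&
    aws ec2 wait instance-terminated --instance-ids "$1" &&
    echo "$1 terminated"
}

# Usage (needs configured AWS credentials):
#   terminate_and_wait "$instance_id"
```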

A month after I switched to Spot instances, the AWS bill came again. This time it was for tens rather than hundreds of dollars. The weather outside was good.
