Deep Learning in the Cloud: TensorFlow on EC2 Spot Instances with Ansible

If you’ve ever tried playing with Deep Learning, you’ll have found out that you’re not going to get very far without a GPU. If you didn’t already happen to have an NVIDIA card, you’re left with a choice between spending many hundreds of dollars up front or renting a GPU from a cloud provider.

Renting from a cloud provider will be more expensive in the long run (I’ve personally spent about a hundred dollars in spot instance costs so far), but if you’re not sure how long you’re going to stick with this and you’re not a gamer, the cloud can be a good way to try it out.

Despite promises from Google, Microsoft & Alibaba to have GPU instances available, AWS currently seems to be the only game in town for actually renting a GPU instance, so that’s what we’re going to use.

You might also be wondering why you would want to use Ansible. Though my main motivation is that I like Ansible, you’re also going to want to turn these instances off when you’re not actively using them, and so you’ll want an easy way to start an instance and set up all your tools.

Step 0: Setup AWS basics

If you don’t already have an AWS account, you will need to set one up and create an AWS Access Key.

Step 1: Install & Configure Ansible

Ansible has setup instructions available here, but the short of it is: if you’re on a *nix-based OS, you can find Ansible in your package manager; if you’re on OS X, you can install it from pip.

Ansible doesn’t run natively on Windows, so if you’re a Windows user like I am, you will probably want to install Bash on Ubuntu on Windows 10; you can find some instructions for that here.

Next, you need to configure Ansible to talk to AWS for you, so place the following files in /etc/ansible:

https://raw.githubusercontent.com/ansible/ansible/devel/contrib/inventory/ec2.py
https://raw.githubusercontent.com/ansible/ansible/devel/contrib/inventory/ec2.ini

Configure Ansible to use these scripts:

export ANSIBLE_HOSTS=/etc/ansible/ec2.py
export EC2_INI_PATH=/etc/ansible/ec2.ini

Configure Ansible/Boto to use your AWS access keys:

export AWS_ACCESS_KEY_ID=XXXXXX
export AWS_SECRET_ACCESS_KEY=YYYYYY
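
If you’d rather not keep keys in environment variables, Boto (which Ansible’s EC2 modules use under the hood) can also read them from a credentials file; a minimal ~/.aws/credentials looks like:

```ini
[default]
aws_access_key_id = XXXXXX
aws_secret_access_key = YYYYYY
```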

You can also disable SSH host key checking to make this more automatic:

export ANSIBLE_HOST_KEY_CHECKING=False

Quickstart Alternative:

If you’ve never used AWS before (or are happy to create a new KeyPair/VPC just for tensorflow) and would like to get up and running asap, run these commands:

ssh-agent bash
git clone https://github.com/kuza55/ansible-examples.git
cd ansible-examples/tensorflow
sudo ansible-galaxy install kuza55.tensorflow
ansible-playbook ec2_res.yml tfkey.yml tensorflow.yml

And you can go to step 5 in 10–15 minutes when your instance is ready.

Step 2: Setup AWS resources

If you’ve used AWS before, you probably already have a keypair, a publicly routable subnet, and a security group; all you will need is a persistent EBS volume to store your data.

If you’re new to EC2 and didn’t use the quickstart option, you can download this playbook and then run:

ansible-playbook ec2_res.yml

This will setup all the necessary AWS resources and print out the relevant IDs, which you should make a note of for Step 4.
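
If you just want a feel for what kind of work the playbook does without opening it, the tasks are roughly of this shape — a hedged sketch, not the playbook’s exact contents; the names and sizes here are illustrative:

```yaml
- hosts: localhost
  connection: local
  gather_facts: False
  tasks:
    # create a keypair for SSHing into instances
    - ec2_key:
        name: tfkey
        region: us-east-1
      register: keypair
    # create a persistent EBS volume to mount at /data
    - ec2_vol:
        region: us-east-1
        zone: us-east-1d
        volume_size: 500
        volume_type: st1
      register: data_vol
    - debug: var=data_vol.volume_id
```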

If you’re wondering about some of the decisions coded into this playbook, you can take a look at the considerations section at the bottom of this post.

Step 3: Setup ssh-agent

So far we have only used Ansible to interact with AWS, but now we’re going to use it to request EC2 instances and configure them. For this, Ansible needs to be able to SSH into the instances, and the easiest way to make sure it can is with ssh-agent.

So, first launch ssh-agent and then add your AWS key pair (such as the one generated by Ansible):

ssh-agent bash
ssh-add tfkey.pem
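
If you want to check that the agent actually picked up your key, `ssh-add -l` lists what it is holding. Here’s a self-contained demonstration with a throwaway key (the key name and path are made up for illustration):

```shell
# start an agent for this shell, create a disposable key, and add it
eval "$(ssh-agent -s)"
ssh-keygen -t rsa -b 2048 -N "" -C "tfkey-demo" -f /tmp/tfkey-demo -q
ssh-add /tmp/tfkey-demo
ssh-add -l    # lists the fingerprint and "tfkey-demo" comment of the added key
```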

Step 4: Setup your instances!

Now we’re finally ready to actually start setting up EC2 instances.

First grab my playbook and fill in the relevant variables:

  vars:
    region: us-east-1
    az: us-east-1d
    #tf_subnet_id: Subnet ID goes here
    #tf_sg_id: Security Group ID goes here
    #tf_vol_id: Volume ID for persistent /data mount

Next, install the playbook’s dependencies:

sudo ansible-galaxy install kuza55.tensorflow

At this point, you’re ready to spin up your instance by running:

ansible-playbook tensorflow.yml

Expect this to take about 10–15 minutes before your instance is ready.
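
Under the hood, the provisioning step is driven by Ansible’s ec2 module. A rough sketch of the kind of task involved (the parameter values here are illustrative, not the playbook’s exact contents):

```yaml
- name: Provision a set of instances
  ec2:
    region: us-east-1
    instance_type: g2.2xlarge
    spot_price: "0.25"          # max hourly bid for the spot instance
    image: ami-XXXXXXXX         # Ubuntu base AMI ID
    key_name: tfkey
    vpc_subnet_id: "{{ tf_subnet_id }}"
    group_id: "{{ tf_sg_id }}"
    exact_count: 1
    count_tag:
      Name: tensorflow
    wait: yes
  register: tensorflow
```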

Step 5: Connect to your Instance!

You should see some output that looks something like:

PLAY RECAP ********************************************************************
52.23.210.205 : ok=... changed=... unreachable=0 ...
localhost : ok=... changed=... unreachable=0 ...

You should be able to connect to the IP you see in that output:

ssh ubuntu@52.23.210.205

At this point the server has been set up with everything you should need to run TensorFlow on Amazon’s GPUs:

  • Java (Necessary for Bazel)
  • Python 2.X
  • CUDA 7.0
  • cuDNN 4.0
  • TensorFlow 0.8.0
  • Bazel (Necessary for running some TensorFlow examples)

And your persistent data volume has been mounted at /data, which is probably where you will want to do your work.

At this point, you’re ready to go!

I’ve personally found it useful to also install Jupyter to do my data munging and interfacing with some TensorFlow models, but that’s an exercise for the reader at the moment :)

Thanks for sticking around

Personally, I am a fan of Ansible since I feel like setting up software is often the worst part of working with computers and I have a hope that by having a standardized interface for installing and configuring software, our lives will all get much simpler.

What follows are some assorted notes that I thought would be useful to share, but aren’t necessary for getting started.

Shutting Down

When you’re not actively using your spot instance, you’ll want to shut it down to stop giving AWS all your money. To do this, you can change a parameter to the ec2 module in tensorflow.yml to say you want zero instances:

- name: Provision a set of instances
  ec2:
    ...
    exact_count: 0

And then re-run the playbook. Then when you want to start your instance again, just set it to 1 and run the playbook again.
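
If you’d rather not edit the file each time, one option (assuming you’re happy to modify the playbook) is to turn the count into a variable:

```yaml
- name: Provision a set of instances
  ec2:
    ...
    exact_count: "{{ instance_count | default(1) }}"
```

Then `ansible-playbook tensorflow.yml -e instance_count=0` shuts things down, and running the playbook without `-e` brings your instance back.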

If you’re completely done with whatever you were working on, you’ll probably want to delete your EBS volume. You’re probably best off just doing this from the web UI, though you can also set state: absent on the ec2_vol module in ec2_res.yml.

Dealing with Spot Instances

At some point someone is going to bid way more than is sane for your instance class and your instance is going to get shut down.

The best way to handle this in TensorFlow is to make sure the model you are running is writing checkpoints to disk in your /data dir.

Once you’ve got that sorted out, you have a machine that can be killed and restarted without any issues; if you’re not running a lot of training, this may be enough.

If you want to restart your training once Spot prices become sane, you can configure Ansible to run a command after the instance is ready using one of the command modules like so:

- command: /data/scripts/mytrain.sh arg1 arg2
  become: yes
  become_user: ubuntu
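
One thing to watch out for: the playbook may get re-run against an instance that is already training. The command module’s `creates` argument makes the task a no-op when a given file exists, so — assuming your training script touches a marker file when it starts (the path below is hypothetical) — you can avoid launching a second copy:

```yaml
- command: /data/scripts/mytrain.sh arg1 arg2
  args:
    creates: /data/checkpoints/started
  become: yes
  become_user: ubuntu
```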

With a little bit of duct tape, you can then just run your playbook in a loop on the machine that is running Ansible. While this causes a bit of overhead, Ansible will make sure that once you can get spot instances at your chosen price, it starts those instances and resumes training your model:

while true; do ansible-playbook tensorflow.yml; sleep 120; done

This doesn’t really have a way for your script to say it’s done, but since most neural nets seem to be finished training when you eyeball a graph and say it’s done, this shouldn’t be a huge issue for now.
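
If you do eventually want the loop to stop on its own, one option is to have your training script touch a marker file when it finishes, and check for that in the loop. Here’s a local sketch of the pattern with a stand-in for the remote check (all paths and the three-iteration cutoff are made up for illustration):

```shell
MARKER=/tmp/train_done_demo
rm -f "$MARKER"
i=0
while [ ! -f "$MARKER" ]; do
  # in real use this would be: ansible-playbook tensorflow.yml; sleep 120
  echo "run $i: would re-run the playbook here"
  i=$((i + 1))
  # stand-in for the training script eventually writing the marker:
  if [ "$i" -ge 3 ]; then touch "$MARKER"; fi
done
echo "marker found, stopping the loop"
```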

Baking an AMI

One of the other nice things about Ansible is that it has a module to create an AMI. If you are unfamiliar with the concept, an AMI is essentially a VM image with all your software already set up that you can launch directly, which reduces your setup time dramatically.

All you need to do to bake an AMI is use the ec2_ami module. Specifically, you can either add this whole snippet to the end of your existing playbook:

- hosts: localhost
  connection: local
  gather_facts: False
  tasks:
    - ec2_ami:
        region: us-east-1
        instance_id: "{{ tensorflow.tagged_instances.0.id }}"
        wait: no
        name: tensorflow_ami
      register: tf_ami
    - debug: var=tf_ami

Or you can run the bake.yml example playbook against your existing instance like so:

ansible-playbook bake.yml -e"instance_id=i-XXXXXX"

Either of these options will print out your AMI ID; slot it into tensorflow.yml and remove the tensorflow role, and your instance will be ready much faster!

Ansible Roles

One of the nice things about Ansible that you should know is that it’s really easy to make use of roles. If you find a role (in Galaxy or not), you can stick it right into a roles/role-name directory relative to your playbook and then make any changes you need to fix anything that doesn’t fit what you want.

Considerations

TLDR: g2.2xlarge

You should probably rent a g2.2xlarge Spot Instance at first; currently these instances are about $0.1–0.2/hr, compared to the $0.65/hr for regular instances. If your model supports GPU parallelism and you would like a ~2.5x speedup for slightly more than ~2.5x the price, you should take a look at the g2.8xlarge instances too; you can change the instance type in tensorflow.yml.

TLDR: 500GB st1 volume if you have a project in mind, 10GB gp2 otherwise

Unless you just want to play around a little bit and don’t care what happens if AWS kills your spot instance, you’re going to want to provision a persistent EBS volume where you store all your data.

I went with a 500GB st1 volume. It’s more expensive than a small gp2 volume, but I’m already using up about 200GB of storage, largely to store a tonne of image training data, so I’m close to break even on a few side projects I’m working on.

An alternative is to use something like s3fs, but it’s not clear to me that you’ll really benefit from doing so.

TLDR: us-east-1d

If you have gone to create your EBS volume you will see that it asks you to specify an AZ, and while you can migrate your EBS volume to another AZ by creating a snapshot and restoring it, this isn’t an operation you really want to be doing regularly, so take a look at https://ec2price.com/ and see what instance prices are like when you’re reading this.
