Mastering the Spot Request — Deep Learning at a Discount

Launch and use AWS spot-instances with one simple script

Carlos Valcarcel
The Startup
9 min read · Sep 18, 2019


This article is a walk-through of how to launch and interact with spot instances using Python. Re-create it yourself or use the code on GitHub.

Intro

I’ve been working with huge datasets my entire career, everything from national health surveys and global air quality data to daily activity streams for an entire university. When it comes to computing power I’ve only ever asked for one thing: MORE! I usually had access to a supercomputer or some fancy piece of hardware, but when I went to work for a start-up I began looking for DIY solutions. That’s when my good friends at Replica Studios put me on to spot instances. Check out that deep learning company for an example of what is possible when you master cloud computing on the cheap.

Spot Instances

AWS will rent out hardware to anyone who wants it as long as they are willing to pay the price. Like most businesses, AWS cannot rent out everything they have all the time and are often left with excess capacity. Fortunately, Amazon opens up this excess capacity for bidding at a fraction of the original price.

The Catch

Unfortunately, instances launched through spot requests can be requisitioned by AWS whenever someone is willing to pay the full on-demand price. This means your jobs can get interrupted. You get a 2-minute warning, which leaves you some time to save your work.

This article will introduce you to the spot_connect.py script to automatically create and use spot instances.

Getting to Know AWS

We will make use of the following AWS services in this walk-through:

  • AWSCLI — the AWS Command Line Interface, which allows your local machine to interact with AWS.
  • Boto3 — a Python module that allows you to do everything the AWSCLI can do from within a Python script.
  • EC2 — all services that have to do with launching virtual machines on rented servers.
  • EFS (Elastic File System) — a storage option that grows as you add data.
  • IAM — services that provide user verification and access management.

Setting Up Your AWS Account

  1. Your first step is creating an AWS account. Once you’ve done that you will have access to all AWS services.
  2. Download and install AWS Command Line Interface using
    pip install awscli.
  3. At your AWS dashboard go to the services menu at the top left and select IAM under Security, Identity and Compliance. Click on Users and then create an access key for your user.
  4. Open the command prompt you used to install awscli, type aws configure, and follow the instructions to enter your access key. You can verify the setup with the snippet below.
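If the credentials were entered correctly, boto3 should pick them up automatically. A minimal check (this snippet is not part of spot_connect.py; get_caller_identity is the standard STS identity call):

import boto3

# Prints your account ID and user ARN if the keys are configured;
# a credentials error here means step 4 did not take effect.
print(boto3.client('sts').get_caller_identity())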

Creating a Spot Instance

If everything was configured correctly you should be able to download the Spot-Instance-AWS repository and run the spot_connect.py script from your operating system’s command line. The README explains which options are available when running the script and what they do.

The GIF above shows a user executing the spot_connect.py script to:

  • launch a spot-instance
  • create an EFS
  • link the instance to the EFS
  • upload a script and data to the EFS via the instance
  • run a test script, and
  • leave open an active shell connected to the instance.

The rest of this article will provide an overview of how this gets done. Follow along in the spot_connect.py script to understand the details of the code.

Example Use

python spot_connect.py -n test -p default -s test.sh -f project_storage

The command above creates or reconnects to an instance named “test,” uses the “default” profile settings in the spot_connect.py script, executes the “test.sh” script, and creates or connects to an elastic file system with the name (creation token) “project_storage.”

Requesting a Spot Instance

This section will cover the launch_spot_instance method which creates or reconnects to spot instances.

Key Pairs & Security Groups

The first things to define are the key pair and security group. Key pairs allow you to connect to the instance once it is created; security groups define which ports and IP ranges are allowed to connect to the instance.

The spot_connect.py script depends entirely on boto3 to interact with AWS resources. You can use the client class to connect to any AWS service and run any function you would run through the dashboard. The very first thing the script does is connect to EC2 using client=boto3.client('ec2').

The client object lets us create and interact with any sub-service available through EC2. Using them is as easy as checking the documentation. Our script tries to create a key pair and a security group using the instance name submitted by the user: client.create_key_pair() creates a key pair or falls back to an existing one, and client.create_security_group() does the same for security groups.
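A minimal sketch of those two calls (abridged from the pattern in the script; the name variable and the error handling are illustrative):

import boto3
from botocore.exceptions import ClientError

client = boto3.client('ec2')
name = 'test'  # illustrative instance name

try:
    # The private key material is only returned at creation time
    keypair = client.create_key_pair(KeyName=name)
    with open(name + '.pem', 'w') as f:
        f.write(keypair['KeyMaterial'])
except ClientError:
    pass  # key pair already exists; reuse the saved .pem file

try:
    client.create_security_group(GroupName=name,
                                 Description='spot_connect group')
except ClientError:
    pass  # security group already exists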

When the script creates a key pair, it saves the private key to a “.pem” file in the directory where the script is being executed. Make sure this file is present when trying to reconnect to instances. When creating the security groups, we also define some ingress and egress rules using variations of the following command:

client.authorize_security_group_ingress(
    GroupName=security_group_name,
    IpPermissions=[{
        'FromPort': 2049,
        'IpProtocol': 'tcp',
        'IpRanges': [{'CidrIp': '0.0.0.0/0'}],
        'ToPort': 2049}]
)

The command above identifies a security group and creates a rule that lets credentialed users from any IP address request access through the designated port. Each port serves a different purpose (a combined example follows the list):

  • 22: Remote connection port for users (i.e. PuTTY, Paramiko, etc.)
  • 80, 443: HTTP & HTTPS which can be used for other AWS services such as DataSync.
  • 2049: for NFS access. This is necessary to connect the storage system we will use to the instance.
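Since the script opens several ports, the same call can carry one rule per port (a sketch, not the script’s exact code):

client.authorize_security_group_ingress(
    GroupName=security_group_name,
    IpPermissions=[
        # one entry per port: 22 (SSH), 80/443 (HTTP/S), 2049 (NFS)
        {'FromPort': port, 'ToPort': port, 'IpProtocol': 'tcp',
         'IpRanges': [{'CidrIp': '0.0.0.0/0'}]}
        for port in [22, 80, 443, 2049]
    ]
)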

Launching the Instance

With security set up, we review the current spot instance requests using client.describe_spot_instance_requests(). If none are detected, we make a new spot instance request. A request persists as “active” as long as the instance is still running. When an instance is terminated the request disappears.

spot_connect.py uses the profile dictionary in the __main__ block to define the hardware, OS, and max price the user is willing to pay for the instance. The profile dictionary is where one can set the main parameters for a spot instance. When a new request is made, these parameters are submitted to AWS through the client.request_spot_instances() method (a sketch follows the profile below).

# The "default" profile in the profiles dictionary
"default": {
    # default ingress rule necessary to connect to an instance
    'firewall_ingress': ('tcp', 22, 22, '0.0.0.0/0'),
    # Go to the launch-wizard to find recommended image IDs.
    'image_id': 'ami-0859ec69e0492368e',
    # List of types at https://aws.amazon.com/ec2/instance-types/
    'instance_type': 't2.micro',
    # Prices at https://aws.amazon.com/ec2/spot/pricing/
    'price': '0.004',
    # Instance type availability changes by region
    'region': 'us-west-2',
    # Scripts you want to execute as soon as the instance launches
    'scripts': [],
    # Default usernames https://alestic.com/2014/01/ec2-ssh-username/
    'username': 'ec2-user',
    # If true, mounts a file system on the instance.
    'efs_mount': True
}
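A sketch of how those parameters map onto the boto3 call (parameter names follow the request_spot_instances API; client, name and profile are the objects defined above):

response = client.request_spot_instances(
    SpotPrice=profile['price'],
    InstanceCount=1,
    LaunchSpecification={
        'ImageId': profile['image_id'],
        'InstanceType': profile['instance_type'],
        'KeyName': name,           # key pair created earlier
        'SecurityGroups': [name],  # security group created earlier
    })
request_id = response['SpotInstanceRequests'][0]['SpotInstanceRequestId']

# Block until AWS fulfills the request (the waiter polls for us)
client.get_waiter('spot_instance_request_fulfilled').wait(
    SpotInstanceRequestIds=[request_id])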

The instance may take a few moments to initialize and be ready for use. If an error is returned, check the “Instances” tab under the EC2 dashboard. One common mistake is to set the price too low; avoid this by checking on prices beforehand here.

Once a request has been fulfilled, client.describe_instances() can be used to find any instance and retrieve the information necessary to connect to it. At this point a virtual machine is available in your private cloud and ready to be used.
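For example, to recover the connection details from a fulfilled request (a sketch; request_id comes from the request above):

# The fulfilled request records which instance it launched
request = client.describe_spot_instance_requests(
    SpotInstanceRequestIds=[request_id])
instance_id = request['SpotInstanceRequests'][0]['InstanceId']

# describe_instances returns the details needed to connect
reservations = client.describe_instances(InstanceIds=[instance_id])
instance = reservations['Reservations'][0]['Instances'][0]
print(instance['PublicIpAddress'], instance['KeyName'])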

Setting up Storage

This section covers the launch_efs and retrieve_efs_mount methods in spot_connect.py.

Elastic File Systems

With a virtual machine ready, we need to be able to access our data to run things on it. For this we use EFS because it allows us to get around some of the disadvantages of using spot instances. As mentioned above, instances initiated through spot requests can be shut down with a two-minute warning. This means you may need to launch more than one spot request before you are able to complete a set of tasks.

Elastic file systems make it so that all data is available to our instance as if it were just another folder on the machine. In other words, there is no need to download data each time a new instance is created, simply connect (“mount”) your EFS onto your instance.

Mounting a File System on an Instance

To accomplish this, spot_connect.py creates a new client object connected to the EFS service using client=boto3.client('efs') and checks whether the file system specified by the user exists. If not, it creates one using client.create_file_system(). Then it checks whether any mount targets exist for the file system. Mount targets are essentially IP addresses within your cloud’s subnet that give instances access to the EFS.

If none are detected, client.create_mount_target() is used to create one. Creating a mount target requires submitting a specific IP address. We use the IPNetwork class from the netaddr module to retrieve a list of the IPs available to our cloud and select one to use for the mount target.
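A condensed sketch of that sequence (the subnet ID and CIDR block are illustrative; spot_connect.py looks them up from the instance’s own subnet):

import boto3
from botocore.exceptions import ClientError
from netaddr import IPNetwork

efs_client = boto3.client('efs')

try:
    # Create the file system, keyed by its creation token
    efs_client.create_file_system(CreationToken='project_storage')
except ClientError:
    pass  # a file system with this creation token already exists
fs = efs_client.describe_file_systems(CreationToken='project_storage')
file_system_id = fs['FileSystems'][0]['FileSystemId']

# Pick an IP from the subnet's CIDR block for the mount target
ips = [str(ip) for ip in IPNetwork('172.31.0.0/20')]
efs_client.create_mount_target(FileSystemId=file_system_id,
                               SubnetId='subnet-0123456789abcdef0',
                               IpAddress=ips[10])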

With the EFS and mount target created, the only thing left is to connect the EFS to the instance. This must be done from within the instance using the steps described in this guide. spot_connect.py creates a bash script called “efs_mount.sh” and saves it to the local directory with all the information necessary to connect to the EFS specified by the user.

Specifically, “efs_mount.sh” will create a folder in the instance and mount the EFS on that folder such that any directory and file placed in that folder will be preserved and available in any other instance mounted on the EFS.
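A sketch of what writing that script might look like (the mount options follow the standard NFS settings from the AWS guide; the mount target IP is the one chosen above):

mount_target_ip = '172.31.0.10'  # illustrative mount target IP
script = '\n'.join([
    'mkdir -p /home/ec2-user/efs',
    # Standard NFS v4.1 options recommended for EFS
    ('sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,'
     'hard,timeo=600,retrans=2 %s:/ /home/ec2-user/efs'
     % mount_target_ip),
    'sudo chmod go+rw /home/ec2-user/efs',
])
with open('efs_mount.sh', 'w') as f:
    f.write(script)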

Working with Instances

This section covers the connect_to_instance, run_script, upload_to_ec2, and active_shell methods in spot_connect.py.

Connecting to an Instance

Connecting to an instance can be done using PuTTY or other similar tools. We use the paramiko module for Python to connect to our instances, upload data and run scripts. Paramiko can create an SSH client and connect to our instance. Using the SSH client, we use the run_script method in spot_connect.py to run each of the “.sh” scripts we submit, including “efs_mount.sh.”

import sys
import paramiko

def connect_to_instance(ip, keyfile, username='ec2-user', port=22):
    ssh_client = paramiko.SSHClient()
    ssh_client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    k = paramiko.RSAKey.from_private_key_file(keyfile + '.pem')
    retries = 0
    print('Connecting...')
    while True:
        try:
            ssh_client.connect(ip,
                               username=username,
                               pkey=k,
                               port=port,
                               timeout=10)
            break
        except Exception as e:
            retries += 1
            sys.stdout.write(".")
            sys.stdout.flush()
            if retries >= 5:
                raise e
    print('Connected')
    return ssh_client
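The run_script method itself is not reproduced here, but a minimal equivalent using paramiko’s exec_command looks like this (a sketch; spot_connect.py adds more error handling):

def run_script(ssh_client, script_path):
    # Read the local ".sh" file and execute its contents remotely
    with open(script_path, 'r') as f:
        commands = f.read()
    stdin, stdout, stderr = ssh_client.exec_command(commands)
    for line in stdout:  # stream output as it arrives
        print(line, end='')
    # Exit status 0 means every command in the script succeeded
    return stdout.channel.recv_exit_status() == 0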

Uploading Data

With the EFS mounted onto the instance we can now store data on the EFS via the instance. We do this using an ordinary SFTP transfer with Paramiko:

def upload_to_ec2(instance, user_name, files, remote_dir='.'):
    print('Connecting...')
    client = connect_to_instance(instance['PublicIpAddress'],
                                 instance['KeyName'],
                                 username=user_name, port=22)
    print('Connected. Uploading files...')
    sftp = client.open_sftp()
    for f in files:
        print('Uploading %s' % str(f.split('\\')[-1]))
        sftp.put(f,
                 remote_dir + '/' + f.split('\\')[-1],
                 callback=printTotals,  # show progress
                 confirm=True)
    print('Uploaded to %s' % remote_dir)
    return True

Upload speed does not depend on the instance type, only on your internet connection speed. To give an idea: it took under 8 minutes to upload a 150 MB file on a busy 5 GHz Wi-Fi connection. If that is not speedy enough, AWS offers the DataSync service, which allows high-speed, real-time transfers from local systems or other AWS services. There is a “datasync” profile in spot_connect.py which launches a spot instance with the DataSync AMI and the recommended hardware to create an agent.

Any new data that is created or changed in the instance and saved into the EFS folder will automatically be available and persist in the EFS, even if the instance is terminated. This makes it easy to resume work in case a spot-instance is requisitioned.

Active Shell

If you are paranoid like me, you like being able to double-check things yourself. Fortunately, we can use one of the files available in the paramiko repository to create an active shell through which we can access our instance in real time, as if connected to it directly. The GIF at the beginning of the article shows a user changing directories and listing the files in a directory.
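A sketch of that approach (interactive.py is the demo file from the paramiko repository and must sit next to your script; ssh_client comes from connect_to_instance above):

from interactive import interactive_shell  # paramiko's demo helper

# Reuse the SSH connection and hand control to a live shell
channel = ssh_client.invoke_shell()
interactive_shell(channel)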

Termination Time

The last thing we need to cover is termination time. If Amazon needs your spot instance to fulfill an on-demand request, it will interrupt your instance. We can make URL requests from within the instance to check whether it is about to be shut down. Users should consider using the shebang line (or a similar mechanism) in their Python programs in order to run them in the background.

The request will return a code of 404 (not found) if the instance is not in danger and 200 when the instance has 2 minutes left. Users may build a loop into their bash scripts so that termination-time requests are made repeatedly, interrupting and saving everything once a termination time is set. Some users suggest sending a termination-time request every 5 seconds.
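A sketch of such a loop in Python (the URL is AWS’s documented spot termination-notice endpoint in the instance metadata service; the 5-second interval follows the suggestion above):

import time
import requests

URL = 'http://169.254.169.254/latest/meta-data/spot/termination-time'

while True:
    try:
        # 200 means a termination time is set: about 2 minutes remain
        if requests.get(URL, timeout=2).status_code == 200:
            # save checkpoints and results to the EFS folder here
            break
    except requests.RequestException:
        pass  # metadata service hiccup; just retry
    time.sleep(5)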

Wrap-Up

It should be noted that the spot_connect.py script does not duplicate any AWS resources: if it finds a request, instance, key pair, security group or EFS with the same name, it will attempt to reconnect. Keep track of the names you use and remember to use the dashboard to troubleshoot any unexpected errors. Make sure everything you do happens in the same region; many AWS services cannot interact with each other across regions.

Credit to Pēteris Ņikiforovs. The code started as an update to his original post on working with Spot Instances but eventually grew to include new functionality. Feedback is welcome, particularly on how to program around termination time requests more efficiently. Thanks!
