The Data Science of AWS Spot Pricing

sdksb
Published in Cloud Uprising
6 min read · Sep 14, 2015

Amazon Web Services is the 800-lb gorilla in cloud computing and offers several pricing models: pay-per-use (on-demand), fixed (reserved), and auction-based (spot). The spot model is roughly 8–10x cheaper than on-demand, but with a twist: since instances in the spot market are auctioned, you may or may not “win” the auction. Even if you win and get the instances you asked for, you may subsequently lose them with little notice, depending on the next auction and how others are bidding. Despite this added complexity, with almost an order of magnitude in cost savings, it is worth exploring how to use Spot for at least part of your workloads.

AWS provides data on spot market prices for each instance type in each availability zone (AZ) for the trailing 90 days. For example, Figure 1 shows the market price for a c3.8xlarge in the us-east-1a AZ for the past week. On average, it is about half the on-demand price of $1.68/hr, but there are spikes that push the price up to 10x (!) the on-demand price. When the market price exceeds your bid price, your resources will be lost with very little notice. So you have to set a bid price that balances volatility against cost.

Figure 1: Spot Market Price fluctuations for c3.8xlarge instance in us-east-1e.

Therefore, it’s worthwhile to build algorithms that manage your spot requests dynamically based on current market forces, similar to algorithmic trading, where software decides when and what stocks to trade. Using some common open-source data science tools, I’ll show you how to access, manipulate, and graph the spot market price data. I’ll leave it up to you to develop your own algorithms, but with the available tool set, this is accessible even to non-expert programmers.

All the code used here is accessible in a GitHub repository (see here). The toolset uses IPython, a web-based interactive programming environment similar to Mathematica or Matlab; Pandas for data manipulation; and Matplotlib for graphing. You can add scikit-learn for machine learning and Theano for deep learning neural networks if you wish to do more analysis of the data or build your own predictive models. To run the code you will need an AWS account. Let’s get started …

Getting Started

Assuming you have set up your AWS account, downloaded IPython, and started an interactive notebook session, we can get started. We first create a new notebook and set the parameters for our analysis, including the instance types we will examine and the time frame.

We use the boto Python library to access the AWS APIs; if you followed the link to the source code and the instructions for installing IPython, you already have boto installed. It pulls your credentials from environment variables, so make sure that AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set.

% export AWS_ACCESS_KEY_ID=<your access key>
% export AWS_SECRET_ACCESS_KEY=<your secret key>

Then create a new code cell in your notebook and initialize boto using:
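A minimal sketch of such a cell, assuming boto version 2 (whose `boto.ec2.connect_to_region` call is used here) and the us-east-1 region. The helper names `missing_credentials` and `connect_ec2` are hypothetical, and the check is stricter than boto itself, which can also read credentials from a config file:

```python
import os

# The two variables exported in the shell commands above.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_credentials(env=None):
    """Return the names of required credential variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


def connect_ec2(region="us-east-1"):
    """Open a boto (version 2) EC2 connection for one region (assumed setup)."""
    missing = missing_credentials()
    if missing:
        raise RuntimeError("Set these environment variables first: " + ", ".join(missing))
    import boto.ec2  # imported here so the credential check also runs without boto
    return boto.ec2.connect_to_region(region)


# In the notebook: ec2 = connect_ec2("us-east-1")
```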

Now, we can download the pricing data and store it into a Pandas dataframe:
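A sketch of what the download step could look like, not the repository's actual script. It assumes a boto 2 EC2 connection whose `get_spot_price_history` accepts a `next_token` argument and whose result exposes a `next_token` attribute; the function name `fetch_spot_prices` and the column names are illustrative choices:

```python
import pandas as pd

COLUMNS = ["timestamp", "instance_type", "availability_zone", "price"]


def fetch_spot_prices(ec2, instance_types, start_time, end_time,
                      product_description="Linux/UNIX"):
    """Download spot price history for several instance types into one DataFrame.

    `ec2` is assumed to be a boto 2 EC2 connection; start_time and end_time
    are ISO 8601 strings such as "2015-09-01T00:00:00Z".
    """
    rows = []
    for instance_type in instance_types:
        next_token = None
        while True:
            batch = ec2.get_spot_price_history(
                start_time=start_time,
                end_time=end_time,
                instance_type=instance_type,
                product_description=product_description,
                next_token=next_token,
            )
            for record in batch:
                rows.append((record.timestamp, record.instance_type,
                             record.availability_zone, float(record.price)))
            # An empty or missing token means this was the last batch.
            next_token = getattr(batch, "next_token", None)
            if not next_token:
                break
    df = pd.DataFrame(rows, columns=COLUMNS)
    df["timestamp"] = pd.to_datetime(df["timestamp"])
    return df.sort_values("timestamp").reset_index(drop=True)
```

The connection is passed in as an argument so the same function can be exercised with a stand-in object when no AWS account is at hand.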

This might look complicated, but it really isn’t. The script uses the ec2.get_spot_price_history API call to download the data for each instance type, and since the data may come in batches, it downloads each batch until it has all the data you asked for. In the end, it converts the data into a dataframe so you can easily manipulate and graph it.

Data Overview

The data has a timestamp, instance type, availability zone, and the market price at that timestamp. The first thing we can do is plot the price over time to see the volatility of the c3.8xlarge market for each availability zone:

Figure 2: time series graph of Spot Market Price for c3.8xlarge for different AZs.
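One way to prepare such a plot, sketched with made-up prices and assuming the dataframe has `timestamp`, `instance_type`, `availability_zone`, and `price` columns (the column names and the `prices_by_zone` helper are assumptions). Since a spot price holds until the next change, gaps are filled forward:

```python
import pandas as pd

# Synthetic stand-in for the downloaded data (values are made up).
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2015-09-01 00:00", "2015-09-01 00:00", "2015-09-01 06:00",
        "2015-09-01 09:00", "2015-09-01 12:00",
    ]),
    "instance_type": ["c3.8xlarge"] * 5,
    "availability_zone": ["us-east-1a", "us-east-1e", "us-east-1a",
                          "us-east-1e", "us-east-1a"],
    "price": [0.80, 0.75, 6.00, 0.90, 0.85],
})


def prices_by_zone(df, instance_type):
    """Return one price column per AZ, indexed by timestamp."""
    subset = df[df["instance_type"] == instance_type]
    wide = subset.pivot_table(index="timestamp", columns="availability_zone",
                              values="price", aggfunc="last")
    # A price stays in effect until the next recorded change.
    return wide.sort_index().ffill()


wide = prices_by_zone(df, "c3.8xlarge")
# wide.plot(drawstyle="steps-post") would draw one line per AZ (needs Matplotlib).
```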

This shows that pricing in different availability zones has different characteristics: initially (to the left on the graph), the price in us-east-1a spikes to $6/hr quite often while the other AZs are more stable; more recently (to the right on the graph), us-east-1e shows significant volatility.

We can also easily generate a price histogram, this time looking at a single AZ and partitioning out the data for the 4 instance types in the family:

Figure 3: Histogram of prices for c3 family in a single AZ, us-east-1a.
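A rough sketch of the binning behind such a histogram, using made-up prices. It computes counts with NumPy over bin edges shared by all instance types so the distributions are comparable; the `price_histograms` helper and the column names are assumptions:

```python
import numpy as np
import pandas as pd

# Synthetic prices (made up); the last row is in another AZ and is filtered out.
df = pd.DataFrame({
    "instance_type": ["c3.xlarge"] * 4 + ["c3.2xlarge"] * 4 + ["c3.xlarge"],
    "availability_zone": ["us-east-1a"] * 8 + ["us-east-1e"],
    "price": [0.03, 0.03, 0.04, 0.05, 0.07, 0.07, 0.08, 0.20, 0.50],
})


def price_histograms(df, zone, bins=10):
    """Return shared bin edges and per-instance-type counts for one AZ."""
    subset = df[df["availability_zone"] == zone]
    edges = np.histogram_bin_edges(subset["price"], bins=bins)
    counts = {
        instance_type: np.histogram(group["price"], bins=edges)[0]
        for instance_type, group in subset.groupby("instance_type")
    }
    return edges, counts


edges, counts = price_histograms(df, "us-east-1a")
```

To draw the figure, the same `edges` can be passed as `bins` to Matplotlib's `hist` once per instance type.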

This histogram shows that, at least in us-east-1a, the hourly price generally increases as you go from smaller to larger instance types: the c3.xlarge (in green on the left) is the least expensive, the c3.2xlarge the next, and so on. The c3.8xlarge is the exception: it is often half the price of a c3.4xlarge, which has half the capacity. When making spot requests, it pays to consider multiple instance types to get the lowest price.
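To compare instance types of different sizes, one option is to normalize the price by capacity. The sketch below divides by vCPU count (4, 8, 16, and 32 for the four c3 sizes discussed here); the prices are made up and the `rank_by_price_per_vcpu` helper is hypothetical:

```python
import pandas as pd

# vCPU counts for the c3 sizes examined here.
VCPUS = pd.Series({"c3.xlarge": 4, "c3.2xlarge": 8, "c3.4xlarge": 16, "c3.8xlarge": 32})


def rank_by_price_per_vcpu(prices):
    """prices: Series of $/hr indexed by instance type; cheapest per vCPU first."""
    return (prices / VCPUS).dropna().sort_values()


# Made-up current prices in $/hr.
current = pd.Series({"c3.xlarge": 0.04, "c3.2xlarge": 0.08,
                     "c3.4xlarge": 0.60, "c3.8xlarge": 0.30})
ranked = rank_by_price_per_vcpu(current)
```

With these illustrative numbers the c3.8xlarge comes out cheapest per vCPU and the c3.4xlarge most expensive, which is the kind of inversion described above.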

It is also interesting to note that each instance type has a minimum price: the c3.2xlarge minimum is almost exactly 2x the minimum of the c3.xlarge, the c3.4xlarge minimum is 2x that of the c3.2xlarge, and so on. You can also easily compute numerical averages and variances to get exact values in addition to the visual analysis via graphs.

Figure 4: numeric calculation of statistical properties of market price.
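A small sketch of such a calculation with Pandas, on made-up prices. The column names and the `price_summary` helper are assumptions, and the statistics weight every recorded price equally rather than by how long it was in effect:

```python
import pandas as pd

# Synthetic prices (made up) for two instance types in one AZ.
df = pd.DataFrame({
    "instance_type": ["c3.xlarge"] * 3 + ["c3.2xlarge"] * 3,
    "availability_zone": ["us-east-1a"] * 6,
    "price": [0.032, 0.040, 0.036, 0.064, 0.090, 0.070],
})


def price_summary(df):
    """Minimum, mean, variance, and maximum price per instance type and AZ."""
    grouped = df.groupby(["instance_type", "availability_zone"])["price"]
    return grouped.agg(["min", "mean", "var", "max"])


summary = price_summary(df)
# Ratio of the price floors of two adjacent sizes.
floor_ratio = (summary.loc[("c3.2xlarge", "us-east-1a"), "min"]
               / summary.loc[("c3.xlarge", "us-east-1a"), "min"])
```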

Price Trends

Another hypothesis is that market prices vary by time of day or day of the week. We can extract the hour or day from each recorded price and compute the mean and standard deviation for each hour of the day. Figure 5 shows both the mean and the standard deviation of the price by hour for the c3.xlarge instance.

Figure 5: average price for c3.xlarge aggregated by hour of the day. Plot shows both mean and standard deviation.
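The hour-of-day aggregation can be sketched as follows, on made-up data with timestamps taken to be in UTC. The `hourly_profile` helper and the column names are assumptions, and each recorded price counts once regardless of how long it lasted:

```python
import pandas as pd

# Synthetic price records (made up) at two hours of the day over two days.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2015-09-01 11:05", "2015-09-02 11:40",
                                 "2015-09-01 15:10", "2015-09-02 15:30"]),
    "instance_type": ["c3.xlarge"] * 4,
    "price": [0.03, 0.03, 0.05, 0.07],
})


def hourly_profile(df, instance_type):
    """Mean and standard deviation of the price for each hour of the day."""
    subset = df[df["instance_type"] == instance_type]
    hour = subset["timestamp"].dt.hour
    return subset.groupby(hour)["price"].agg(["mean", "std"])


profile = hourly_profile(df, "c3.xlarge")
cheapest_hour = int(profile["mean"].idxmin())
```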

It seems the best time to use this instance is 11am UTC (7am EDT), when the price is 30% below the overall average. The low standard deviation at that hour suggests that this pattern recurs.

You can similarly look at day of week patterns or other time frequencies. Pandas is an essential tool for time series data, providing filtering and resampling that are both powerful and easy to use.
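As an illustration of resampling, the sketch below puts the irregular price changes onto a regular hourly grid (carrying the last price forward) and then averages by day of the week. The data are made up, and the `weekday_profile` helper and column names are assumptions:

```python
import pandas as pd

# Synthetic price changes (made up): a Monday, then two records on Tuesday.
df = pd.DataFrame({
    "timestamp": pd.to_datetime(["2015-09-07 00:00", "2015-09-08 00:00",
                                 "2015-09-08 23:00"]),
    "instance_type": ["c3.xlarge"] * 3,
    "availability_zone": ["us-east-1a"] * 3,
    "price": [0.10, 0.20, 0.20],
})


def weekday_profile(df, instance_type, zone):
    """Average hourly price per day of week (0 = Monday) for one type and AZ."""
    mask = (df["instance_type"] == instance_type) & (df["availability_zone"] == zone)
    series = df[mask].set_index("timestamp")["price"].sort_index()
    # Regular hourly grid; each price holds until the next recorded change.
    hourly = series.resample("60min").last().ffill()
    return hourly.groupby(hourly.index.dayofweek).mean()


by_weekday = weekday_profile(df, "c3.xlarge", "us-east-1a")
```

Resampling first means a price that held for many hours counts for all of them, instead of counting once per recorded change.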

Summary

These examples show that accessing and analyzing historical spot pricing is easy with modern data science tools, and that very simple calculations and aggregations can yield insights that save you significant money.

But be warned: pricing changes week to week and even day to day. If Netflix decides to re-encode all its movies for the new Apple TV, the price and availability statistics of the instances will change dramatically. Your mileage will vary, which is why it is important to build algorithms that can adapt to market fluctuations.

With this as your basis, you can go in a number of interesting directions, for example:

  • estimate the probability that the market price will exceed your bid price, and hence the probability that your resources will survive long enough to complete their computation,
  • analyze the volatility of each AZ and instance type in order to determine where to request instances,
  • use more advanced techniques to find recurring patterns in the data that you can then take advantage of.
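As a starting point for the first direction, here is a simple empirical estimate: the fraction of the historical window during which the price was above a given bid. It is a backward-looking statistic on made-up data, not a bound or a prediction; the `fraction_of_time_above` helper is hypothetical and assumes timezone-naive timestamps:

```python
import numpy as np
import pandas as pd


def fraction_of_time_above(prices, bid, end_time):
    """Share of the window during which the market price exceeded `bid`.

    prices: Series of prices indexed by the time of each price change.
    end_time: end of the observation window (the last price holds until then).
    """
    prices = prices.sort_index()
    times = prices.index.append(pd.DatetimeIndex([end_time]))
    # Seconds each price stayed in effect before the next change.
    durations = np.diff(times.values).astype("timedelta64[s]").astype(float)
    above = prices.to_numpy() > bid
    return float(durations[above].sum() / durations.sum())


# Made-up history: a three-hour spike inside a twelve-hour window.
history = pd.Series(
    [0.30, 2.00, 0.35],
    index=pd.to_datetime(["2015-09-01 00:00", "2015-09-01 06:00", "2015-09-01 09:00"]),
)
risk = fraction_of_time_above(history, bid=0.50, end_time="2015-09-01 12:00")
```

Here a bid of $0.50 would have been exceeded for 3 of the 12 hours, so `risk` is 0.25; raising the bid above the spike brings it to zero for this window.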

If you have any thoughts or comments, feel free to post here or contact me at: karan.bhatia@gmail.com
