Running Apache Spark on AWS without Busting the Bank

Saar Berkovich
Published in Snipe.gg
8 min read · Nov 19, 2018

At Snipe, we deal with huge amounts of data. With more than 100 million active League of Legends players out there, each making hundreds of decisions within every single game, it should be obvious that in order to isolate the decisions that matter, one needs to go through billions that don’t. Add to the equation the fact that the “meta” of the game shifts on a bi-weekly basis, and it’s also easy to grasp that you need to go through this process consistently.

In order to deal with those numbers, we opted to use Apache Spark — a cluster-computing framework that’s great for data processing and analytics at scale. The problem: computer clusters can, naturally, get very expensive.

Guided by a startup “gotta go fast” and “gotta (be able to) pay rent” mindset, we set out to find the best way of getting a Spark cluster running quickly on the cheap, on AWS, without compromising on performance.

An important point to make before we start is that we’re using Spark to process data from external data sources, such as S3 and MongoDB. This post may prove of less use to those wishing to run Spark on an existing Hadoop cluster (in order to process data on HDFS, for instance).

To put our “fast & cheap” mantra into technical terms, we wanted to achieve the following:
1. Automated Setup — Setting up a Spark cluster from the ground up takes time, mostly because of configuration considerations. Spark is a high-level framework that relies on several different layers: it is written in Scala, which runs on the JVM, and it relies on a resource management platform, such as Hadoop YARN or Apache Mesos (though it’s also deployable in a standalone mode). This results in a myriad of possible configurations, each with its pros and cons. We wanted to be able to deploy a proven-to-work configuration fast and optimize it to fit our needs later.

Legendary producer Bruce Dickinson delivering the only prescription for his fever: more cowbell.

2. Scaling Capabilities — There are times when you just need more processing power. When you’re analyzing tens of gigabytes of data at a time, optimizing an existing setup only gets you so far. If Bruce Dickinson wants more cowbell, you should probably give him more cowbell. Having said that, the choruses on Don’t Fear the Reaper have less cowbell than the verses, and the guitar solo has virtually none. When running heavy analytical tasks (i.e., batch processing tasks), you will typically not be running the same tasks constantly, but rather at timed intervals (if you find yourself running the same heavy data processing job constantly, over and over again, you should reconsider the design of your job). As such, we want to be able to scale out when running heavy tasks, so they have more resources to work with and finish faster, and then scale in after they’re done, which, when running in the cloud, saves a lot of money.

We started our research by evaluating “AWS-native” solutions, ones we could launch straight from the AWS Management Console, on the assumption that they would be fast to set up and come pre-configured to work well with AWS infrastructure. The service we were looking at was Elastic MapReduce (EMR), which essentially provides a 1-click-setup Spark cluster running on Hadoop YARN (other Hadoop-based frameworks are supported, too), complete with monitoring, auto-scaling, and various other capabilities. Unfortunately, upon checking the EMR pricing, we realized that EMR adds a pricing overhead of a whopping 25% to the price of the EC2 instances it uses, rendering it unusable on our budget. I would further argue that a pricing overhead of that magnitude renders EMR cost-ineffective for most Spark-related use cases. The AWS Marketplace, while more affordable, didn’t have anything that supported auto-scaling.

Knowing that AWS’ native solutions didn’t fit our needs, we started looking for alternatives. This split into two decisions: the infrastructure we wanted to use (which essentially boiled down to the EC2 instance types), and the software we wanted to use to set up, manage, and scale clusters.

On instance type, we figured we would start with the M-family of EC2 instances, as they are general-purpose, fixed-performance machines, and that we should be using Spot Instances rather than On-Demand ones whenever possible. A few words on AWS Spot Instances for those unfamiliar with the concept (which is relevant to most other cloud providers, too): AWS allows you to bid on unused EC2 machines. At any given time there is a going price for each instance type; if you bid that price or more, you will receive the requested instance practically instantly. The price of a Spot Instance is usually at least 50% off the standard On-Demand price.
What’s the catch? AWS can and will terminate your instances if it needs to take them back, for any reason, without prior warning. This puts a lot of people off; however, Spot Instances can often run for months uninterrupted, making them ideal for handling temporary production workloads (such as a batch data processing job).

As for the software, we opted to use Flintrock, an open-source, configurable CLI for launching and scaling Spark clusters on AWS. It is somewhat of a spiritual successor to spark-ec2, an older utility that used to come baked into Apache Spark installations (it is now defunct). We found Flintrock to be versatile enough for our needs, allowing us to create clusters that mix On-Demand Instances with Spot Instances, while also being fast: creating or scaling a cluster usually takes less than 5 minutes (our average is around 3 minutes per cluster change).

The next segment will advise on deploying a Spark cluster similar to ours, one that goes fast without infuriating the bank. It assumes a basic understanding of AWS, the Unix shell, and Apache Spark.

Apache Spark Cluster Setup with Flintrock

Start by installing Flintrock either on your workstation or on an AWS server used for operations (I would recommend the latter). Continue by using the flintrock configure command to alter Flintrock’s default configuration. This should open a .yml file in your default text editor; there are a few important values that you may need to change:

services:
  spark:
    version: # The version of Spark you wish to run
providers:
  ec2:
    key-name: # The name of the key pair to be used for SSH
    identity-file: # The path to the key pair (on the local machine)
    region: # The AWS region you want the cluster created in
    ami: # The AMI you want to use for the instances. For instance, we’re using ami-a0cfeed8, which is the Amazon Linux AMI on us-west-2.
launch:
  num-slaves: # The number of slaves you want permanently running in your cluster

After you’re done configuring, simply run the flintrock launch cluster-name command to have Flintrock use the AWS API to launch a cluster with your chosen configuration. Once it’s done, I suggest running flintrock --help to get an idea of how you can actually use the cluster. A practical example would be using flintrock copy-file --master-only to copy a Spark job to the master, and then invoking a spark-submit command using flintrock run-command.
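To make that flow concrete, here is a sketch of an end-to-end session; the cluster name, file paths, and job script are all illustrative, and the master URL assumes Flintrock’s standalone-mode deployment, where the Spark master listens on port 7077:

```shell
# Launch the cluster defined in the Flintrock config
flintrock launch analytics-cluster

# Copy the job file to the master node only
flintrock copy-file --master-only analytics-cluster ./job.py /home/ec2-user/job.py

# Submit the job from the master; $(hostname) is left unexpanded locally
# (single quotes) so it resolves on the master itself
flintrock run-command --master-only analytics-cluster \
    'spark-submit --master spark://$(hostname):7077 /home/ec2-user/job.py'
```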

Prescribing More Cowbell

So far, we’ve discussed creating an On-Demand Instance-based Spark cluster using Flintrock. In practice, you will probably want to keep the number of slaves low (1–2 is often enough for Spark Streaming jobs) and increase it when you’re running batch data processing jobs. This next segment will discuss how to go about doing that, using Spot Instances to reduce costs.

You will first need to figure out the price to bid on your Spot request. Use the Spot Instance Pricing History (in the EC2 Console) to find the going price for your chosen instance type in your region (if you didn’t change it in the Flintrock config, that should be m3.medium). After you find the price, scaling out is as easy as running the following command:
flintrock add-slaves --num-slaves {{number of slaves}} --ec2-spot-price {{price}} cluster-name
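If you prefer the command line to the console for looking up prices, the same history is available through the AWS CLI (this assumes the m3.medium/us-west-2 defaults and credentials permitted to call ec2:DescribeSpotPriceHistory):

```shell
# Show recent Spot prices per Availability Zone for m3.medium Linux instances
aws ec2 describe-spot-price-history \
    --region us-west-2 \
    --instance-types m3.medium \
    --product-descriptions "Linux/UNIX" \
    --max-items 10 \
    --query 'SpotPriceHistory[*].[AvailabilityZone,SpotPrice,Timestamp]' \
    --output table
```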

In order to scale in, you will run the command:
flintrock remove-slaves --num-slaves {{number of slaves}} cluster-name
As long as the number of slaves removed here is equal to the number requested by the previous command, the slaves that get removed should be the ones created by the Spot requests.

Running all Instances as Spot

In some cases, you may want all nodes in your cluster to be launched on Spot Instances (which dramatically reduces costs).

In order to do that, simply invoke the flintrock configure command before launching a new cluster and change the value of providers:ec2:spot-price to a price appropriate for your instance type, as described in the section above.
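Assuming the m3.medium default, the relevant part of the Flintrock config would then look something like this (the price shown is illustrative; check the current Spot pricing for your region):

```yaml
providers:
  ec2:
    instance-type: m3.medium
    spot-price: 0.01 # bid in USD/hour; illustrative only
```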

Advice on Auto-Scaling Implementation

How to implement auto-scaling with Flintrock depends greatly on the way you’re executing jobs on your cluster, which is why I’m taking a more theoretical approach in this section.

The flow you want to achieve when running batch jobs with auto-scaling boils down to: scale out (add slaves) -> run the task and wait for it to finish -> scale in (remove slaves).
An improvised way of doing that would be a simple script that executes the logic discussed above.

In some cases, it may be better to split the process into two scripts: one that adds slaves and runs the task, and another that periodically checks whether the job has finished and, if it has, runs the remove-slaves command. It can be handy to run the second script as a cron job near the end of each hour.
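As a sketch of the single-script approach, here is a hypothetical bash wrapper. The cluster name, slave count, Spot price, and job path are all placeholders, and the flintrock commands only execute when APPLY=1 is set; by default the script just prints what it would run, so you can preview it safely:

```shell
#!/usr/bin/env bash
# Hypothetical auto-scaling wrapper: scale out, run the job, scale in.
set -euo pipefail

CLUSTER="analytics-cluster"   # placeholder cluster name
EXTRA_SLAVES=4                # placeholder slave count
SPOT_PRICE="0.02"             # placeholder bid in USD/hour
JOB="/home/ec2-user/job.py"   # placeholder job path on the master

# Print the command unless APPLY=1 is set, in which case execute it
run() {
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"
  else
    echo "would run: $*"
  fi
}

# 1. Scale out with Spot Instances
run flintrock add-slaves --num-slaves "$EXTRA_SLAVES" \
    --ec2-spot-price "$SPOT_PRICE" "$CLUSTER"

# 2. Run the batch job on the master and block until it finishes
#    (\$ keeps $(hostname) unexpanded locally, so it resolves on the master)
run flintrock run-command --master-only "$CLUSTER" \
    "spark-submit --master spark://\$(hostname):7077 $JOB"

# 3. Scale back in; the Spot slaves we just added get removed
run flintrock remove-slaves --num-slaves "$EXTRA_SLAVES" "$CLUSTER"
```

The two-script variant splits steps 1–2 and step 3 into separate files, with the second invoked from cron.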

Final Notes

  • As discussed above, setting up an Apache Spark cluster involves many considerations. The method shown here is fast, but I wouldn’t necessarily call it production-ready. I recommend spending some time reading the Spark docs after you have an initial, working cluster set up.
  • Flintrock’s default instance type is m3.medium. While it’s a very cost-effective machine when making Spot requests, from our tests it seems that Spark typically prefers a small number of big machines over a big number of small machines.
  • AWS is not the only cloud services provider on the market. Unless you have specific reasons for using it over others (as we did), you should check out alternatives. Google Dataproc, for example, appears to be significantly more user-friendly and cost-effective than Amazon EMR.


Software Engineer with data scientific tendencies. Passionate about science, music, video games, traveling, and building stuff.