EXPEDIA GROUP TECHNOLOGY — SOFTWARE

Part 2: Real World Apache Spark Cost Tuning Examples

I outline the procedure for working through cost tuning

Brad Caffey
Expedia Group Technology


Photo by Fabrik Bilder on Shutterstock

If you like this blog, please check out my companion blogs about how to use persist and coalesce effectively to lower job cost and improve performance.

Below is a screenshot highlighting some jobs at Expedia Group™ that were cost tuned using the principles in this guide. I want to stress that no code changes were involved; only the spark-submit parameters were changed during the cost-tuning process. Pay close attention to the node utilization column highlighted in yellow.

Node configurations before and after cost tuning: cost reductions of Apache Spark jobs achieved by getting the node utilization right (costs are representative)

Here you can see how improving the CPU utilization of a node lowered the cost for that job. If too many Spark cores compete for a node's CPUs, time slicing occurs, which slows down our Spark cores and in turn hampers job performance. If too few Spark cores are utilizing the node's CPUs, we are wasting money on that node's time because some of its CPUs go unused.
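As a rough illustration (the node size, executor count, and cores per executor below are hypothetical values, not figures from the table above), node CPU utilization is simply the total Spark cores scheduled on a node divided by the node's vCPUs:

```python
# Hypothetical example: estimate CPU utilization for a single node.
node_vcpus = 16             # vCPUs available on the node
executors_per_node = 3      # executors placed on one node
cores_per_executor = 5      # spark.executor.cores

spark_cores_on_node = executors_per_node * cores_per_executor
utilization = spark_cores_on_node / node_vcpus

print(f"{spark_cores_on_node} Spark cores on {node_vcpus} vCPUs "
      f"= {utilization:.0%} node CPU utilization")   # 94%
```

Values well above 100% mean time slicing; values well below 100% mean you are paying for CPUs that sit idle.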

You may also notice that perfect node CPU utilization was not achieved in every case. This will happen at times and is acceptable. Our goal is to improve node CPU utilization every time we cost tune rather than trying to get it perfect.

Actual Job Costs

Determining actual job costs is pretty difficult. At Expedia, we built pipelines that combine hourly cost data from AWS with job-level cost allocations from Qubole to determine actual job-level costs.

For those who don’t have pipelines built to determine their job costs, check with your Data Management Platform. Qubole recently introduced Big Data Cost Explorer to help its users easily identify job costs. For EMR users, AWS provides Cost Explorer, which you can learn to set up via this link.

AWS EC2 Pricing

A breakdown of AWS EC2 pricing: mostly EC2 instance cost, but also charges for EBS volumes, data transfer, and the data management platform

Let’s dig a little into AWS pricing for Spark jobs since most platforms use AWS for their cloud computing needs. The biggest cost for Spark jobs is by far the EC2 instances/nodes used during the job. With that said, there are other charges from AWS that may impact the cost of your job in certain situations.

Data transfer: This charge applies to data transferred out of AWS and data transferred between AWS regions, billed per GB. Loading data into AWS and operating on data already in AWS are free.

EBS volumes: This charge covers AWS storage used when your Spark job persists datasets to EBS volumes. To be clear, it is not for writing data to S3; it applies when a Spark job persists data to disk. If no data is persisted, the charge is zero. It is billed per GB per month.

And finally, whatever Data Management Platform you are using (Qubole, EMR, Databricks, etc.) adds its own charge on top of your job. In EMR's case, a per-second, per-instance surcharge is added on top of your EC2 instance costs and also needs to be accounted for. Qubole adds a similar charge based on its QCU (Qubole Compute Unit) values.

Estimating job costs

For the purposes of this guide, I’m only going to focus on how to estimate the costs of your EC2 instances since that’s the biggest factor affecting the overall costs of your jobs. The other charges are either nominal or will scale linearly with your EC2 instance costs.

Before I do, I need to mention there are several methods for reducing EC2 instance costs, such as dynamic allocation and spot nodes, that we are going to ignore for this exercise. We are ignoring these additional cost-reduction methods because they don't affect the goal of what we are trying to achieve: aligning executors across the optimal number of nodes.

Because dynamic allocation complicates the estimation process (the number of executors can change wildly throughout a job), I recommend disabling dynamic allocation while you estimate job costs during cost tuning. We do this so we can gauge the efficiency of the executor. You can turn dynamic allocation back on when you are done cost tuning to save more money.
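As a minimal sketch, here is one way to pin the executor count while cost tuning; the executor sizing and count below are hypothetical placeholders, and the same settings can equally be passed as --conf options on spark-submit:

```python
from pyspark.sql import SparkSession

# Minimal sketch: disable dynamic allocation and fix the executor count
# so the number of executors (and therefore nodes) stays constant.
# All sizing values below are hypothetical placeholders.
spark = (
    SparkSession.builder
    .appName("cost-tuning-run")
    .config("spark.dynamicAllocation.enabled", "false")
    .config("spark.executor.instances", "20")       # fixed executor count
    .config("spark.executor.memory", "18g")         # hypothetical sizing
    .config("spark.executor.memoryOverhead", "2g")  # hypothetical sizing
    .config("spark.executor.cores", "5")
    .getOrCreate()
)
```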

We estimate job costs by doing the following (a worked sketch follows the list).

1) Determine how many executors fit on a node by dividing the available node memory by the total executor memory size (executor_memory + overhead memory).

2) Determine the EC2 cost for your node type by looking at this AWS EC2 Instance Pricing page.

3) Multiply the number of nodes used by the EC2 node cost by the run time (expressed in fractional hours).
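Putting the three steps together, here is a minimal sketch in Python. Every number in it (memory sizes, executor count, EC2 price, run time) is a hypothetical placeholder rather than a figure from an actual job:

```python
import math

# Hypothetical inputs; substitute your own job's values.
executor_memory_gb = 18          # spark.executor.memory
memory_overhead_gb = 2           # spark.executor.memoryOverhead
available_node_memory_gb = 60    # memory usable by executors on one node
executor_count = 20              # fixed executor count (dynamic allocation off)
ec2_price_per_hour = 0.75        # illustrative price; check the AWS pricing page
run_time_hours = 1.5             # run time in fractional hours

# Step 1: executors per node = available node memory / total executor size
executor_size_gb = executor_memory_gb + memory_overhead_gb
executors_per_node = math.floor(available_node_memory_gb / executor_size_gb)

# Steps 2 and 3: nodes needed, then nodes * node cost * run time
nodes_used = math.ceil(executor_count / executors_per_node)
estimated_cost = nodes_used * ec2_price_per_hour * run_time_hours

print(f"{executors_per_node} executors per node, {nodes_used} nodes, "
      f"estimated EC2 cost = ${estimated_cost:.2f}")
# 3 executors per node, 7 nodes, estimated EC2 cost = $7.88
```

Re-running the same arithmetic with a different executor size or node type lets you compare candidate configurations before submitting anything.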

While the costs calculated this way are only initial estimates, they are valuable because they let us compare the job costs of various configurations.

In the next part, we will look at how to determine efficient executor configurations for the EC2 nodes your batch Spark job runs on.

Series contents

Learn more about technology at Expedia Group
