A step-by-step overview of the cost tuning strategy

A person underlining “CONCLUSION”
Photo by Andrainik Hakobyan on Shutterstock

If you haven’t read the entire series of these Apache Spark cost tuning articles, the changes recommended in this summary may not make sense. To understand these steps, I encourage you to read Part 1, which provides the philosophy behind the strategy, and Part 2, which shows you how to determine the estimated costs for your Spark job. The full series is linked below, after this summary of the steps you should take:

  1. If the executor core count changed, then adjust the executor count…


How to resolve memory issues that happen when switching to efficient executor configs

An image representing the word ‘ERROR’ in binary code
Photo by vchal on Shutterstock

When switching to the cost-efficient executor configuration, sometimes your tasks will fail due to memory errors. In this blog, I will cover three fixes you can try whenever you face these errors. Before I do that, I will show how to look up task failures in the Spark UI. If you are already comfortable with identifying task failures in the Spark UI, jump to the Overhead Memory Error section.

Researching Failed Tasks

A quick refresher on the hierarchy of a Spark application. A Spark application is divided into jobs. Each job is divided into stages. And stages are divided into tasks.


Steps to follow when converting existing jobs to a cost-efficient config

An image representing a migration service key on a keyboard
Photo by kenary830 on Shutterstock

There are a number of things to keep in mind as you tune your Spark jobs. The following sections cover the most important ones.

Which jobs should be tuned first?

With many jobs to tune, you may be wondering which jobs to prioritize first. Jobs that have only one or two cores per executor make great candidates for conversion. Jobs that consume 3000 or more Spark core minutes are also good candidates. I define a Spark core minute as…

Executor count * cores per executor * run time (in minutes) = Spark core minutes
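The formula above is straightforward to script when triaging a backlog of jobs. A minimal sketch, with hypothetical example values (10 executors, 4 cores, a 90-minute run) chosen purely for illustration:

```python
def spark_core_minutes(executor_count: int,
                       cores_per_executor: int,
                       runtime_minutes: float) -> float:
    """Executor count * cores per executor * run time (in minutes)."""
    return executor_count * cores_per_executor * runtime_minutes

# 10 executors x 4 cores x 90 minutes = 3600 Spark core minutes,
# which clears the 3000-minute threshold mentioned above.
print(spark_core_minutes(10, 4, 90))
```

Any job scoring at or above 3000 by this measure is worth putting near the top of the tuning queue.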

Make sure you are comparing apples to apples

When converting an existing job to an efficient executor configuration, you will need to change your executor count whenever your executor core count changes.
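One way to keep the comparison apples to apples is to hold the job's total core count roughly constant when the per-executor core count changes. A hedged sketch of that adjustment (the 20-executor, 2-core starting point is hypothetical):

```python
def adjusted_executor_count(old_executors: int,
                            old_cores_per_executor: int,
                            new_cores_per_executor: int) -> int:
    """Pick a new executor count that preserves total core capacity."""
    total_cores = old_executors * old_cores_per_executor
    # Ceiling division so capacity never drops below the original.
    return -(-total_cores // new_cores_per_executor)

# 20 executors x 2 cores = 40 total cores;
# at 5 cores per executor that becomes 8 executors.
print(adjusted_executor_count(20, 2, 5))
```

With total cores held constant, any change in run time or cost can be attributed to the configuration itself rather than to a change in raw capacity.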


Find the most cost efficient executor configuration for your node

An image representing a control for efficiency
Photo by Sashkin on Shutterstock

CPUs per node

The first step in determining an efficient executor config is to figure out how many actual CPUs (i.e., not virtual CPUs) are available on the nodes in your cluster. To do so, you need to find out what type of EC2 instance your cluster is using. For our discussion here, we’ll be using r5.4xlarge, which according to the AWS EC2 Instance Pricing page has 16 CPUs.

When we submit our jobs, we need to reserve one CPU for the operating system and the Cluster Manager. …


I outline the procedure for working through cost tuning

Photo by Fabrik Bilder on Shutterstock

Below is a screenshot highlighting some jobs at Expedia Group™ that were cost tuned using the principles in this guide. I want to stress that no code changes were involved; only the spark-submit parameters were changed during the cost tuning process. Pay close attention to the Node utilization column, highlighted in yellow.


How I saved 60% of costs in an Apache Spark job, with no increase in job time and no decrease in data processed

Photo by Blackboard on Shutterstock

Until recently, most companies didn’t care how much they spent on their cloud resources. But in a COVID-19 world, companies like Expedia Group™ are reducing cloud spending where reasonable. While many Apache Spark tuning guides discuss how to get the best performance out of Spark, none of them ever discuss the cost of that performance.

This guide will discuss how to get the best performance with Spark at the most efficient cost. It will also discuss how to estimate the cost of your jobs and what makes up actual costs on AWS. …

Best practices for detecting bad data before it spreads

Photo by Mika Baumeister on Unsplash / help by Dinosoft Labs from the Noun Project

In an age when HomeAway processes petabytes of data on a daily basis, data quality is critical to ensuring the right decisions are being made with our data. But how does one know if their data is good? If polluted data gets introduced into an upstream data source, then downstream data sources will get polluted as well unless that bad data is detected. How can you ensure your data is good?

In a former role, I maintained a database that generated a metric which affected the annual bonuses of over 100 people. I had no system of truth to…

Brad Caffey

Staff Data Engineer at Expedia Group.
