Spark 3 Reduced Our EMR Cost by 40%

Deniz Parmaksız
Insider Engineering
4 min read · Jun 6, 2021

One great reason to upgrade to Spark 3 and EMR 6

Spark 3.0 was released in June 2020 and arrived at the AWS EMR service with the EMR 6.1 release in September 2020. After almost a year, we finally upgraded our Spark version to 3.1.1 and enjoyed a dramatic performance increase.

But why did it take us so long to perform the version upgrade? Well, we had a couple of major external blockers.

First of all, we waited quite a while for all of our third-party dependencies to release their Spark 3 versions. The last to the party was the Elasticsearch Spark connector, which made us wait until March 2021.

Secondly, a bug (SPARK-33398) in Spark 3.0.1 caused tree-based models in a PipelineModel to fail if the pipeline had been trained on Spark 2.x. That was another major blocker, as it broke our machine learning pipelines. The fix was included in Spark 3.1.1, which arrived with EMR 6.3 in May 2021.

Our data pipelines utilize Spark heavily for ETL and machine learning workloads. We use Spark for raw data processing, feature extraction, feature selection, dataset preparation, model training & tuning, and inference in our machine learning-backed products. Therefore, upgrading to Spark 3 to benefit from the performance improvements was a top priority for us.

Alongside core Spark enhancements such as the new Adaptive Query Execution, Spark 3.0 also resolves the infamous S3 rename-committer issue by adding high-performance S3A committers (see SPARK-23977). The previous committers wrote files to a temporary location in S3 and then renamed them to their final destination. Since S3 is an object store, a rename is implemented as a copy followed by a delete, which becomes costly very quickly with large amounts of data. This was a big pain point, and the new S3A committers reduced our S3 write time from minutes to seconds!
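To make this concrete, here is a minimal PySpark sketch of enabling one of the new committers. It assumes the spark-hadoop-cloud module is on the classpath and uses the directory staging committer; the bucket name is hypothetical, and this is an illustrative configuration rather than our exact production setup.

```python
from pyspark.sql import SparkSession

# Minimal sketch: enable an S3A committer for Parquet output.
# Assumes the spark-hadoop-cloud module (and hadoop-aws) are on the classpath.
spark = (
    SparkSession.builder
    .appName("s3a-committer-demo")
    # Pick the "directory" staging committer ("magic" and "partitioned" also exist).
    .config("spark.hadoop.fs.s3a.committer.name", "directory")
    # Route Spark SQL output through the Hadoop PathOutputCommitter machinery.
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BinaryParquetOutputCommitter")
    .getOrCreate()
)

# Writes now upload file parts directly and complete a multipart upload on
# commit, instead of relying on costly copy-and-delete renames in S3.
df = spark.range(1_000_000)
df.write.mode("overwrite").parquet("s3a://my-bucket/demo-output/")  # hypothetical bucket
```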

One more capability that EMR brings is the ability to use Graviton2 instances, available since EMR 6.1 (and EMR 5.31 if you still want to try them on Spark 2 workloads). Graviton2 instances such as M6g, C6g, and R6g provide 30% lower on-demand cost and 10% lower spot instance cost. So do not miss the benefit of Graviton2 instances.
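If you want to try them, here is a hypothetical boto3 sketch of launching an EMR cluster on Graviton2 instances. All names, roles, and instance counts below are placeholder assumptions, not our production values.

```python
import boto3

# Hypothetical sketch: launch an EMR 6.3 cluster on Graviton2 (m6g/r6g) instances.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="spark3-graviton2-demo",
    ReleaseLabel="emr-6.3.0",          # ships Spark 3.1.1
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "Master", "InstanceRole": "MASTER",
             "InstanceType": "m6g.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "r6g.2xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",   # placeholder IAM roles
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```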

Before upgrading to Spark 3, we performed lots of functional and performance tests. Since the S3A committer alone makes a great difference, we decided to measure the performance increase both as total time reduction and as normalized time reduction, where normalized time is simply the total job time minus the time spent on S3 and database writes.
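To spell the metric out, here is a tiny illustrative helper; the function and variable names are ours, not from any benchmarking tool.

```python
# Illustrative helper for the two metrics described above.
# Normalized time = total job time minus S3/database write time,
# so it isolates the compute-side improvement from the committer speedup.

def reductions(total_before, write_before, total_after, write_after):
    """All arguments are durations in seconds."""
    total_reduction = 1 - total_after / total_before
    norm_before = total_before - write_before
    norm_after = total_after - write_after
    normalized_reduction = 1 - norm_after / norm_before
    return total_reduction, normalized_reduction

# Example: a 60-minute job with 10 minutes of writes drops to 35 + 1 minutes.
total, normalized = reductions(3600, 600, 2160, 60)
print(f"total: {total:.0%}, normalized: {normalized:.0%}")  # total: 40%, normalized: 30%
```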

We tested tens of different Spark applications and saw a considerable performance increase in most of them. The results for the most common use cases and operations are listed below.

  • The data aggregation jobs mainly read different datasets from S3, perform various join and aggregation operations on top of them, and write the results back to S3 (a minimal sketch of this pattern follows the list). As the final datasets are in the order of hundreds of gigabytes, we saw great time reductions (57%-77%) thanks to the S3A committers. The compute performance increase (normalized time) is also pretty good, ranging from 12% up to 28%.
  • The model training job reads a dataset from S3 and trains a machine learning model using the Spark MLlib library. This is a broad use case, and we experienced a 37% decrease in training time. Since no S3 writes are involved, there is no separate normalized time for the model training test.
  • The inference job loads a pre-trained machine learning model and reads a dataset from S3 to perform batch inference. The inference time decreased by 32% in our tests. Since the inference results are also written back to S3, the overall time reduction including the persistence stage is 63%.
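As promised above, here is a hedged PySpark sketch of the aggregation pattern from the first bullet. The bucket paths and column names are hypothetical, chosen purely for illustration; this is not our production code.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("daily-aggregation").getOrCreate()

# Hypothetical inputs: paths and column names are illustrative only.
events = spark.read.parquet("s3a://my-bucket/events/")
users = spark.read.parquet("s3a://my-bucket/users/")

# Join the datasets, then aggregate per user and day.
daily = (
    events.join(users, on="user_id", how="inner")
    .groupBy("user_id", F.to_date("event_time").alias("day"))
    .agg(F.count("*").alias("event_count"),
         F.sum("revenue").alias("revenue"))
)

# The final write is where the new S3A committers pay off most.
daily.write.mode("overwrite").partitionBy("day").parquet(
    "s3a://my-bucket/daily-aggregates/")
```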
Job duration chart for a data aggregation job from the production environment; the blue line marks the migration day.

As you can see, the test results were mind-blowing. The actual runtimes of jobs dropped by anywhere from 20% up to 80%, which should reduce EMR costs dramatically. Indeed, our actual EMR costs went down by 40% after the migration was completed.

Apache Spark is a great engine for processing large volumes of data and performing machine learning on top of it. Amazon EMR makes it easy to set up, operate, and scale Apache Spark environments by automating capacity provisioning and cluster management. Additionally, the EMR runtime provides up to 3x the performance of standard Apache Spark. Now, with Spark 3, Amazon EMR becomes much cheaper for running Spark workloads.

Deniz Parmaksız
Sr. Machine Learning Engineer at Insider | AWS Ambassador