Benchmarking Amazon EMR vs Databricks

Deniz Parmaksız
Insider Engineering
8 min read · Jan 31, 2022

At Insider, we use Apache Spark as the primary data processing engine to mine our clients’ clickstream data and feed ML-ready data into our machine learning pipelines to enable personalization. We have been using Spark since version 1.5 and are always looking for ways to improve efficiency. If you are interested, check out our blog post about how Spark 3 reduced our Amazon EMR cost by 40%. To further improve our platform’s efficiency, we decided to conduct a trial of the Databricks platform.

AWS and Databricks company logos.

Current Spark Usage and Pain Points

Before moving forward with the Databricks platform and the benchmarks, let’s look at how we utilize Apache Spark and Amazon EMR, and the pain points, to better understand our current setup and challenges.

Structured Streaming and State Management

First of all, we run a couple of Structured Streaming applications that read real-time data from Kinesis Data Streams, transform the data, and write the results to databases according to their purposes. The first issue is that there is no official library for using a Kinesis Data Stream as a data source for a Structured Streaming application. You can only use the old DStream integration, which relies on the RDD API that we prefer to avoid. There is a third-party connector from Qubole that we currently use to leverage the DataFrame API with Kinesis Data Streams. The Databricks platform, on the other hand, offers Kinesis Data Streams integration for Structured Streaming out of the box.

Kinesis Data Streams and Spark Streaming integration. (Source)
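As a rough illustration, here is a minimal Structured Streaming read from Kinesis using the DataFrame API, assuming the Databricks-provided “kinesis” source (the Qubole connector uses the same format name but slightly different options); the stream name, region, and sink below are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kinesis-streaming").getOrCreate()

# Read the clickstream as a streaming DataFrame. The "kinesis" source shown
# here is the one bundled with the Databricks Runtime; open-source Spark needs
# a third-party connector for the same DataFrame-based integration.
events = (
    spark.readStream
    .format("kinesis")
    .option("streamName", "clickstream-events")  # placeholder stream name
    .option("region", "eu-west-1")                # placeholder region
    .option("initialPosition", "latest")
    .load()
)

# The record payload arrives as binary in the `data` column.
parsed = events.selectExpr("CAST(data AS STRING) AS payload")

# Write transformed results to a sink (console here for illustration only).
query = (
    parsed.writeStream
    .format("console")
    .outputMode("append")
    .start()
)
```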

Other than the native library support, there is also the performance aspect. As we run an aggregation query with a wide time window, there are tens of millions of states to be maintained. Once there are more than 1–2 million states per executor, streaming performance starts to degrade due to long garbage collection pauses, because Spark keeps those states in the JVM heap. Databricks has a solution for keeping the states outside the JVM, in RocksDB, a key-value store from Facebook Open Source, which enables keeping up to 100 million states per executor without GC issues. The good news is that with the latest release (Spark 3.2) this feature is included in the open-source version as well, which was not the case during our trials.
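On open-source Spark 3.2+, switching streaming state to RocksDB is a one-line configuration; a minimal sketch is shown below (Databricks enables its own RocksDB-backed provider through a different, Databricks-specific class name):

```python
from pyspark.sql import SparkSession

# Configure the streaming state store to use RocksDB so aggregation state is
# kept off the JVM heap, avoiding long GC pauses with tens of millions of keys.
spark = (
    SparkSession.builder
    .appName("stateful-streaming")
    .config(
        "spark.sql.streaming.stateStore.providerClass",
        "org.apache.spark.sql.execution.streaming.state."
        "RocksDBStateStoreProvider",
    )
    .getOrCreate()
)
```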

Spark library stack.

ETL and Machine Learning

Most of the compute hours we spend on Spark go to ETL and machine learning workloads. Dozens of ETL jobs in various data pipelines continuously run on raw and intermediate data to generate analytics, insights, clean data, ML-ready data, and so on. Spark is the workhorse of our machine learning pipeline as well, with numerous Spark applications for tasks like feature extraction, feature selection, model training, inference, and recommendations. As we run hundreds of clusters for thousands of Spark jobs with varying dataset sizes, optimizing these operations, and hence the cost, is another challenge.

Differentiations of the Databricks Platform

The Databricks platform has some interesting features that differentiate it from the public cloud providers’ services. The most obvious one is the ability to use new Apache Spark versions quickly (I am looking at you, EMR). Let’s check out the other interesting ones.

Photon Query Engine

Databricks Runtime with Photon from the Databricks website.

For me, the Photon Runtime is by far the most appealing feature of the Databricks platform. Photon is a new native vectorized engine written entirely in C++ for sheer performance, and it is fully compatible with Apache Spark APIs. The goal is to match data warehouse performance without having your data in a data warehouse, aligning with the company’s lakehouse vision. It is exciting to accelerate existing workloads by simply changing the underlying runtime.

However, as the technology is currently in preview, there are several limitations to consider, such as the lack of support for window and sort operators, streaming applications, and UDFs. Keep in mind that to benefit from the Photon runtime, you should have big datasets and long-running queries: the performance gain on small data is minimal, and the cost of the runtime is almost triple that of the standard Apache Spark runtimes.

Delta Engine

Delta Engine modules from Databricks website.

Then there is the Delta Engine with the Delta Lake optimizations, which packs a native execution engine (Photon) together with enhancements to Spark’s query optimizer and caching capabilities. Again, the goal is to accelerate performance and enable easier scaling on a lakehouse architecture. The Delta Engine in Databricks Runtime caches input data on local SSDs in a more CPU-efficient format for better query performance, automatically and without the user managing the caching strategy, unlike Spark’s in-memory cache.
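As a minimal sketch of the difference: the Databricks disk cache is toggled with a Databricks-specific config and managed automatically, whereas Spark’s in-memory cache is managed by the user. The table path below is a placeholder, and `spark` is assumed to be the active SparkSession:

```python
# Databricks-specific: enable the automatic SSD-backed cache for Delta/Parquet
# input data. This config has no effect on open-source Spark.
spark.conf.set("spark.databricks.io.cache.enabled", "true")

# Contrast with Spark's user-managed in-memory cache:
df = spark.read.format("delta").load("/mnt/lake/events")  # placeholder path
df.cache()      # pin the data in executor memory
df.count()      # materialize the cache
df.unpersist()  # the user must also decide when to release it
```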

Auto Optimize for Delta Lake

An illustration of how Auto Optimized writes work.

Another cool feature for Delta Lake is Auto Optimize, one of my favorites. It automatically compacts small files while writing to a Delta table. Of course, there is an overhead to optimizing the partitions, but if the table is frequently read, the benefit is significantly higher, as small files lead to performance degradation. Databricks dynamically optimizes Apache Spark partition sizes based on the actual data and attempts to write out 128 MB files for each table partition, which largely avoids the small-files problem.
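A minimal sketch of enabling Auto Optimize, either per table through Delta table properties or session-wide; the table name is hypothetical:

```python
# Enable optimized writes and auto compaction on an existing Delta table.
spark.sql("""
    ALTER TABLE analytics.clickstream_features
    SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")

# Or enable the same behavior for all Delta writes in the current session.
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")
```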

Z-Ordering for Delta Lake

An example iteration for the Z-order curves from Wikipedia.

Last but not least, Delta Lake has a feature called Z-Ordering, a technique to colocate related information in the same set of files. This co-locality is automatically used by the data-skipping algorithms of Delta Lake on Databricks to dramatically reduce the amount of data that needs to be read. To enable it, you specify the columns to order in a ZORDER BY clause of an OPTIMIZE query. For the best performance, the suggested columns are those that are frequently used for filtering in read operations and have high cardinality.
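A minimal example, assuming a hypothetical table where `user_id` is a high-cardinality column that most read queries filter on:

```python
# Compact the table and cluster its files by user_id so that filtered reads
# can skip most files via data-skipping statistics.
spark.sql("OPTIMIZE analytics.clickstream_features ZORDER BY (user_id)")

# A typical read that benefits from the Z-Ordering above.
df = spark.sql(
    "SELECT * FROM analytics.clickstream_features WHERE user_id = '42'"
)
```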

Pricing

In order to run the Databricks Runtime, you need to use a public cloud provider such as AWS, GCP, or Microsoft Azure. As we are comparing the platform with Amazon EMR, the choice is obvious.

Both Amazon EMR and Databricks Runtime run on EC2 instances, therefore you are billed for all underlying EC2 costs on AWS. The Amazon EMR service adds an hourly fee that depends on the selected instance type.

On the other hand, pricing is more complicated for Databricks: there are three pricing tiers and five different compute types within each tier, and the DBU price also depends on the selected instance type. Running notebooks requires the All-Purpose Compute type, which is insanely expensive ($0.99 per hour for an r5.2xlarge).

All the performance tests and pricing comparisons below are based on the Premium tier and the Jobs Compute type, which is 114% more expensive than the Amazon EMR fee (calculated from the hourly service prices for an r5.2xlarge instance: $0.126 for Amazon EMR and $0.270 for Databricks).
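For reference, the 114% figure follows directly from the two hourly service fees quoted above (EC2 cost is excluded here, since it is identical on both platforms):

```python
emr_fee_per_hour = 0.126  # Amazon EMR surcharge for an r5.2xlarge
dbr_fee_per_hour = 0.270  # Databricks Premium tier, Jobs Compute, r5.2xlarge

premium = (dbr_fee_per_hour - emr_fee_per_hour) / emr_fee_per_hour
print(f"Databricks service fee premium: {premium:.0%}")  # -> 114%
```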

Performance Tests

We have performed several benchmarks with the core jobs of our machine learning pipeline. The runtime versions are Amazon EMR 6.3 and Databricks DBR 9.0. All clusters run a single r5.xlarge master instance and five worker instances, with different worker instance types as test cases.

Cost calculations include both the service (EMR/DBR) and EC2 costs, which is what you pay at the end of the day. All tests are performed on optimized Delta tables, and the baseline is non-optimized Hive tables stored in ORC format. The Photon runtime is not benchmarked, as most of our use cases are not supported due to its current limitations, and as a result we did not see an improvement.

The following charts show the results of the performance tests. Each chart represents a single job. The x-axis shows the platform and the worker instance type, while the y-axis shows the percentage of runtime speedup (blue bars) and cost savings (red bars). For both metrics, higher is better.
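As a rough sketch of how such metrics can be computed for one job and one cluster configuration (the runtimes and hourly costs below are made up for illustration; the actual chart values come from our internal runs):

```python
# Baseline: the job on non-optimized Hive/ORC tables.
baseline_runtime_h = 1.00    # hypothetical runtime in hours
baseline_cost_per_h = 3.15   # hypothetical cluster cost per hour (EC2 + fee)

# Test case: the same job on an optimized Delta table on EMR or Databricks.
test_runtime_h = 0.60
test_cost_per_h = 3.90

speedup = (baseline_runtime_h - test_runtime_h) / baseline_runtime_h
cost_savings = 1 - (test_runtime_h * test_cost_per_h) / (
    baseline_runtime_h * baseline_cost_per_h
)

print(f"runtime speedup: {speedup:.0%}, cost savings: {cost_savings:.0%}")
```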

This feature extraction job generates high-level feature vectors from raw data to be stored in our feature store. It is a big aggregation query with a hundred different combinations of aggregation functions and input columns. This test run generated 20 million aggregate rows. Databricks delivers the biggest speedup, but the slowest EMR cluster was the cheapest option.

This ETL job reads from a table with tens of thousands of partitions and performs aggregations and some computations. The Delta table enabled a dramatic speedup, as partition discovery from the Hive metastore seemingly causes a bottleneck. Even the non-optimized Delta table yielded a 30–40% performance improvement over the Hive table, which is impressive. Both Amazon EMR and Databricks performed similarly on the optimized Delta table, which makes Amazon EMR the cheaper option.

This other ETL job reads data from a table with far fewer partitions and performs fewer aggregations but more join operations. The overall speedup is lower than in the previous example, with Databricks performing better but Amazon EMR being cheaper.

The model training job reads an ML-ready dataset and trains a Spark ML model on it. As expected, the speedup is not as significant as for the data transformation jobs. Databricks achieves better performance for Spark model training, but at a higher price.

This batch inference job reads a trained Spark ML model and a feature vector dataset to perform inference on the dataset. Similar to the model training task, Databricks performs better, with Amazon EMR being cheaper.

When we evaluate the overall results, the Databricks Runtime performs better on Delta tables when the suggested configurations are applied. However, the performance-optimized EMR runtime for Apache Spark comes close, and because DBUs cost at least twice as much as the EMR fee, Amazon EMR ends up cheaper for our cases.

Conclusion

The Databricks platform has great features that lighten the load on engineers’ shoulders. But when it comes to cost, our tests showed that we can achieve similar cost optimizations on Amazon EMR with Delta tables as well. The Databricks platform offers practical out-of-the-box capabilities for data scientists and teams that do not have an infrastructure for data engineering and machine learning. It also makes getting to production easier with scheduled notebook jobs and managed MLflow. But if you manage your own cloud architecture for data pipelines and are looking for a service to run your Spark jobs, as we do, Amazon EMR appears to be the cheaper solution, and it also offers additional platform features such as EMR on EKS and EMR Serverless.

Deniz Parmaksız
Sr. Machine Learning Engineer at Insider | AWS Ambassador