Lakehouse — Databricks vs. AWS EMR

Robert Kossendey
Published in claimsforce
6 min read · Jan 5, 2023

Disclaimer: The decision on which ETL tool to use was made in May, when EMR Serverless was still in preview and did not yet support Delta Lake natively.

About this blog series

At claimsforce, our initial approach to big data was a two-tier architecture consisting of a Data Lake stage in Amazon S3 and a Data Warehouse stage in Amazon Redshift (outlined here). Over time we realized that having two stages comes with disadvantages like engineering and maintenance effort, infrastructure costs, and data staleness. We aim to replace the combination of a Data Lake and Data Warehouse with a unified system — the Lakehouse. In this blog series, we will document our journey toward a Lakehouse setup.

Recapitulation

The first article of this blog series outlined the challenges we are facing with the traditional two-tier architecture. We discovered how the Lakehouse approach helps us overcome those challenges and came up with an idea for our future data infrastructure. The second article explained how we ran Delta Lake on top of AWS Glue and the caveats that came with it. After that, we concluded that we had to look for a new ETL tool.

Finding the Perfect ETL Tool

As part of our search for the best ETL tool for our data pipelines, we looked at several different options. The main requirement was that the ETL tool runs PySpark code, since we wanted to reuse our existing scripts unchanged after the migration.

We started by compiling a list of potential tools and comparing their features and capabilities. Because of their integration with S3, developer experience, scalability, performance, and pricing, we narrowed our choices to a shortlist of two tools: Amazon EMR and Databricks. We then took a closer look at these two options, comparing them in different aspects against our current tool, AWS Glue. As evaluation criteria, we considered ease of use, features, vendor lock-in, and pricing. The following presents the results of the comparison.

Ease of use

Both solutions support more Spark versions than Glue, including more recent ones. That is useful if you want to take advantage of the latest features and improvements in Spark, or when you need a specific version to support certain integrations or features. On the other hand, EMR and Databricks require more configuration and setup than Glue: you have to choose and configure instance types and sizes and select software and framework versions. While this level of control can be beneficial in some cases, it also requires more time and expertise to get things set up and running smoothly.

EMR’s developer experience could be better, especially with Serverless. It is less intuitive and more laborious compared to Glue. That makes it harder to build and maintain data pipelines, especially for developers who are new to the platform.

Databricks has built-in support for Delta Lake, so you do not have to set up the dependencies or manage compatible versions. In addition, the Databricks runtime environment already includes many commonly used libraries, like Boto3, NumPy, and seaborn. That makes setting up and starting data processing easier, as you don't need to install and manage these dependencies separately. It provides a collaborative notebook environment for Data Scientists, Data Engineers, and Analysts to work together on data projects. When we first tried it out, we were impressed by how easily the collaboration worked. Also, the fact that you have a REPL-like environment for SQL, Python, and Scala code makes experimenting with data a breeze. Databricks has a very user-friendly interface and comprehensive documentation, making it easy for developers to get started.
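For comparison, outside Databricks you typically have to pin a Delta Lake package that matches your Spark version yourself. A minimal sketch of what that looks like when submitting a job on EMR or Glue — the version mapping below is illustrative, so verify it against the official Delta Lake compatibility matrix before relying on it:

```python
# Sketch: assembling a spark-submit invocation that pins Delta Lake manually,
# as you would on EMR or Glue. On Databricks the runtime already ships Delta.

# Illustrative mapping of Spark releases to compatible delta packages;
# check the Delta Lake compatibility matrix for the authoritative versions.
DELTA_FOR_SPARK = {
    "3.3": "io.delta:delta-core_2.12:2.3.0",
    "3.4": "io.delta:delta-core_2.12:2.4.0",
    "3.5": "io.delta:delta-spark_2.12:3.1.0",
}

def build_submit_command(spark_version: str, script: str) -> list[str]:
    """Build a spark-submit command with the matching Delta package pinned."""
    package = DELTA_FOR_SPARK[spark_version]
    return [
        "spark-submit",
        "--packages", package,
        "--conf", "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension",
        "--conf", "spark.sql.catalog.spark_catalog="
                  "org.apache.spark.sql.delta.catalog.DeltaCatalog",
        script,
    ]

cmd = build_submit_command("3.4", "etl_job.py")
print(" ".join(cmd))
```

On Databricks, none of this bookkeeping is necessary — the runtime version implies the Delta version.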

Features

EMR supports a broader range of processing engines and big data frameworks, including not only Spark but also Hadoop, Flink, Trino, and others. That gives you more flexibility and choice when extending your data architecture.

One disadvantage of using EMR is that it requires an additional orchestration service, such as AWS Step Functions, Amazon MWAA, or another tool, to manage the data processing jobs and workloads. That adds a layer of complexity to your architecture and requires additional setup and maintenance. By contrast, Glue provides a natively integrated orchestration service, which makes it easier to set up and manage your ETL jobs.

Like Glue, Databricks has built-in orchestration capabilities, allowing users to schedule and automate data pipelines. However, Databricks also allows for the orchestration of SQL queries, dbt, or even ML models.

The Unity Catalog in Databricks provides a centralized location for storing, accessing, and managing data assets. The catalog also includes lineage capabilities, allowing users to track the history of data transformations and the lineage of data assets on a column level. Additionally, Databricks natively supports MLflow for Data Scientists and offers serverless SQL processing for Business Analysts. That means we could switch out AWS Athena in the future, unleashing the power of all the Delta functionalities that Athena currently does not support.

Vendor Lock-In

Because EMR and Databricks have service-specific configurations, both come with a degree of vendor lock-in. Databricks offers more custom functionalities like Delta Live Tables, the Unity Catalog, or the proprietary processing engine Photon. Since there is no open-source version of those services, Databricks has more vendor lock-in than EMR.

Costs

EMR is generally cheaper than Glue, especially for large data processing tasks. There are two reasons for that. Firstly, compute hours simply cost less than on Glue. Secondly, EMR supports spot pricing, which lets you use spare Amazon EC2 capacity at a potentially significant discount on your compute costs. Using spot instances is fine with Spark applications since Spark handles node failures in a fault-tolerant manner.
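As an illustration, requesting spot capacity on EMR is a matter of instance-fleet configuration. The sketch below shows the shape of the instance-fleets block you would pass to `run_job_flow`, keeping the master on-demand and running the core nodes on spot; instance types, capacities, and the timeout are placeholders, not a recommendation:

```python
# Sketch of the instance-fleets portion of an EMR run_job_flow request:
# on-demand master, spot core nodes. All concrete values are placeholders.
instance_fleets = [
    {
        "Name": "master",
        "InstanceFleetType": "MASTER",
        "TargetOnDemandCapacity": 1,
        "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}],
    },
    {
        "Name": "core",
        "InstanceFleetType": "CORE",
        "TargetSpotCapacity": 4,
        "InstanceTypeConfigs": [
            # Offering several instance types raises the chance of
            # actually obtaining spot capacity.
            {"InstanceType": "m5.xlarge"},
            {"InstanceType": "m5a.xlarge"},
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                "TimeoutDurationMinutes": 10,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        },
    },
]

spot_fleets = [f for f in instance_fleets if f.get("TargetSpotCapacity", 0) > 0]
print(f"{len(spot_fleets)} fleet(s) request spot capacity")
```

The `SWITCH_TO_ON_DEMAND` timeout action is what makes this safe for scheduled pipelines: if spot capacity is unavailable, the job still runs, just at on-demand prices.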

Databricks is also more cost-effective than AWS Glue, especially when leveraging AWS Graviton instances and enabling the vectorized query engine Photon, which can substantially speed up large processing jobs. Like EMR, Databricks supports spot instances, which reduce costs. Because it is a separate product from AWS, it has its own billing and pricing model, making it more cumbersome to manage and track expenses compared to using a single tool integrated with the rest of the AWS ecosystem.

EMR vs. Databricks

In summary, Databricks and EMR are both mature and popular options for data processing and analysis in the cloud, making them valid replacements for AWS Glue. EMR has the advantage of supporting a broad range of processing engines and big data frameworks, as well as being generally cheaper than Databricks. Being part of the AWS ecosystem, EMR allows us to have unified billing and leverage our existing IAM integration to enforce the principle of least privilege. However, it requires additional orchestration and data governance services and has a worse developer experience.

In contrast, Databricks has many features that increase developer productivity and enable Analysts and Data Scientists to work on the same platform and data while offering state-of-the-art security mechanisms. Nonetheless, it is more expensive while having the potential for vendor lock-in with proprietary technologies.

Decision

Comparing the two options, we concluded that Databricks would be the better fit for us. The few arguments speaking in favor of EMR, like unified billing and easy IAM integration, were not enough to convince us. What we especially liked about Databricks is the holistic approach they chose, instead of the variety of thin products from AWS, like Lake Formation, MWAA, and Glue, that you would need to integrate first.

In the last part of this series, we will talk about the migration to Databricks and give our final verdict on the journey.

written by: Johannes Kotsch & Robert Kossendey
