Lakehouse — A resumé

Robert Kossendey
Published in claimsforce
Mar 19, 2023 · 5 min read

About this blog series

At claimsforce, our initial approach to big data was a two-tier architecture consisting of a Data Lake stage in Amazon S3 and a Data Warehouse stage in Amazon Redshift (outlined here). Over time we realized that having two stages comes with disadvantages such as engineering and maintenance effort, infrastructure costs, and data staleness. We aim to replace the combination of a Data Lake and Data Warehouse with a unified system — the Lakehouse. In this blog series, we will document our journey toward a Lakehouse setup.

Recapitulation

The first article of this blog series outlined the challenges we were facing with the traditional two-tier architecture, showed how the Lakehouse approach helps us overcome them, and sketched an idea for our future data infrastructure. The second article explained how we ran Delta Lake on top of AWS Glue and which caveats that approach had. In the third article, we concluded that we had to look for a new ETL tool and compared two promising alternatives, Databricks and AWS EMR.

With Databricks as our tool of choice, we now dive into the last piece of this series, which covers the migration process and a cost comparison.

Migration to Databricks

Migrating ETL jobs can be dangerous. Dashboards, ML services, and thus end users depend on the tables fed by those jobs. Since we were modifying our architecture in two places, changing our ETL tool and switching from Redshift to Athena, we had to be especially careful not to break existing user workflows. To minimize risk, we ran the new Databricks jobs, which write to S3, in parallel with our existing Glue jobs, which wrote into Redshift. For quality assurance, we loaded data from both sources into Spark and used the great chispa framework to make sure the DataFrames contained the same data.
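A minimal sketch of how that comparison can look, assuming the Redshift data has been unloaded to S3 as Parquet; bucket names, paths, and options are illustrative:

```python
from pyspark.sql import SparkSession
from chispa import assert_df_equality

spark = SparkSession.builder.getOrCreate()

# Hypothetical locations: the Delta table written by the new Databricks job
# and the same table exported from Redshift (e.g. via UNLOAD to Parquet).
delta_df = spark.read.format("delta").load("s3://example-bucket/claims_delta/")
redshift_df = spark.read.parquet("s3://example-bucket/claims_redshift_unload/")

# chispa raises a descriptive error if the two DataFrames differ.
# ignore_row_order / ignore_column_order make the check independent of
# partitioning and column ordering.
assert_df_equality(
    delta_df,
    redshift_df,
    ignore_row_order=True,
    ignore_column_order=True,
)
```

When the DataFrames diverge, chispa prints a readable row-by-row diff, which makes it easy to spot which jobs produce different output.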

To keep track of which files we had already processed, we had used the job bookmarks feature of Glue. As this is unavailable on Databricks, we had to look for an alternative mechanism that allows incremental processing. As it turns out, Spark offers such a feature out of the box if you use the Structured Streaming API.

One common misconception is that Spark Streaming automatically means expensive, always-on clusters. Instead, you can run incremental jobs using the Trigger.Once option, which processes everything that has arrived since the last run and then shuts down. Unfortunately, we were unaware of Databricks Auto Loader back then, which does essentially the same thing but is more scalable and performant, and potentially cheaper, because it uses native cloud APIs to list the files that exist in storage.
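A minimal sketch of that pattern, assuming a Parquet landing zone on S3; the schema, paths, and the Auto Loader comment are illustrative:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.getOrCreate()

# Hypothetical schema of the incoming files.
claims_schema = StructType([
    StructField("claim_id", StringType()),
    StructField("amount", DoubleType()),
])

# The streaming checkpoint tracks which files have already been processed,
# taking over the role of Glue job bookmarks.
source = (
    spark.readStream
    .format("parquet")  # Auto Loader would use .format("cloudFiles") instead
    .schema(claims_schema)
    .load("s3://example-bucket/raw/claims/")
)

# Trigger.Once processes all new data once and then stops, so the cluster
# only runs for the duration of the batch.
(
    source.writeStream
    .format("delta")
    .option("checkpointLocation", "s3://example-bucket/checkpoints/claims/")
    .trigger(once=True)
    .start("s3://example-bucket/lakehouse/claims/")
    .awaitTermination()
)
```

Scheduling this job at a fixed interval gives incremental, batch-like behavior without keeping a streaming cluster alive.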

Integrating with Athena

Athena did not natively support Delta Lake tables back when we migrated our infrastructure. That meant Athena could not leverage features like data skipping and could not tell which files had been tombstoned and were no longer part of the table. To prevent Athena from reading stale data, you had to create a manifest file containing the list of files that make up the table's latest version. We therefore wrote a helper function that regenerates the manifest for a given table every time the table is updated.
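A minimal sketch of such a helper, built on Delta Lake's symlink manifest generation; the table path is illustrative:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def update_manifest(table_path: str) -> None:
    """Regenerate the symlink manifest so that Athena only sees the files
    belonging to the latest version of the Delta table."""
    DeltaTable.forPath(spark, table_path).generate("symlink_format_manifest")

# Called after every write to the table, e.g.:
update_manifest("s3://example-bucket/lakehouse/claims/")
```

Alternatively, setting the table property delta.compatibility.symlinkFormatManifest.enabled to true makes Delta Lake keep the manifest up to date automatically after each write.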

At re:Invent 2022, AWS announced that Athena can now natively read from Delta Lake tables. There are still some caveats: DML is not supported, and you need to register the table in the Glue Data Catalog. But all in all, this is a vast improvement over the previous situation and a significant step in the right direction.
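One way that registration can look, sketched with boto3; the database, table, bucket, and region names are illustrative, and the DDL follows the pattern Athena documents for Delta Lake tables:

```python
import boto3

athena = boto3.client("athena", region_name="eu-central-1")

# Register the Delta table in the Glue Data Catalog so Athena can read it
# natively. No column list is needed; the schema comes from the Delta log.
ddl = """
CREATE EXTERNAL TABLE lakehouse.claims
LOCATION 's3://example-bucket/lakehouse/claims/'
TBLPROPERTIES ('table_type' = 'DELTA')
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "lakehouse"},
    ResultConfiguration={"OutputLocation": "s3://example-bucket/athena-results/"},
)
```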

Outlook

Finally, let us look into the future at what we could still improve. One big goal is migrating from Athena to Databricks SQL, since Databricks now offers serverless querying. That would enable us to make full use of Unity Catalog, which provides sophisticated security features and automated lineage tracking.

Additionally, we want to contribute to open-source projects centered around Delta Lake, like Mack, Jodie, or Levi, since this migration made us appreciate open-source tech even more than before.

Cost Comparison

Now to the fun part: looking at the costs and evaluating whether the migration was successful. First, a look at the costs generated by AWS Glue versus Databricks plus the underlying EC2 instances.

As you can see, we reduced our costs by nearly a factor of eight! This comes down to Glue simply being more expensive than Databricks and EC2 combined, and to the fact that we can now run multiple jobs on the same cluster, which cuts the cost of cluster start-up time. We also used the migration to do some housekeeping, merging jobs and removing jobs that are no longer needed. Furthermore, the Photon engine was able to speed up complex jobs and lower total costs.

Looking at Redshift versus Athena, our costs also decreased by 75%. Since our querying workload consists mainly of report queries running at set intervals rather than ad hoc analytics, we benefit tremendously from the switch to serverless.

Conclusion

Overall, we are more than satisfied with the outcome of our Lakehouse migration.

We reduced our overall costs by 80% while improving our developer experience drastically. We don’t have to maintain a Redshift cluster anymore. Instead, we store all the data in a single place, S3. Further, the core of our infrastructure is powered by open source, namely Apache Spark and Delta Lake. That empowers us to move away from Databricks if we are ever unhappy with the service.

I hope you enjoyed our short journey. If you’re interested, there is a tech talk with Denny Lee and me where we dig a little deeper into Delta Lake.

Written by Johannes Kotsch & Robert Kossendey
