AWS Glue + Apache Iceberg

Bringing ACID operations to AWS Glue and SparkSQL

Robert Sanders
Software Sanders

--

Motivation

At Clairvoyant, we work with a large number of customers that use AWS Glue for their daily ETL processes. Many of these Glue jobs leverage SparkSQL statements to make transformations easier to understand and more readable.

We’ve been looking for ways to make these SparkSQL operations even easier, mainly by providing ACID operations (UPSERTs, INSERTs, UPDATEs, and DELETEs) in SparkSQL. For example, a DELETE is simpler to understand and execute than an INSERT OVERWRITE back into the table, as you would typically do in Spark. There is a newer open table format that provides just that: Apache Iceberg.
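To make the DELETE versus INSERT OVERWRITE comparison concrete, here is a minimal sketch of the two statements side by side. The table name `my_catalog.db.events`, the column `event_date`, and the helper `delete_rows` are hypothetical names chosen for illustration, and the sketch assumes a SparkSession that is already configured with an Iceberg catalog.

```python
# Hypothetical table and column names, used only for illustration.
TABLE = "my_catalog.db.events"

# Without row-level operations: rewrite the whole table, keeping every row
# except the ones to drop. The null-safe operator <=> keeps rows whose
# event_date is NULL, which a plain <> filter would silently discard.
INSERT_OVERWRITE_SQL = f"""
INSERT OVERWRITE TABLE {TABLE}
SELECT * FROM {TABLE}
WHERE NOT (event_date <=> '2021-01-01')
"""

# With an Iceberg table: state the intent directly.
DELETE_SQL = f"""
DELETE FROM {TABLE}
WHERE event_date = '2021-01-01'
"""


def delete_rows(spark, use_iceberg=True):
    """Run one of the two statements on an existing SparkSession.

    `spark` is assumed to be a SparkSession (in a Glue job, typically
    the session obtained from the GlueContext).
    """
    statement = DELETE_SQL if use_iceberg else INSERT_OVERWRITE_SQL
    return spark.sql(statement)
```

Depending on the table type, Spark may refuse to overwrite a table that the same statement reads from, so the INSERT OVERWRITE route can also require a staging table. The DELETE form avoids that detour.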

Iceberg can also potentially reduce AWS S3 costs and improve storage efficiency, since it can store only the changed (delta) data and metadata.

This post describes how to configure your AWS Glue job to use Iceberg in SparkSQL through some simple examples.

--

Robert Sanders
Software Sanders

Senior AVP of Data Management for EXL Services | Marathon Runner | Triathlete | Endurance Athlete