Convert existing Amazon S3 data lake tables to Snowflake Unmanaged Iceberg tables using AWS Glue

Contributor — James Sun, Partner Solutions Architect, Snowflake

Overview

Many companies have built their data lakes on Amazon S3, specifically for analytical tables using the parquet format. Snowflake can directly query these tables by mapping External Tables to the parquet data on S3, allowing these companies to combine data lake data with Snowflake-managed data. Apache Iceberg drastically enhances data lake capabilities with features like ACID transactions, rollback and time travel, managed partitioning, schema evolution and more. In addition, Snowflake’s support for an external Iceberg catalog enables you to tap into these Iceberg capabilities with the added benefits of simpler management and increased performance.

Rewriting the data into Iceberg format can be very costly and time-consuming for companies with large volumes of parquet data stored in their S3 data lakes. In this blog we will discuss how to leverage AWS Glue to convert S3 parquet data lake tables to Iceberg format by adding only the Iceberg metadata, without rewriting the actual parquet files. This can save significant time and cost while enabling Snowflake to tap into this data much more efficiently. AWS Glue also simplifies the process by providing a serverless environment that can scale for large operations.

Workflow Representation

Basic workflow steps

  1. AWS Glue reads the parquet table DDL and creates an empty Iceberg table entry in the AWS Glue Data Catalog
  2. AWS Glue scans the parquet files on S3 to collect metadata and other information
  3. AWS Glue writes the Iceberg Metadata and Manifest files to S3 and updates the AWS Glue Data Catalog Iceberg table entry
  4. Snowflake reads the Iceberg table metadata from the AWS Glue Data Catalog and creates an Iceberg table reference in Snowflake. This reference can easily be refreshed at any time
  5. At query time Snowflake uses the Iceberg metadata to optimize access to the parquet files in S3 for increased query performance

Upgrade S3 tables to Iceberg format using AWS Glue Studio Notebook

For the purposes of this blog we will assume the metadata for the S3 parquet tables is already in the AWS Glue Data Catalog (GDC). If it is not, the AWS Glue Crawler can be used to generate it. Also create a new AWS Glue database that will be used for the Iceberg tables; one way to do this from the notebook session itself is sketched after the Spark session is initialized below.

In the AWS Console open AWS Glue > ETL Jobs > Notebooks on the left menu then select Create job > Notebook.

Select the Spark(Python) engine and select an IAM role with sufficient privileges to update the GDC and read/write to the S3 location(s) where the parquet data is and where Iceberg metadata will be written.

This will create a Jupyter notebook session with pre-populated cells. We will create and run several cells to perform the table upgrade.

First we will configure the execution environment. Create a new notebook cell and paste the code below in it. You can select a larger number of workers for large data sets and also select more powerful AWS Glue Worker types (G.1X can be used as the default).

Note in each of the following notebook steps you will need to create a new cell, copy and edit the code as needed, then execute the cell in the notebook.

%session_id_prefix iceberg-upgrade-add-files-
%glue_version 4.0
%idle_timeout 300
%number_of_workers 2
%worker_type G.1X
%%configure
{
  "--conf": "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
  "--datalake-formats": "iceberg"
}

Next we will set the variables for the notebook. In this example we will write the Iceberg metadata to the same bucket and path where the parquet data files are located. We will, however, use a different GDC database and table name for the Iceberg table than for the parquet table.

Note: Please enter the correct values for your environment in the fields encapsulated by <>

catalog_name = "glue_catalog"
bucket_name = "<bucket name>" # your Amazon S3 bucket name for this post
bucket_prefix = "<bucket prefix path to the parquet files>"
database_name_par = "<GDC database name for the parquet files>"
database_name_ice = " <GDC database name for the iceberg tables> "
source_table_name = "<name of the source parquet table in the GDC>"
target_table_name = " <name of the Iceberg table that will be created> "
warehouse_path = f"s3://{bucket_name}/{bucket_prefix}"

Next we will initiate a Spark session.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config(f"spark.sql.catalog.{catalog_name}", "org.apache.iceberg.spark.SparkCatalog") \
    .config(f"spark.sql.catalog.{catalog_name}.warehouse", warehouse_path) \
    .config(f"spark.sql.catalog.{catalog_name}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config(f"spark.sql.catalog.{catalog_name}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .getOrCreate()
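
If the new GDC database for the Iceberg tables does not exist yet, one option is to create it from this session through the Iceberg catalog configured above. This is a minimal sketch using the variables set earlier.

# Optional: create the GDC database (Iceberg namespace) for the Iceberg tables if it does not already exist
query = f"""
CREATE NAMESPACE IF NOT EXISTS {catalog_name}.{database_name_ice}
"""

spark.sql(query)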

Next we’ll create 2 cells to verify the parquet data table can be read from the notebook.

%%sql
USE <GDC database name for the parquet files>

Next cell — Verify the table you want to convert is listed.

%%sql
SHOW TABLES

Next, drop the Iceberg table in case it already exists.

query = f"""
DROP TABLE IF EXISTS {catalog_name}.{database_name_ice}.{target_table_name}
"""

spark.sql(query)

Next we will create a temporary view to capture the table DDL from the existing parquet table.

This step avoids issues where the Spark engine may refuse to generate Iceberg DDL directly from parquet-based tables.

Also remember to replace the <> sections with your information.

%%sql
CREATE OR REPLACE TEMPORARY VIEW tmp_<temp_view_name> AS
SELECT * FROM <parquet_table_name> LIMIT 0

Now we can create the Iceberg table in GDC.

query = f"""
CREATE TABLE IF NOT EXISTS {catalog_name}.{database_name_ice}.{target_table_name}
USING iceberg
AS SELECT * FROM tmp_<temp_view_name>
"""
spark.sql(query)

The next step will read the parquet data files and generate the Iceberg metadata files, as well as update the GDC with the Iceberg information. This is considerably faster and less expensive than recreating all the data with a CTAS.

query = f"""
CALL {catalog_name}.system.add_files(table => '{database_name_ice}.{target_table_name}', source_table => '{database_name_par}.{source_table_name}')
"""

Lastly you can verify the Iceberg metadata by listing the snapshots and manifest location. The Iceberg table is now ready for use.

query = f"""
SELECT snapshot_id, manifest_list FROM {catalog_name}.{database_name_ice}.{target_table_name}.snapshots
"""

spark.sql(query).show(10, truncate=False)
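
Optionally, a quick sanity check (a sketch using the variables defined earlier) is to confirm that the row counts of the source parquet table and the new Iceberg table match.

# Optional sanity check: both queries should return the same number of rows
spark.sql(f"SELECT COUNT(*) AS parquet_count FROM {database_name_par}.{source_table_name}").show()
spark.sql(f"SELECT COUNT(*) AS iceberg_count FROM {catalog_name}.{database_name_ice}.{target_table_name}").show()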

Configure Snowflake to access the Iceberg table(s)

Snowflake can now access the Iceberg tables by performing a few simple steps: create an external volume that points to the S3 location, create a catalog integration for the AWS Glue Data Catalog, and then create the unmanaged Iceberg table in Snowflake that references the GDC entry.
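
A minimal sketch of these steps in Snowflake SQL is shown below. The object names are hypothetical, and the role ARNs, account ID, region and <> placeholders need to be replaced with values for your environment; the IAM roles also need the appropriate permissions and trust relationships, as described in the Snowflake documentation.

-- External volume pointing to the S3 location that holds the parquet data and Iceberg metadata
CREATE EXTERNAL VOLUME iceberg_ext_vol
  STORAGE_LOCATIONS = (
    (
      NAME = 'iceberg-s3-location'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = 's3://<bucket name>/<bucket prefix path to the parquet files>/'
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::<aws account id>:role/<role with S3 access>'
    )
  );

-- Catalog integration for the GDC database that holds the Iceberg tables
CREATE CATALOG INTEGRATION glue_catalog_int
  CATALOG_SOURCE = GLUE
  CATALOG_NAMESPACE = '<GDC database name for the iceberg tables>'
  TABLE_FORMAT = ICEBERG
  GLUE_AWS_ROLE_ARN = 'arn:aws:iam::<aws account id>:role/<role with Glue access>'
  GLUE_CATALOG_ID = '<aws account id>'
  GLUE_REGION = '<aws region>'
  ENABLED = TRUE;

-- Unmanaged Iceberg table in Snowflake that references the Iceberg table entry in the GDC
CREATE ICEBERG TABLE <snowflake table name>
  EXTERNAL_VOLUME = 'iceberg_ext_vol'
  CATALOG = 'glue_catalog_int'
  CATALOG_TABLE_NAME = '<name of the Iceberg table that will be created>';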

The Iceberg tables can now be queried in Snowflake with the performance and management benefits of the Iceberg table format.

To refresh the metadata in Snowflake for the Iceberg table simply run a refresh.
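
For example, using the hypothetical table name from the sketch above:

-- Refresh the Snowflake Iceberg table metadata from the AWS Glue Data Catalog
ALTER ICEBERG TABLE <snowflake table name> REFRESH;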

Conclusion

Using the Iceberg format for data lake tables offers many benefits as discussed in this blog, including ease of use with Snowflake and improved performance compared to parquet-based external tables. The advantage of converting the data in place, as opposed to rewriting all the data, is that it reduces the time and cost of the table format conversion; however, the historical parquet files may not be as well optimized for performance.

There are several considerations to keep in mind when using the process described in this blog to convert data lake tables to Iceberg format. First, the tables cannot be changed or updated by a data pipeline while the conversion is in progress, as that would invalidate the generated Iceberg metadata. Second, once the table format is converted to Iceberg, data pipelines must write to the new Iceberg tables as their destination, which requires the pipeline tools to support Iceberg. Various AWS services, like AWS Glue, already provide full support for data pipelines using the Iceberg format.

Andries Engelbrecht is a Principal Partner Solution Architect at Snowflake working with strategic partners.