Creating and managing Apache Iceberg tables using serverless features, without coding

“Goals are about the results you want to achieve. Systems are about the processes that lead to results.” — James Clear, Atomic Habits

With the modern data explosion and the need to derive insights and value from all that data, one thing is clear: a traditional data lake alone isn’t enough. You need a data solution that can deliver analytics and intelligence, and it needs to scale and stay cost-effective at the same time.

One way to manage the data explosion is to use a technology that supports large volumes of data and can add structure to it: a big data platform. The basic building blocks of this architecture are file formats and table formats. There are many performant, open-source file formats like Apache Parquet and Avro, but we need something that doesn’t rely on the physical file or its folder location and yet is able to deliver high performance and analytics at scale.

A few open-source technologies allow us to build effective tables and catalogs without the side effects (painful partition evolution, data inconsistency, etc.) that we see in older big data technologies such as Hive and Hadoop. Therefore, many people seek to build their modern data architecture with new open-source technologies that can serve large-scale analytics.

In my quest for a scalable data architecture and a performant table format, I came across Apache Iceberg tables and wanted to try a few things quickly and cost-efficiently. Iceberg tables provide good query performance and metadata management over very large analytical data sets. So, I built an Iceberg table using AWS serverless features with just enough engineering to let me play with Apache Iceberg. Basic knowledge of SQL is all you need to complete this lab.

Creating an Apache Iceberg table with just SQL

What is Apache Iceberg?

In a few words, it is an open table format that supports huge analytical workloads. Plus, many industry-leading analytical and cloud platforms, like Snowflake, AWS, Spark, Trino, and Presto, have signaled they will support the format. Read more in the Apache Iceberg documentation.

Today, we are going to build an Apache Iceberg table on AWS using only three services: AWS S3, AWS Glue, and AWS Athena. If you are not familiar with the latter two, don’t worry! We are going to use the GUI of AWS Glue Studio, and we will only use SQL code to build and query objects, just as in any other database and its tables.

Let’s get started.

Data Source

AWS S3 hosts many free, open data sets shared with consent from their providers. One such data set is the NYC taxi data, which you can read more about here.

We will make use of the yellow taxi data set, so open your free (or paid) AWS account and go to the S3 bucket below. (Use the AWS US-EAST-1 (N. Virginia) region to host all services in this demo.)

https://s3.console.aws.amazon.com/s3/buckets/nyc-tlc

Create your own bucket in US-EAST-1 (N. Virginia) and copy the “yellow_tripdata_####” files from the nyc-tlc/csv_backup folder to your bucket, where #### is 2019, 2020, and 2021 in this example.

Copy data source

Update your bucket name and prefix (folder name), leave everything else at the defaults, and submit the request.

Follow the same steps to copy the yellow taxi 2020 and 2021 data into your bucket.

Create Source table using AWS Glue

We will now use AWS Glue Studio to build a table around these CSV files so we can easily query the raw data. This table will be saved as a Glue Data Catalog (Hive-style) table but can be queried with multiple AWS services.

Open AWS Glue by typing Glue in the top search bar.

Click on Databases under the Data Catalog section, then click Add database and provide “yellowtaxi_db” as the name.
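If you prefer SQL over the console for this step, the same Glue database can be created from the Athena query editor; a one-line alternative, assuming Athena is already available in your account:

CREATE DATABASE IF NOT EXISTS yellowtaxi_db;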

AWS Glue can infer the schema of the CSV files to form a table; this feature is called a crawler. The crawler detects column data types, but some, such as the timestamp columns, might be left as strings. We can correct this after the table is built.
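If you would rather skip the crawler entirely, you can register the CSV files yourself from the Athena query editor. A minimal sketch, assuming the files were copied to a yellow-taxi-trips prefix in your bucket and that, like the crawler output, the timestamp columns start life as strings:

CREATE EXTERNAL TABLE raw_yellow_taxi_trips (
  vendorid bigint,
  tpep_pickup_datetime string,
  tpep_dropoff_datetime string,
  passenger_count bigint,
  trip_distance double,
  ratecodeid bigint,
  store_and_fwd_flag string,
  pulocationid bigint,
  dolocationid bigint,
  payment_type bigint,
  fare_amount double,
  extra double,
  mta_tax double,
  tip_amount double,
  tolls_amount double,
  improvement_surcharge double,
  total_amount double,
  congestion_surcharge double)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://[yourS3Bucket]/yellow-taxi-trips/'
TBLPROPERTIES ('skip.header.line.count' = '1');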

Click on the “yellowtaxi_db” database that we created earlier and then click on “Add tables using crawler.”

In the next screen click on “Add a data source.”

Provide the S3 URI path for the taxi trips data you copied from the nyc-tlc bucket.

In the next screen, click on “Create new IAM role” and note that the role automatically creates required permissions for the S3 source bucket/folder.

Then, choose your database and table name prefix.

Review, create the crawler, and then click on “Run crawler” to create the table.

Go back to your database to check that the table was created.

Check the schema, change the pickup and drop-off columns to timestamp, and save the changes. Skip this step if the schema is already correct.
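Alternatively, if you leave those columns as strings, you can cast them at query time. A quick check, assuming the crawler produced a table named raw_yellow_taxi_trips (the name used later in this lab) and the usual yyyy-MM-dd HH:mm:ss value layout:

select cast(tpep_pickup_datetime as timestamp) as pickup_ts, cast(tpep_dropoff_datetime as timestamp) as dropoff_ts from raw_yellow_taxi_trips limit 5;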

Create Iceberg table using AWS Athena (Serverless)

Now that we have added our source data to the Glue table, let’s build an Iceberg table using AWS Athena, which supports Apache Iceberg tables natively. It’s a serverless service, so we don’t need to worry about running our own infrastructure, resources, Spark configuration, etc.

Let’s first create an Iceberg table using a simple create statement. I am going to use my default (primary) workgroup. In some AWS regions, you might need to create a dedicated workgroup while Iceberg support is in preview. (I am using the AWS US-EAST-1 (N. Virginia) region to host all my services.)

Choose your catalog and database. Now, let’s run our first query; here, “raw_yellow_taxi_trips” is the table created using Glue above.

select count(*) from raw_yellow_taxi_trips;

Next, let’s create our Iceberg table; note that there are a few table properties for optimization that are not covered here. You can explore those in another experiment if you wish.

CREATE TABLE iceberg_yellow_taxi_parquet (
  vendorid bigint,
  tpep_pickup_datetime timestamp,
  tpep_dropoff_datetime timestamp,
  passenger_count bigint,
  trip_distance double,
  ratecodeid bigint,
  store_and_fwd_flag string,
  pulocationid bigint,
  dolocationid bigint,
  payment_type bigint,
  fare_amount double,
  extra double,
  mta_tax double,
  tip_amount double,
  tolls_amount double,
  improvement_surcharge double,
  total_amount double,
  congestion_surcharge double)
LOCATION 's3://[yourS3Bucket]/yellow-taxi-trips-iceberg/'
TBLPROPERTIES ('table_type' = 'ICEBERG');

SQL to create table
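To sanity-check what Athena registered, you can echo the table definition back (the exact output varies by Athena engine version):

SHOW CREATE TABLE iceberg_yellow_taxi_parquet;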

Let’s check the S3 objects in the table location.

Insert the raw data into the shiny new Iceberg table we created.

insert into iceberg_yellow_taxi_parquet select * from raw_yellow_taxi_trips;

The new Iceberg table now has the same number of records (~124M).
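You can verify the count yourself with the same query we ran against the raw table:

select count(*) from iceberg_yellow_taxi_parquet;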

Check the bucket for new Iceberg table objects after the load.
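Beyond browsing S3, newer Athena engine versions can also query Iceberg metadata tables directly. A hedged example, assuming your engine version exposes the $files metadata table:

SELECT file_path, record_count, file_size_in_bytes FROM "iceberg_yellow_taxi_parquet$files" LIMIT 10;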

Iceberg DML commands and Time travel

It’s time to try updating the data in this Iceberg table using simple SQL. First, I want to find long-distance trips where riders didn’t tip.

select * from iceberg_yellow_taxi_parquet where year(tpep_pickup_datetime) = 2021 AND trip_distance > 10 AND passenger_count > 3 AND tip_amount = 0 AND fare_amount <> 0 limit 10;

Update the tip amount for the above results to 15% of the fare.

update iceberg_yellow_taxi_parquet set tip_amount = (fare_amount * 0.15) where year(tpep_pickup_datetime) = 2021 AND trip_distance > 10 AND passenger_count > 3 AND tip_amount = 0 AND fare_amount <> 0;

Yes, I feel like Robin Hood.

But oh no, I just updated the tips and goofed up my total amount, which is supposed to reflect the grand total of all charges. Well, not quite, because Apache Iceberg provides time travel, which means I can go back and check an older version of the data.

SELECT * FROM iceberg_yellow_taxi_parquet

FOR SYSTEM_TIME AS OF (current_timestamp - interval '1' hour)

WHERE year(tpep_pickup_datetime) = 2021 AND trip_distance > 10 AND passenger_count > 3 AND tip_amount = 0 AND fare_amount != 0 LIMIT 10;

Note the FOR SYSTEM_TIME AS OF clause that lets you retrieve data from the past. Very data-engineering friendly!
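Time travel also works by snapshot rather than wall-clock time. A sketch, assuming your engine version exposes the $history metadata table and the FOR SYSTEM_VERSION AS OF clause (newer Athena engine versions spell these FOR TIMESTAMP AS OF and FOR VERSION AS OF); [snapshotId] is a placeholder for an ID returned by the first query:

SELECT * FROM "iceberg_yellow_taxi_parquet$history";

SELECT count(*) FROM iceberg_yellow_taxi_parquet FOR SYSTEM_VERSION AS OF [snapshotId];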

Iceberg table Schema Evolution

Iceberg schema updates are metadata changes, so no data files need to be rewritten to perform the update. Iceberg supports column add, drop, rename, update, and reorder operations, as well as certain data type promotions (such as float to double).

The current table looks like this:

select * from iceberg_yellow_taxi_parquet limit 10;

In the examples below, we will rename the fare_amount and trip_distance columns, move the distance column after the fare column, add a new fare_per_mile column, and promote its type from float to double.

ALTER TABLE iceberg_yellow_taxi_parquet CHANGE COLUMN fare_amount fare double;

ALTER TABLE iceberg_yellow_taxi_parquet CHANGE COLUMN trip_distance distance double
COMMENT 'The elapsed trip distance in miles reported by the taximeter.' AFTER fare;

ALTER TABLE iceberg_yellow_taxi_parquet ADD COLUMNS (fare_per_mile float);

ALTER TABLE iceberg_yellow_taxi_parquet CHANGE COLUMN fare_per_mile fare_per_mile double COMMENT 'taxi meter per mile fare' AFTER distance;

-- Let's update the fare_per_mile value to fare divided by distance

UPDATE iceberg_yellow_taxi_parquet SET fare_per_mile = fare/distance;
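One caveat: trips recorded with a distance of zero make this division misbehave. A safer variant, assuming you are happy to leave fare_per_mile null for those rows:

UPDATE iceberg_yellow_taxi_parquet SET fare_per_mile = fare/distance WHERE distance > 0;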

Optimizing Iceberg tables

As data accumulates in an Iceberg table, queries gradually become slower. There are a few techniques to optimize for either faster reads or faster writes. To help optimize the performance of queries on Iceberg tables, Athena supports manual compaction.

Two commands that help us achieve that are:

OPTIMIZE

It rewrites data files into a more optimized layout and merges delete files (more details here), targeting a file size set by write_target_data_file_size_bytes.

OPTIMIZE iceberg_yellow_taxi_parquet REWRITE DATA USING BIN_PACK;

VACUUM

It expires old snapshots and removes the files that are no longer needed. Let’s set the maximum snapshot age to half a day, i.e., 43,200 seconds (there are 86,400 seconds in one day).

ALTER TABLE iceberg_yellow_taxi_parquet SET TBLPROPERTIES ('vacuum_max_snapshot_age_seconds' = '43200');

Then just run:

VACUUM iceberg_yellow_taxi_parquet;
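To see the effect, you can list the snapshots that remain; a quick check, assuming your engine version exposes the $snapshots metadata table:

SELECT snapshot_id, committed_at, operation FROM "iceberg_yellow_taxi_parquet$snapshots";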

Summary

We learned how easy it is to build an Apache Iceberg table using serverless features from one of the major cloud providers. You don’t need prior knowledge of Spark or other similar tools to build a cloud data lake or warehouse using the open-source Iceberg table format.

We also learned how easy it is to manage and maintain these tables using simple SQL statements, ALTER statements, and table properties.

Iceberg tables provide ACID transaction guarantees and strong performance for very large data sets, along with a well-defined abstraction for your data lake that can be operated at scale.

Next, here are a few exercises for you to try with your new data set:

  • Try adding other historical data like 2017, 2018, etc., to this yellow taxi trips table.
  • Remember, we updated the tips without adding it to the total amount. This made the total wrong. Try fixing this using time travel. (Hint: you need to add 15% of the fare to the total as well.)
  • Explore various table properties and stored procedures available in Apache Iceberg here.
  • Learn about various table versions here.

REFERENCES:

https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg.html

https://iceberg.apache.org/spec/

https://aws.amazon.com/glue/
