An Overview of Snowflake Apache Iceberg Tables

Come Learn with Me as I Experiment with Snowflake’s Latest Preview Feature: Apache Iceberg Tables


If you have not picked up on it by now, my background is not really in data. I have had a unique and quite diverse career, so I often use articles like this one as a way to learn about a topic: if I am going to research and learn, I might as well write about it. My articles may start out basic, although I try to relate them to more advanced topics.

The format in which we store data and its related metadata shapes how much we can get out of that data, and it touches many aspects of the modern data stack.

What is Apache Iceberg?

Iceberg is an open-source table format hosted by the Apache Software Foundation. It started as a Netflix OSS project and eventually moved to the community. From what I can see, it is genuinely a high-performance format for very large analytic tables.

Last week, I saw some comparisons of native Snowflake tables vs Snowflake Iceberg tables, and although native tables still perform better, the results are quite close. Iceberg gives data lakes a single format that multiple systems can use, so Snowflake, Databricks, and engines like Spark, Trino, Flink, Presto, Hive, and Impala can safely work with the same tables at the same time.

It has neat features like:

  • Expressive SQL: Iceberg supports flexible SQL commands to merge new data, update existing rows, and perform targeted deletes (see the sketch after this list). Iceberg can eagerly rewrite data files for read performance or use delete deltas for faster updates.
  • Full Schema Evolution: schema evolution just works. Adding a column won’t bring back old or deleted data, columns can be renamed and reordered, and schema changes never require rewriting your table.
  • Hidden Partitioning: Iceberg produces partition values for the rows in a table and skips unnecessary partitions and files automatically. No extra filters are needed for fast queries, and the table layout can be updated as data or queries change.
  • Time Travel and Rollback: time travel enables reproducible queries and lets users easily examine changes, while version rollback lets users quickly correct problems by resetting tables to a good state.
  • Data Compaction: compaction is supported out of the box, with bin-packing or sorting to optimize file layout and size.
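
To make a couple of these features concrete, here is a minimal sketch in Snowflake SQL against a made-up Iceberg table called orders_iceberg and a staging table orders_staging (the names and columns are placeholders for illustration, not part of the lab):

-- Expressive SQL: upsert new and changed rows with MERGE
MERGE INTO orders_iceberg t
USING orders_staging s
    ON t.o_orderkey = s.o_orderkey
WHEN MATCHED THEN UPDATE SET t.o_totalprice = s.o_totalprice
WHEN NOT MATCHED THEN INSERT (o_orderkey, o_totalprice)
    VALUES (s.o_orderkey, s.o_totalprice);

-- Schema evolution: add a column without rewriting the table
ALTER ICEBERG TABLE orders_iceberg ADD COLUMN o_channel STRING;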

Snowflake Apache Iceberg

First things first: Iceberg Tables always use customer-controlled external storage, like AWS S3 or Azure Blob Storage.

Snowflake supports Iceberg Tables in two ways: with an internal catalog (a Snowflake-managed catalog) or with an externally managed catalog (such as AWS Glue, or Iceberg metadata files in object storage).

Iceberg Tables: Snowflake-managed catalog

An Iceberg Table that uses a Snowflake-managed catalog offers nearly identical performance to a regular Snowflake table and has the following characteristics:

  • Snowflake reads/writes
  • Iceberg interoperability
  • Full platform support
  • Performance optimized

Iceberg Tables: Externally managed catalog

With an externally managed catalog, such as AWS Glue, or with Iceberg metadata files stored directly in object storage, you can create tables that have the following characteristics (a rough sketch follows the list):

  • Flexible sources
  • Efficient onboarding
  • Simplified operations
  • Performance optimized
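
As a sketch of this externally managed route, registering an AWS Glue catalog and creating an Iceberg table from it could look roughly like this (the integration name, Glue database, catalog ID, role ARN, and table name are placeholders; check the Snowflake documentation for the exact options your setup needs):

-- Register the AWS Glue Data Catalog as a catalog integration
CREATE CATALOG INTEGRATION glue_catalog_int
    CATALOG_SOURCE = GLUE
    CATALOG_NAMESPACE = 'your_glue_database'
    TABLE_FORMAT = ICEBERG
    GLUE_AWS_ROLE_ARN = 'arn:aws:iam::XXXXXXXXX:role/yourgluerole'
    GLUE_CATALOG_ID = 'XXXXXXXXX'
    ENABLED = TRUE;

-- Create an Iceberg table that Snowflake can read but does not manage
CREATE ICEBERG TABLE customer_external
    EXTERNAL_VOLUME = 'iceberg_lab_vol'
    CATALOG = 'glue_catalog_int'
    CATALOG_TABLE_NAME = 'customer';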

Practical Learning

I used a Snowflake lab to learn, and here are the scripts I practiced with; the SQL comes from the Snowflake lab, so the credit for it is not mine.

What stood out to me is how neatly Iceberg keeps track of every change along the way.

-- Lab starting point: create the required lab objects
CREATE WAREHOUSE iceberg_lab;
CREATE ROLE iceberg_lab;
CREATE DATABASE iceberg_lab;
CREATE SCHEMA iceberg_lab;
GRANT ALL ON DATABASE iceberg_lab TO ROLE iceberg_lab WITH GRANT OPTION;
GRANT ALL ON SCHEMA iceberg_lab.iceberg_lab TO ROLE iceberg_lab WITH GRANT OPTION;
GRANT ALL ON WAREHOUSE iceberg_lab TO ROLE iceberg_lab WITH GRANT OPTION;

CREATE USER iceberg_lab
PASSWORD='PASSWORD',
LOGIN_NAME='ICEBERG_LAB',
MUST_CHANGE_PASSWORD=FALSE,
DISABLED=FALSE,
DEFAULT_WAREHOUSE='ICEBERG_LAB',
DEFAULT_NAMESPACE='ICEBERG_LAB.ICEBERG_LAB',
DEFAULT_ROLE='ICEBERG_LAB';

GRANT ROLE iceberg_lab TO USER iceberg_lab;
GRANT ROLE iceberg_lab TO USER YOURUSERNAME;
GRANT ROLE accountadmin TO USER iceberg_lab;

USE ROLE accountadmin;
ALTER USER iceberg_lab SET rsa_public_key='Mxxxxxxiiiixxmmm..';

-- Create an external volume after completing the AWS setup and retrieving the Snowflake AWS IAM information
CREATE OR REPLACE EXTERNAL VOLUME iceberg_lab_vol
STORAGE_LOCATIONS =
(
    (
        NAME = 'youriceberglab'
        STORAGE_PROVIDER = 'S3'
        STORAGE_BASE_URL = 's3://youriceberglab/datalake/'
        STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::XXXXXXXXX:role/yournowflakerole'
    )
);

-- Find STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID and update your IAM role trust policy
DESC EXTERNAL VOLUME iceberg_lab_vol;

-- Grant the iceberg_lab role access to the external volume
USE ROLE accountadmin;
GRANT ALL ON EXTERNAL VOLUME iceberg_lab_vol TO ROLE iceberg_lab WITH GRANT OPTION;
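
Before creating tables, I believe you can also sanity-check the volume from Snowflake; this system function returns a JSON result describing whether Snowflake can actually read and write the storage location (assuming the volume name from above):

-- Verify that Snowflake can access the S3 location behind the external volume
SELECT SYSTEM$VERIFY_EXTERNAL_VOLUME('iceberg_lab_vol');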

-- Create a Snowflake-managed Iceberg Table
USE ROLE iceberg_lab;
USE DATABASE iceberg_lab;
USE SCHEMA iceberg_lab;
CREATE OR REPLACE ICEBERG TABLE customer_iceberg (
    c_custkey INTEGER,
    c_name STRING,
    c_address STRING,
    c_nationkey INTEGER,
    c_phone STRING,
    c_acctbal INTEGER,
    c_mktsegment STRING,
    c_comment STRING
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'iceberg_lab_vol'
BASE_LOCATION = '';


-- Loading Data
INSERT INTO customer_iceberg
SELECT * FROM snowflake_sample_data.tpch_sf1.customer;

-- Sample query
SELECT *
FROM customer_iceberg c
INNER JOIN snowflake_sample_data.tpch_sf1.nation n
    ON c.c_nationkey = n.n_nationkey;

-- Time travel example: insert a few more rows, then compare row counts before and after
INSERT INTO customer_iceberg
SELECT *
FROM snowflake_sample_data.tpch_sf1.customer
LIMIT 5;

SELECT
    count(*) AS after_row_count,
    before_row_count
FROM customer_iceberg
JOIN (
    SELECT count(*) AS before_row_count
    FROM customer_iceberg BEFORE (statement => LAST_QUERY_ID())
)
ON 1 = 1
GROUP BY 2;


-- Clean up (drop the table before the database and external volume it depends on)
DROP ICEBERG TABLE customer_iceberg;
DROP DATABASE iceberg_lab;
DROP WAREHOUSE iceberg_lab;
DROP USER iceberg_lab;
DROP EXTERNAL VOLUME iceberg_lab_vol;

Iceberg Table Conversion

In practical terms, the difference between a managed and an unmanaged Iceberg table is whether you can write data from inside Snowflake: a managed table is read-write, while an unmanaged (externally managed) table is read-only from Snowflake.

Converting is quick: refresh the table’s metadata beforehand, then run this command:

ALTER ICEBERG TABLE customer_iceberg CONVERT TO MANAGED
BASE_LOCATION = CustomerBaseLocation;
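
For the refresh step that precedes the conversion, on an unmanaged table it looks roughly like this (a sketch; the metadata file path is a placeholder for the snapshot you want to sync to, and tables managed by an external catalog can be refreshed without a path):

-- Sync the unmanaged table with the latest Iceberg metadata before converting
ALTER ICEBERG TABLE customer_iceberg REFRESH 'metadata/v2.metadata.json';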

Why is it Important?

I feel that in the world of data, large companies with large data sets can find it challenging to standardize on a single tool; some features simply do not work as well in one tool as in another. Many companies keep more than one tool in the picture to meet the business needs of every use case. This is where a common data lake format becomes very helpful: the data can be shared across all the tools whenever it is needed.

We also need to look at how data lakes are typically set up now: usually in object storage like AWS S3, Azure Blob Storage, or Google Cloud Storage, rather than the Hadoop clusters that were more common a few years ago. That shift brings challenges around metadata management, schema management, and data catalogs. Iceberg addresses them by allowing direct and concurrent updates, with every tool treated as an equal.

Let’s face it — Iceberg also has much bigger community support with the likes of Netflix, Apple, and the Apache community.

I found this comparison of Iceberg against other formats:

https://www.dremio.com/resources/guides/apache-iceberg/

Here is how the files look in an S3 bucket: there is always a data folder containing the data files in Parquet format, and a metadata folder containing the metadata files and manifest files.
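
As a rough illustration of that layout (the bucket, folder, and file names below are invented; the actual names are generated by the engine):

s3://youriceberglab/datalake/customer_iceberg/
    data/                      (Parquet data files)
        snow_xxxx_001.parquet
        snow_xxxx_002.parquet
    metadata/                  (Iceberg metadata)
        v1.metadata.json       (table metadata file)
        snap-1234567890.avro   (manifest list for a snapshot)
        abcd1234-m0.avro       (manifest file)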

Conclusion

I hope this article gives you a good overview of how Apache Iceberg works and how Snowflake implements it; there are quite a few good real-world use cases.

For one, you can share the same data between different systems, like Databricks and Snowflake. See another Medium article about this.

Apache Iceberg was built fully open source from the ground up, and it is engine agnostic and file-format agnostic.

https://hubs.ly/Q02gk4Q50

I’m Augusto Rosa, VP of Engineering for Infostrux Solutions. Thanks for reading my blog post. You can follow me on LinkedIn or through my blog posts.

Subscribe to Infostrux Medium Blogs https://medium.com/infostrux-solutions for the most interesting Data Engineering and Snowflake news.
