Iceberg Tables Using Snowflake

Vishal Garg
6 min read · Mar 1, 2024


What is an Iceberg table?

An Iceberg table in Snowflake is a type of table built on the open Apache Iceberg table format and optimized for handling large volumes of data, especially for analytical workloads. Table data and metadata live in your own external cloud storage, while Snowflake uses the Iceberg metadata (together with techniques such as predicate pushdown and partition pruning) to manage how data is stored and queried based on the structure of the table and the queries executed against it.

How does it work?

Iceberg tables in Snowflake store data as Apache Parquet files, a columnar format that compresses and organizes the data efficiently. Combined with Iceberg's metadata (snapshots, manifests, and file-level statistics), this enables faster query performance and reduces the amount of data scanned during queries, thereby improving overall efficiency.

Key features:

  1. Metadata Separation: Iceberg tables separate metadata from data, allowing for more efficient querying by only accessing relevant data.
  2. Incremental Data Updates: Iceberg tables support efficient updates to existing data by using an append-only model, which can be particularly useful in scenarios where you frequently add new data.
  3. Schema Evolution: Iceberg tables provide support for evolving schemas over time without requiring a full table rewrite, making it easier to adapt to changing data structures (see the example after this list).
  4. Partitioning and Clustering: Like standard Snowflake tables, Iceberg tables support partitioning and clustering, which further enhances query performance by organizing data based on specified criteria.
  5. Query Performance: By minimizing the amount of data scanned during queries and leveraging efficient storage mechanisms, Iceberg tables can significantly improve query performance, especially for large-scale analytics.
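
To make the schema evolution point concrete, here is a minimal, hypothetical sketch (the table and column names are made up for illustration, and the DDL assumes a Snowflake-managed Iceberg table like the ones created later in this post):

-- Hypothetical example: add a column to an existing Iceberg table.
-- Iceberg records this as a metadata change; existing data files are not rewritten.
ALTER ICEBERG TABLE customer_iceberg ADD COLUMN signup_date DATE;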

Use cases:

  • Big Data Analytics: Iceberg tables are well-suited for handling large volumes of data commonly encountered in big data analytics scenarios.
  • Data Warehousing: They are often used in data warehousing environments where optimizing query performance and managing large datasets are critical requirements.
  • Data Lake Integration: Iceberg tables can also be integrated with data lakes to provide structured access to semi-structured or unstructured data stored in cloud storage platforms like Amazon S3 or Azure Data Lake Storage.

How to Implement It in Different Ways

The first thing required to get an Iceberg table underway is an external volume. An external volume is a named, account-level Snowflake object that stores an identity and access management (IAM) entity for your external cloud storage. Snowflake uses the external volume to connect securely to your cloud storage and access table data, Iceberg metadata, and manifest files that store the table schema, partitions, and other metadata.

CREATE OR REPLACE EXTERNAL VOLUME exvol
  STORAGE_LOCATIONS =
  (
    (
      NAME = 'eu-west-1'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = 's3://<bucket name>/'
      STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/myrole'
      STORAGE_AWS_EXTERNAL_ID = '<name of external ID>'
    )
  );

How to get STORAGE_AWS_ROLE_ARN and STORAGE_AWS_EXTERNAL_ID?

desc STORAGE INTEGRATION <name of the storage integration>;

If you already have a storage integration configured for the same bucket, the statement above returns both of the required values.
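
After creating the external volume itself, you can also describe it; the output includes the STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID values that go into the IAM role's trust policy on the AWS side, and Snowflake provides a system function to verify the setup end to end. A minimal check, assuming the exvol volume created above:

DESC EXTERNAL VOLUME exvol;
-- Verifies that Snowflake can access the volume's storage location
SELECT SYSTEM$VERIFY_EXTERNAL_VOLUME('exvol');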

We also have to perform the below two steps in AWS. Please follow this link to achieve the same:

https://docs.snowflake.com/en/user-guide/tables-iceberg-configure-external-volume#label-tables-iceberg-configure-external-volume-s3

  1. Configure access permissions for the S3 bucket.
  2. Create an IAM role in AWS.

Create a Snowflake-Managed Iceberg Table

Please follow the below syntax to create, insert into, select from, and drop the Iceberg table. It works pretty much the same as an ordinary Snowflake table. The important point to note here is that this is a Snowflake-managed Iceberg table, where the metadata is managed by Snowflake.

CREATE OR REPLACE ICEBERG TABLE customer_iceberg_test_13 (
  NAME varchar,
  TYPE varchar,
  SUBSCRIPTION varchar
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'exvol'
BASE_LOCATION = '';

insert into customer_iceberg_test_13 values ('vishal', 'Architect', 'aws dev');
select * from customer_iceberg_test_13;
drop table customer_iceberg_test_13;
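
Snowflake-managed Iceberg tables also support Time Travel queries just like regular Snowflake tables. A minimal sketch (run it before the DROP above; the five-minute offset is purely illustrative):

-- Query the table as it existed five minutes ago (illustrative offset)
select * from customer_iceberg_test_13 at(offset => -60*5);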

Now consider a scenario where you have an existing table over data in S3 (for example, an external table) and you want to convert it to an Iceberg table; the below DDL does exactly that.

CREATE OR REPLACE ICEBERG TABLE TEST_DATA_ib
  WITH EXTERNAL_VOLUME = 'exvol'
  CATALOG = 'SNOWFLAKE'
  BASE_LOCATION = 'mylocation'
AS SELECT name, type FROM TEST_DATA;

Here the Iceberg table is created using Snowflake-managed metadata (the Snowflake catalog), and BASE_LOCATION specifies the directory under the external volume's STORAGE_BASE_URL in S3 where the metadata and table data will reside.

Create an AWS Glue-Managed Iceberg Table

This is another way of creating the table in Snowflake, using an external catalog as the source of metadata. Assume a scenario where you have .parquet files in your S3 location that you want to convert to an Iceberg table so that you can use Snowflake compute to analyze the data. The following steps need to be performed to get the desired result.

  1. An AWS Glue catalog integration object needs to be created inside Snowflake.

CREATE OR REPLACE CATALOG INTEGRATION glueCatalogInt
  CATALOG_SOURCE = GLUE
  CATALOG_NAMESPACE = 'default'
  TABLE_FORMAT = ICEBERG
  GLUE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/my_role'
  GLUE_CATALOG_ID = '123456789012'
  GLUE_REGION = 'eu-west-1'
  ENABLED = TRUE;

DESCRIBE CATALOG INTEGRATION glueCatalogInt;

Once the integration is created, you can check it using the DESCRIBE command shown above.

2. I have used a Glue Crawler to register the S3 files as a Parquet table inside the Glue Data Catalog (GDC). Remember that, in the Glue Data Catalog, we ultimately need an Iceberg table so that its metadata can be read through the integration above. At this stage you can already read your .parquet files using AWS Athena.

(Screenshot: file converted to a Parquet table in the Glue Data Catalog)

3. The next step is to convert this table to Iceberg format so that Snowflake can use it. The previous step is optional if you have a lakehouse and already have Parquet tables. I have used a Glue notebook to do this conversion cell by cell.

Note: in each of the following notebook steps, you will need to create a new cell, copy and edit the code as needed, and then execute the cell in the notebook.

%session_id_prefix iceberg-upgrade-add-files-
%glue_version 4.0
%idle_timeout 300
%number_of_workers 2
%worker_type G.1X
%%configure
{
"--conf": "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
"--datalake-formats": "iceberg"
}

Next we will set the variables for the notebook.

catalog_name = "glue_catalog"
bucket_name = "<bucket name>"  # your Amazon S3 bucket name for this post
bucket_prefix = "<bucket prefix path to the parquet files>"
database_name_par = "<GDC database name for the parquet files>"
database_name_ice = "<GDC database name for the iceberg tables>"
source_table_name = "<name of the source parquet table in the GDC>"
target_table_name = "<name of the Iceberg table that will be created>"
warehouse_path = f"s3://{bucket_name}/{bucket_prefix}"  # Iceberg warehouse location

Next we will initiate a Spark session.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config(f"spark.sql.catalog.{catalog_name}", "org.apache.iceberg.spark.SparkCatalog") \
    .config(f"spark.sql.catalog.{catalog_name}.warehouse", warehouse_path) \
    .config(f"spark.sql.catalog.{catalog_name}.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config(f"spark.sql.catalog.{catalog_name}.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .getOrCreate()

Let us create a temporary view on 100 rows for testing, in case your Parquet table has millions of rows. The Parquet table name is shown in the screenshot from point 2 above.

%%sql
CREATE OR REPLACE TEMPORARY VIEW tmp_<temp_view_name> AS
SELECT * FROM <parquet_table_name> LIMIT 100
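
Note: the add_files procedure in the next step expects the target Iceberg table to already exist in the Glue Data Catalog. If it does not, one way to create it (a sketch, not the only option) is a CTAS with LIMIT 0 in another %%sql cell, reusing the placeholder names from above so that only the schema is copied from the temporary view:

%%sql
CREATE TABLE glue_catalog.<GDC database name for the iceberg tables>.<name of the Iceberg table that will be created>
USING iceberg
AS SELECT * FROM tmp_<temp_view_name> LIMIT 0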

The next step will read the parquet data files and generate the Iceberg metadata files, as well as update the GDC with the Iceberg information.

query = f"""
CALL {catalog_name}.system.add_files(table => '{database_name_ice}.{target_table_name}', source_table => '{database_name_par}.{source_table_name}')
"""

spark.sql(query).show(truncate=False)

Lastly, you can verify the Iceberg metadata by listing the snapshots and the manifest list location. The Iceberg table is now ready for use.

query = f"""
SELECT snapshot_id, manifest_list FROM {catalog_name}.{database_name_ice}.{target_table_name}.snapshots
"""

spark.sql(query).show(10, truncate=False)

4. Now we have the Iceberg table in the GDC. We are all set to create the table inside Snowflake.

CREATE OR REPLACE ICEBERG TABLE myGlueTable
  EXTERNAL_VOLUME = 'exvol'
  CATALOG = 'glueCatalogInt'
  CATALOG_TABLE_NAME = '<table name from point 2 from GDC>';

5. Finally, the Iceberg table is ready inside Snowflake for querying and analytics (note that Iceberg tables that use an external catalog such as Glue are read-only in Snowflake).
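
A quick sanity check from Snowflake, assuming the table name used above:

SELECT COUNT(*) FROM myGlueTable;
SELECT * FROM myGlueTable LIMIT 10;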

Conclusion

In conclusion, Snowflake’s Iceberg tables, whether internally or externally managed, offer a powerful solution for organizations seeking to optimize the management and querying of large datasets. Internally managed Iceberg tables provide seamless integration within Snowflake’s ecosystem, offering features such as metadata separation, incremental data updates, schema evolution, and efficient query performance. These capabilities enable organizations to conduct big data analytics, streamline data warehousing processes, support real-time data analysis, and explore historical data with ease and efficiency.

Externally managed Iceberg tables extend the benefits of Snowflake’s data platform by enabling integration with external data lakes, such as Amazon S3 or Azure Data Lake Storage. This integration preserves the scalability and cost-effectiveness of data lakes while leveraging Iceberg’s efficiency and performance for querying structured data. It facilitates seamless data exploration, experimentation, archiving, and compliance across diverse use cases and industries.

Whether internally or externally managed, Snowflake’s Iceberg tables empower organizations to derive valuable insights, make data-driven decisions, and drive business value from their data assets. By optimizing data storage, access, and querying processes, Iceberg tables contribute to improved operational efficiency, enhanced analytical capabilities, and greater agility in responding to evolving business needs. Overall, Snowflake’s Iceberg tables represent a valuable asset for organizations striving to unlock the full potential of their data resources in today’s data-driven world.
