Iceberg Tables in Snowflake
Iceberg tables are a new type of table in Snowflake where the actual data is stored outside of the Snowflake database. Instead, the data resides in a public cloud object storage location such as Amazon S3, Google Cloud Storage, or Azure Storage. These tables use the Apache Iceberg table format.
What Are Iceberg Tables?
- An Iceberg table uses the Apache Iceberg open table format specification. This format provides an abstraction layer on data files stored in open formats (such as Apache Parquet) and supports features like:
- ACID (atomicity, consistency, isolation, durability) transactions.
- Schema evolution: You can evolve the schema of an Iceberg table over time.
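- Hidden partitioning: Iceberg derives partition values from column data, so queries filter on ordinary columns and do not need to reference partition columns explicitly.
- Table snapshots: Each commit produces a snapshot that captures the state of the table at that point in time.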
- Iceberg tables are designed to store data outside of Snowflake, leveraging public cloud object storage (such as Amazon S3, Google Cloud Storage, or Azure Storage).
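To make the snapshot feature above more concrete, here is a minimal Python sketch of a table that records one immutable snapshot per commit. It is a toy in-memory model, not Iceberg's or Snowflake's implementation: `Snapshot`, `ToyTable`, the integer timestamps, and the file names are all hypothetical, and commits are assumed to arrive in time order.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """One immutable version of the table: an id, a commit time, and its data files."""
    snapshot_id: int
    committed_at: int              # illustrative logical timestamp
    data_files: tuple[str, ...]


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for a table's snapshot log."""
    snapshots: list[Snapshot] = field(default_factory=list)

    def append(self, committed_at: int, new_files: list[str]) -> Snapshot:
        # Each commit adds a new snapshot; earlier snapshots are never modified.
        current = self.snapshots[-1].data_files if self.snapshots else ()
        snap = Snapshot(len(self.snapshots) + 1, committed_at, current + tuple(new_files))
        self.snapshots.append(snap)
        return snap

    def as_of(self, timestamp: int) -> Snapshot:
        # Latest snapshot committed at or before the requested time.
        candidates = [s for s in self.snapshots if s.committed_at <= timestamp]
        if not candidates:
            raise LookupError("no snapshot exists at or before that time")
        return candidates[-1]


table = ToyTable()
table.append(committed_at=100, new_files=["data/a.parquet"])
table.append(committed_at=200, new_files=["data/b.parquet"])

# Reading "as of" time 150 sees only the first commit's file.
print(table.as_of(150).data_files)
```

Because old snapshots stay intact, a reader can keep working against an earlier version while a writer commits a new one.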
How Iceberg Tables Work:
Data Storage:
- Iceberg tables store their data and metadata files in the external cloud storage location. Snowflake does not provide Fail-safe storage for Iceberg tables; you are responsible for managing the external storage, including data protection and recovery.
- Snowflake connects to your storage location using an external volume, which is a named, account-level Snowflake object. The external volume stores an identity and access management (IAM) entity for your cloud storage. A single external volume can support one or more Iceberg tables.
Iceberg Catalog:
- An Iceberg catalog enables a compute engine (like Snowflake) to manage and load Iceberg tables.
- The catalog stores metadata pointers for one or more tables, mapping table names to the location of their current metadata files. Snowflake supports different catalog options, including using Snowflake itself as the Iceberg catalog or integrating with an external Iceberg catalog.
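The idea of a catalog as a mapping from table names to current metadata files can be sketched in a few lines of Python. This is a simplified illustration only: `ToyCatalog`, the table name, and the metadata paths are hypothetical. The compare-and-swap in `commit` reflects the general Iceberg approach of committing by atomically moving the metadata pointer, not any particular catalog's API.

```python
class ToyCatalog:
    """Hypothetical in-memory catalog: table name -> current metadata file location."""

    def __init__(self) -> None:
        self._pointers = {}

    def register(self, table: str, metadata_location: str) -> None:
        self._pointers[table] = metadata_location

    def current_metadata(self, table: str) -> str:
        # An engine loads a table by asking the catalog where its metadata lives.
        return self._pointers[table]

    def commit(self, table: str, expected: str, new: str) -> bool:
        # Compare-and-swap: move the pointer only if no one else committed first.
        if self._pointers.get(table) != expected:
            return False
        self._pointers[table] = new
        return True


v1 = "s3://my-bucket/my-path/metadata/v1.metadata.json"
v2 = "s3://my-bucket/my-path/metadata/v2.metadata.json"
v3 = "s3://my-bucket/my-path/metadata/v3.metadata.json"

catalog = ToyCatalog()
catalog.register("db.orders", v1)

first = catalog.commit("db.orders", expected=v1, new=v2)   # succeeds
stale = catalog.commit("db.orders", expected=v1, new=v3)   # rejected: pointer already moved
print(first, stale, catalog.current_metadata("db.orders"))
```

The rejected second commit shows why a single pointer per table is useful: a writer working from stale metadata cannot silently overwrite a newer version.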
Cross-Cloud/Cross-Region Support:
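- The storage location behind an external volume can be hosted by a different cloud provider, or in a different region, from the one that hosts your Snowflake account.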
Billing:
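- Snowflake does not bill for Iceberg table storage, because the data resides externally; your cloud provider bills you for that storage. Queries against Iceberg tables still consume Snowflake compute as usual.
- To set up an external volume for Iceberg tables, create it in Snowflake (for example, with the CREATE EXTERNAL VOLUME command) and grant Snowflake access to the storage location.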
Creating and Using Iceberg Tables:
Create an Iceberg Table:
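- Define an Iceberg table in Snowflake with the CREATE ICEBERG TABLE command, specifying the external volume and the location within it.
- Example: SQL
-- Assumes an external volume named my_external_volume (hypothetical) that points at s3://my-bucket.
CREATE ICEBERG TABLE my_iceberg_table (column1 STRING)
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 'my_external_volume'
  BASE_LOCATION = 'my-path';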
Querying Iceberg Tables:
- Query Iceberg tables just like regular Snowflake tables.
- Example: SQL
SELECT COUNT(*) FROM my_iceberg_table WHERE column1 = 'value';
Should Iceberg Be Used for Time Travel?
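Iceberg tables keep snapshots, which is the basis for point-in-time queries, but they do not come with the full data protection of regular Snowflake tables: there is no Fail-safe, and the history you can reach depends on which snapshots are retained in your external storage.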
- If you need time travel functionality, it’s recommended to use Snowflake’s built-in time travel features.
- You can create regular Snowflake tables (not Iceberg tables) and take advantage of time travel for historical data queries.
What Makes Iceberg Catalogs So Special in Snowflake?
What Is an Iceberg Catalog?
- An Iceberg catalog is a component that manages metadata for Iceberg tables.
- It provides information about table schemas, partitions, data files, and other relevant details.
- In the context of Snowflake, the Iceberg catalog allows Snowflake to interact with Iceberg tables stored externally on your chosen storage system (e.g., Amazon S3, Google Cloud Storage, or Azure Storage).
Why Are Iceberg Catalogs Special?
- External Storage Integration: Iceberg catalogs enable seamless integration between Snowflake and external storage systems.
- Cost Efficiency: By storing data externally, you can take advantage of lower storage costs compared to Snowflake’s native storage.
Data Management Features:
- ACID Transactions: Iceberg tables support atomicity, consistency, isolation, and durability.
- Schema Evolution: You can evolve the schema of an Iceberg table over time.
- Hidden Partitioning: Efficiently manage partitioned data.
- Table Snapshots: Create snapshots of the table at specific points in time.
- Query Performance: Iceberg tables maintain performance benefits similar to regular Snowflake tables while leveraging external storage.
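Hidden partitioning is easier to see in a small example. The following Python sketch is a toy model under stated assumptions: `ToyPartitionedTable`, `day_transform`, and the `event_time` column are hypothetical, and real engines prune using file-level metadata rather than an in-memory dictionary. The point it illustrates is that the writer and the reader only ever mention `event_time`, while the partition value is derived behind the scenes.

```python
from collections import defaultdict
from datetime import date, datetime


def day_transform(ts: datetime) -> date:
    """Partition transform: derive the partition value from the source column."""
    return ts.date()


class ToyPartitionedTable:
    """Hypothetical table partitioned by day(event_time)."""

    def __init__(self) -> None:
        self.partitions = defaultdict(list)

    def insert(self, row: dict) -> None:
        # The writer never supplies a partition column; it is computed from event_time.
        self.partitions[day_transform(row["event_time"])].append(row)

    def scan(self, start: datetime, end: datetime):
        """Return (rows, partitions_read) for start <= event_time < end."""
        # Prune: keep only partitions whose day can overlap the requested range.
        wanted = [d for d in self.partitions if start.date() <= d <= end.date()]
        rows = [
            r
            for d in wanted
            for r in self.partitions[d]
            if start <= r["event_time"] < end
        ]
        return rows, len(wanted)


t = ToyPartitionedTable()
t.insert({"id": 1, "event_time": datetime(2024, 1, 1, 9, 0)})
t.insert({"id": 2, "event_time": datetime(2024, 1, 2, 10, 0)})
t.insert({"id": 3, "event_time": datetime(2024, 1, 3, 11, 0)})

# The filter is written against event_time only; one of three partitions is read.
rows, partitions_read = t.scan(datetime(2024, 1, 2, 0, 0), datetime(2024, 1, 2, 23, 59))
print([r["id"] for r in rows], partitions_read)
```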
Using Iceberg Catalogs in Snowflake:
- When creating an Iceberg table in Snowflake, you have two options for the catalog:
1. Snowflake as the Iceberg Catalog:
- You can use Snowflake itself as the Iceberg catalog.
- Set the CATALOG parameter to 'SNOWFLAKE' in the CREATE ICEBERG TABLE command.
2. External Catalog Integration:
- Connect to an external Iceberg catalog using a catalog integration.
- Specify the external catalog details during table creation.
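As a rough illustration of how the two options differ, the Python helper below renders a CREATE ICEBERG TABLE statement for either case. Treat it as a sketch: `iceberg_table_ddl` and every object name passed to it are hypothetical, the parameter names (EXTERNAL_VOLUME, BASE_LOCATION, CATALOG_TABLE_NAME) reflect my understanding of Snowflake's syntax and should be checked against the current documentation, and CATALOG_TABLE_NAME in particular applies only to some external catalog types. The helper does no quoting or escaping, so it is not suitable for untrusted input.

```python
def iceberg_table_ddl(table, external_volume, *, columns=None, base_location=None,
                      catalog_integration=None, catalog_table_name=None):
    """Render a CREATE ICEBERG TABLE statement for one of the two catalog options."""
    if catalog_integration is None:
        # Option 1: Snowflake is the catalog, so the statement defines the columns
        # and says where under the external volume the table files should live.
        if not columns or base_location is None:
            raise ValueError("Snowflake-managed tables need columns and a base location")
        cols = ", ".join(f"{name} {type_}" for name, type_ in columns)
        return (
            f"CREATE ICEBERG TABLE {table} ({cols})\n"
            f"  CATALOG = 'SNOWFLAKE'\n"
            f"  EXTERNAL_VOLUME = '{external_volume}'\n"
            f"  BASE_LOCATION = '{base_location}';"
        )
    # Option 2: an external catalog owns the metadata, so the statement names the
    # catalog integration and the table as it is known in that catalog.
    if catalog_table_name is None:
        raise ValueError("external catalogs need the table name in the source catalog")
    return (
        f"CREATE ICEBERG TABLE {table}\n"
        f"  EXTERNAL_VOLUME = '{external_volume}'\n"
        f"  CATALOG = '{catalog_integration}'\n"
        f"  CATALOG_TABLE_NAME = '{catalog_table_name}';"
    )


managed = iceberg_table_ddl(
    "my_iceberg_table", "my_external_volume",
    columns=[("id", "INT"), ("name", "STRING")], base_location="my-path",
)
external = iceberg_table_ddl(
    "my_iceberg_table", "my_external_volume",
    catalog_integration="my_catalog_int", catalog_table_name="my_source_table",
)
print(managed)
print(external)
```

The visible difference is where the schema comes from: with Snowflake as the catalog the statement declares the columns, while with an external catalog the schema is read from that catalog's metadata.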
Example: Using Snowflake as the Iceberg Catalog:
- To query Iceberg tables using the Apache Spark engine, configure the following properties for your Spark cluster:
# Supply either jdbc.password or jdbc.private_key_file, depending on how the user authenticates.
spark-shell --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.13:1.2.0,net.snowflake:snowflake-jdbc:3.13.28 \
  --conf spark.sql.catalog.snowflake_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.snowflake_catalog.catalog-impl=org.apache.iceberg.snowflake.SnowflakeCatalog \
  --conf "spark.sql.catalog.snowflake_catalog.uri=jdbc:snowflake://<account_identifier>.snowflakecomputing.com" \
  --conf "spark.sql.catalog.snowflake_catalog.jdbc.user=<user_name>" \
  --conf "spark.sql.catalog.snowflake_catalog.jdbc.password=<password>" \
  --conf "spark.sql.catalog.snowflake_catalog.jdbc.private_key_file=<location of the private key>"
- After configuration, you can query available tables:
spark.sessionState.catalogManager.setCurrentCatalog("snowflake_catalog")
spark.sql("SHOW NAMESPACES").show()
spark.sql("SHOW TABLES").show()
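If you launch Spark from a script, it can help to assemble the configuration above programmatically so that each property is passed as a single key=value argument. The Python sketch below does only that: `spark_shell_command` is a hypothetical helper, the account, user, and key path are placeholders, and the rule that exactly one of password or private key file is given is an assumption made for this example.

```python
import shlex

CATALOG = "snowflake_catalog"
PACKAGES = (
    "org.apache.iceberg:iceberg-spark-runtime-3.3_2.13:1.2.0,"
    "net.snowflake:snowflake-jdbc:3.13.28"
)


def spark_shell_command(account_identifier, user, *, password=None, private_key_file=None):
    """Build the spark-shell argument list for the Snowflake Iceberg catalog."""
    # Assumption for this sketch: authenticate with exactly one credential type.
    if (password is None) == (private_key_file is None):
        raise ValueError("provide exactly one of password or private_key_file")

    prefix = f"spark.sql.catalog.{CATALOG}"
    conf = {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.snowflake.SnowflakeCatalog",
        f"{prefix}.uri": f"jdbc:snowflake://{account_identifier}.snowflakecomputing.com",
        f"{prefix}.jdbc.user": user,
    }
    if password is not None:
        conf[f"{prefix}.jdbc.password"] = password
    else:
        conf[f"{prefix}.jdbc.private_key_file"] = private_key_file

    args = ["spark-shell", "--packages", PACKAGES]
    for key, value in conf.items():
        # Each property is one argument with no spaces around "=".
        args += ["--conf", f"{key}={value}"]
    return args


args = spark_shell_command("my_account", "alice", private_key_file="/keys/rsa_key.p8")
command = shlex.join(args)   # shell-quoted string, e.g. for logging or copy-paste
print(command)
```

Returning a list rather than a string means the arguments can be handed directly to `subprocess.run` without going through a shell.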