Apache Iceberg and Google Cloud
Want to time-travel, query a lake at a point in time, and evolve schemas effortlessly? Then read on.
Table formats and Data Lakes
Sitting down and choosing a table format is the first and biggest decision when implementing a data lake or data mesh strategy. Pick CSV and you limit your ability to evolve schemas. Go with Avro and reading and querying that row-based storage format may take a while. So what data lake storage layers are available?
From my reading, Apache Hudi, Apache Iceberg and Databricks Delta Lake dominate. With Hudi it’s all about streaming updates, with Iceberg it’s really large-scale data lakes, and Delta started with ACID transactions but now has a whole heap more features.
Google Cloud’s BigLake recently announced support for Iceberg, which gives users BigQuery’s benefits on incredibly large GCS data as if it were stored natively.
Apache Iceberg
https://github.com/apache/iceberg
Self-described as `The open table format for analytic datasets`. As a cheer-leader for open source and open standards, let’s dive in.
Iceberg is a table format: in addition to writing the data to storage, you also register it as a table. Iceberg tracks files with manifest files instead of directory listings, so old data on storage is *not removed* when the table changes.
Iceberg uses a combination of manifest files and metadata catalogues. The table name in the catalogue is a pointer to the last commit, i.e. to a particular metadata file. Let’s take a look in the shell.
> ls iceberg/
data/
metadata/

> ls iceberg/data/
creation_date=2015-01-01/
creation_date=2015-01-02/
Iceberg uses Hive-style partitioning by default
> ls iceberg/metadata/
00000-55gcadf-f081-sdfc.metadata.json
0swscs-kcdedc-1873-b4b0-m0.avro
Here we can see two folders in the main `iceberg` folder. Inside the `data` folder we see our two partitions, and in the `metadata` folder a number of files, including `.avro` and JSON.
In the Iceberg catalog, there’s a table entry that points to the current metadata file.
- That metadata file (s0) points to a manifest list.
- The manifest list points to one or many manifest files.
- The manifest files reference the data files that make up the dataset.
What happens if you update the data?
- It creates a new metadata file
- Which points to a new manifest list
- Which then points to the manifest files
- Which in turn point to the new data files that make up the dataset
Here’s a diagram:
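That pointer chain can be sketched in plain Python. This is a toy model, not the Iceberg library; every file name and dict layout below is illustrative only:

```python
# Toy model of Iceberg's metadata chain:
# catalog -> metadata file -> manifest list -> manifest files -> data files.
# All names below are made up for illustration.

catalog = {"db.events": "00001.metadata.json"}  # table name -> current metadata

metadata_files = {
    "00000.metadata.json": {"manifest-list": "snap-0.avro"},  # s0, before update
    "00001.metadata.json": {"manifest-list": "snap-1.avro"},  # s1, after update
}

manifest_lists = {
    "snap-0.avro": ["manifest-a.avro"],
    "snap-1.avro": ["manifest-a.avro", "manifest-b.avro"],
}

manifests = {
    "manifest-a.avro": ["data/creation_date=2015-01-01/file-1.parquet"],
    "manifest-b.avro": ["data/creation_date=2015-01-02/file-2.parquet"],
}

def data_files(table_name):
    """Follow the chain from the catalog down to the concrete data files."""
    metadata = metadata_files[catalog[table_name]]
    files = []
    for manifest in manifest_lists[metadata["manifest-list"]]:
        files.extend(manifests[manifest])
    return files

print(data_files("db.events"))
```

Note that an update only writes a new metadata file and repoints the catalog entry; the old chain, and the data files it references, stays on storage, which is what makes time travel possible.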
An important thing to note here is that the metadata file contains (among other things):
- `"location": "gs://<BUCKET>/iceberg"`
- `"schema": {"fields": [{"id": 1}]}`
- `"snapshots": [{"manifest-list", "schema-id", "snapshot-id", "summary"}]`
- The metadata.json’s `location` points to where your data is stored.
- The `schema` supports schema evolution, so you can view the schema at any point in time.
- `snapshots` contains a listing of manifest lists that stitch all the data together.
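Since the metadata file is just JSON, you can read those fields with a few lines of plain Python. The document below is a cut-down, hypothetical metadata file, not a complete Iceberg schema:

```python
import json

# Cut-down, hypothetical metadata.json content for illustration only.
raw = """
{
  "location": "gs://<BUCKET>/iceberg",
  "schema": {"fields": [{"id": 1, "name": "creation_date", "type": "date"}]},
  "snapshots": [
    {"snapshot-id": 101, "schema-id": 0,
     "manifest-list": "snap-101.avro",
     "summary": {"operation": "append"}}
  ]
}
"""

metadata = json.loads(raw)
print(metadata["location"])                                   # where the data lives
print([s["snapshot-id"] for s in metadata["snapshots"]])      # available snapshots
```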
Snapshots
Point-in-time queries, or time travel: you can supply a `snapshot-id` or an `as-of-timestamp`. Really useful. 😎
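How an `as-of-timestamp` read resolves to a snapshot can be sketched in plain Python. This is a toy model of the snapshot log, not the Iceberg library; the timestamps are illustrative:

```python
# Toy snapshot log: each entry records when that snapshot was committed.
# The timestamp-ms values are made up for illustration.
snapshots = [
    {"snapshot-id": 101, "timestamp-ms": 1_000},
    {"snapshot-id": 102, "timestamp-ms": 2_000},
    {"snapshot-id": 103, "timestamp-ms": 3_000},
]

def snapshot_as_of(snapshots, as_of_timestamp_ms):
    """Pick the latest snapshot committed at or before the given time --
    the behaviour an as-of-timestamp read needs."""
    eligible = [s for s in snapshots if s["timestamp-ms"] <= as_of_timestamp_ms]
    if not eligible:
        raise ValueError("no snapshot at or before that timestamp")
    return max(eligible, key=lambda s: s["timestamp-ms"])

print(snapshot_as_of(snapshots, 2_500)["snapshot-id"])  # → 102
```

A `snapshot-id` read is simpler still: it looks up that exact snapshot and reads the manifest list it points to.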
GCP BigLake
https://cloud.google.com/biglake
Apache Spark already has rich support for Iceberg, allowing customers to use Iceberg’s core capabilities, such as DML transactions and schema evolution, to carry out large-scale transformation and data processing.
BigLake is a storage engine that unifies data warehouses and lakes by enabling BigQuery and open source frameworks like Spark to access data with fine-grained access control.
Regardless of your choice of Spark deployment, BigLake automatically makes those Iceberg tables available for end users to query.
Access
Administrators can now use Iceberg tables much like BigLake tables, and don’t need to give end users access to the underlying GCS bucket. In fact, you can get as granular as row- and column-level access or data masking, extending the BigLake governance framework to Iceberg tables.
Wrap up
Any extension of BigQuery’s core functionality can be powerful. It’s great to see open standards getting adopted.
Coupled with Google’s Analytics Hub, data providers can now create shared datasets to share Iceberg tables on GCS. Consumers of the shared data can use any Iceberg-compatible query engine to consume it, providing an open and governed model for sharing and consuming data.
You can even use BigQuery ML to extend machine learning workloads to Iceberg tables stored on GCS.