Apache Hive-4.x with Iceberg Branches & Tags

6 min read · Sep 19, 2023

Apache Iceberg Branch & Tags With Apache Hive 4.x

Introduction:

For sophisticated snapshot lifecycle management, Iceberg supports branches and tags, which are named references to snapshots with their own independent lifecycles. These lifecycles are controlled by branch-level and tag-level retention policies. Branches are independent lineages of snapshots and point to the head of their lineage. Tags are named references to individual snapshots.

Prerequisites:

Working knowledge of Apache Hive & Apache Iceberg is assumed.

Initial Setup:

Create a Docker environment running Hive 4.0.0. See the Hive Iceberg Docker QuickStart for the steps, and specify the version as 4.0.0.
Create an Iceberg table:

CREATE TABLE test (ID INT) STORED BY ICEBERG TBLPROPERTIES('format-version'='2');

Insert some data to generate multiple snapshots:

INSERT INTO test VALUES (1), (2);
INSERT INTO test VALUES (3), (4);
INSERT INTO test VALUES (5), (6);
INSERT INTO test VALUES (6), (7);

Fetch the list of snapshots for the Iceberg table:

SELECT * FROM default.test.history;

The above lists the snapshots of the table, any of which can be used for creating branches & tags.

Note: The snapshot IDs & timestamps will be different for everyone, so substitute your own values in the statements that follow.
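As a small sketch of how to pick the inputs for the statements that follow, the history listing can be narrowed to the columns that matter. The column names used here are the ones from Iceberg's history metadata table and are assumed to be exposed unchanged by Hive:

```sql
-- List snapshots oldest first.
-- snapshot_id is the value used with FOR SYSTEM_VERSION AS OF,
-- made_current_at is the value used with FOR SYSTEM_TIME AS OF.
SELECT snapshot_id, made_current_at, parent_id, is_current_ancestor
FROM default.test.history
ORDER BY made_current_at;
```

With the four inserts above, this should return four rows, one per snapshot.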

Creating a branch:

A branch can be created from a table via a simple Alter Table… Create Branch… statement that specifies the table & branch name alongside either the System Version (snapshot ID) or the System Time corresponding to the snapshot which should act as the HEAD of the branch.

Creating using SYSTEM_VERSION:

ALTER TABLE test CREATE BRANCH branch1 FOR SYSTEM_VERSION AS OF 3369973735913135680;

The above creates a branch named branch1 on table test, pointing at the specified snapshot ID (the second entry in the history table).

Creating using SYSTEM_TIME:

ALTER TABLE test CREATE BRANCH branch2 FOR SYSTEM_TIME AS OF '2023-09-16 09:46:38.939 Etc/UTC';

The above creates a branch named branch2 on table test, corresponding to the state of the table at the specified timestamp (the third entry in the history table).

Creating with default configurations:

If no snapshot version or timestamp is specified, the branch is created pointing to the current main branch (the current state of the table) of the Iceberg table.

ALTER TABLE test CREATE BRANCH branch3;

The above creates a branch named branch3 which is in the same state as the current table test.

Optionally, the number of snapshots to retain on the branch can be specified while creating it:

ALTER TABLE test CREATE BRANCH branch4 FOR SYSTEM_VERSION AS OF 3369973735913135680 WITH SNAPSHOT RETENTION 5 SNAPSHOTS;

Listing created Branches & Tags:

The created branches & tags can be listed using the refs metadata table of Iceberg.

select * from default.test.refs;
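To read the listing more easily, the query can select just the identifying columns. This is a sketch that assumes Hive exposes the column names of Iceberg's refs metadata table (name, type, snapshot_id, min_snapshots_to_keep) as they are:

```sql
-- One row per branch or tag; the main branch is listed as well.
-- For branch4, min_snapshots_to_keep should reflect the retention
-- given at creation time.
SELECT name, type, snapshot_id, min_snapshots_to_keep
FROM default.test.refs
ORDER BY name;
```

Matching snapshot_id against the history table shows which snapshot each reference currently points to.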

Writing into the Branch:

Data can be written into an Iceberg table branch in the same way as into a normal Iceberg table. In the queries, the branch is addressed using the following naming format:

<Database Name>.<Table Name>.branch_<Branch Name>

Insert into a branch

INSERT INTO TABLE default.test.branch_branch1 VALUES (10), (11);

The above inserts two rows into ‘branch1’ of table ‘test’ in the ‘default’ database. The table has a single column, so each value is written as its own row.

Update values in a branch

UPDATE default.test.branch_branch1 SET ID=20 WHERE ID=10;

The above query updates the matching rows in ‘branch1’ of table ‘test’ in the ‘default’ database.

Delete values in a branch

DELETE FROM default.test.branch_branch1 WHERE ID=11;

The above query deletes the specified value from ‘branch1’ of table ‘test’ in the ‘default’ database.

→ Apart from these, other write queries such as Merge, Insert Overwrite (IOW) and Load are also supported with Iceberg branches ←
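As an illustration of one of these other write paths, here is a minimal Insert Overwrite sketch against a branch. It assumes the branch naming format shown earlier is accepted as the target of INSERT OVERWRITE, and the filter is only an example:

```sql
-- Replace the contents of branch1 with the rows of main whose ID is below 5.
-- Only branch1 is rewritten; main and the other branches keep their data.
INSERT OVERWRITE TABLE default.test.branch_branch1
SELECT id FROM default.test WHERE id < 5;
```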

Querying a Branch:

An Iceberg branch can be queried like any other Hive table, using the same branch naming format.

Example:

select * from default.test.branch_branch1;

→ Apart from this, all other read queries are supported with Iceberg branches ←
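Since a branch behaves like a table in read queries, it can also be joined against main to see how the two have diverged. A hedged sketch, assuming the table and branch created earlier:

```sql
-- IDs that exist in branch1 but not in the main branch of the table.
SELECT b.id
FROM default.test.branch_branch1 b
LEFT JOIN default.test m ON b.id = m.id
WHERE m.id IS NULL;
```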

Fast-Forward a Branch:

An iceberg branch which is an ancestor of another branch can be fast-forwarded to the state of the other branch.

Example:

Insert some data into main branch

INSERT INTO test values (55), (66);

Now fast-forward the previously created branch3 to the current main branch

ALTER table test EXECUTE FAST-FORWARD 'branch3' 'main';

This fast-forwards the branch3 to the state of main. Querying branch3 now would show the newly inserted records.

Output of branch3 after being fast-forwarded to main branch
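One way to confirm the fast-forward, sketched here under the assumption that the refs metadata table exposes the name and snapshot_id columns, is to compare the snapshots the two branches point to:

```sql
-- After the fast-forward, branch3 and main should show the same snapshot_id.
SELECT name, snapshot_id
FROM default.test.refs
WHERE name IN ('main', 'branch3');
```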

If the second branch name is not specified, the main branch gets fast-forwarded to the branch specified.

Example-2:

Delete some values from ‘branch3’

DELETE FROM default.test.branch_branch3 WHERE ID=66;

Fast-Forward the main branch to the state of branch3

ALTER table test EXECUTE FAST-FORWARD 'branch3';

The above fast-forwards the main branch of the table test to the state of branch3.

Output of main branch of table test after being fast-forwarded to branch3

Cherry-Pick into a Branch:

As of now, cherry-picking commits is supported only on the main branch of an Iceberg table.

So, if we roll back to a previous commit and want to pull in the changes from just a single later commit, we can cherry-pick that commit.

Example:

Fetch the current history of the table:

SELECT * FROM default.test.history;

Rollback the current table:

ALTER table test EXECUTE ROLLBACK (3369973735913135680);

The above rolls back the main branch of the table to the second snapshot in the history list.

Cherry-pick the second-last snapshot from the history table, which inserted 55 & 66

ALTER table test EXECUTE CHERRY-PICK 8602659039622823857;
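To check the effect of the rollback and the cherry-pick, the table and its history can be queried again. This is only a verification sketch; the history column names are assumed to match Iceberg's history metadata table:

```sql
-- Current contents of main: the rows present at the rollback target,
-- plus the rows added by the cherry-picked snapshot.
SELECT id FROM default.test ORDER BY id;

-- The history should now end with the rollback and the cherry-pick.
SELECT snapshot_id, made_current_at, is_current_ancestor
FROM default.test.history
ORDER BY made_current_at;
```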

Deleting a Branch:

An iceberg branch can be deleted like:

ALTER TABLE test DROP BRANCH branch1;

The above query drops the ‘branch1’ of table test.

The drop branch syntax supports ‘IF EXISTS’ clause as well, to prevent errors if the branch is already deleted or doesn’t exist. Like:

ALTER TABLE test DROP BRANCH IF EXISTS branch1;

Creating a Tag:

A tag can be created from a table using the Alter Table… Create Tag… statement, specifying the table & tag name alongside either the System Version (snapshot ID) or the System Time corresponding to the snapshot which the tag should refer to.

Creating using SYSTEM_VERSION:

ALTER TABLE test CREATE TAG tag1 FOR SYSTEM_VERSION AS OF 3369973735913135680;

The above creates a tag named tag1 on table test, pointing at the specified snapshot ID (the second entry in the history table).

Creating using SYSTEM_TIME:

ALTER TABLE test CREATE TAG tag2 FOR SYSTEM_TIME AS OF '2023-09-16 09:46:38.939 Etc/UTC';

The above creates a tag named tag2 on table test, corresponding to the state of the table at the specified timestamp (the third entry in the history table).

Creating with default configurations:

If no snapshot version or timestamp is specified, the tag is created pointing to the current main branch (the current state of the table) of the Iceberg table.

ALTER TABLE test CREATE TAG tag3;

The above creates a tag named tag3 which is in the same state as the current table test.

Querying a Tag:

Iceberg tags support all kinds of read queries. The queries are performed by specifying the tag name along with the table name in the following format:

<Database Name>.<Table Name>.tag_<Tag Name>

Example:

SELECT * FROM default.test.tag_tag1;
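Because a tag pins a fixed snapshot, it is a convenient baseline for comparisons. A hedged sketch, assuming tag1 and the table from the earlier steps, that lists what main contains beyond the tagged state:

```sql
-- IDs present in the current table but absent from the snapshot tagged tag1.
SELECT m.id
FROM default.test m
LEFT JOIN default.test.tag_tag1 t ON m.id = t.id
WHERE t.id IS NULL;
```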

Deleting a Tag:

An iceberg tag can be deleted like:

ALTER TABLE test DROP TAG tag1;

The above query drops the tag1 corresponding to the table test.

The drop tag syntax supports ‘IF EXISTS’ clause as well, to prevent errors if the tag is already deleted or doesn’t exist. Like:

ALTER TABLE test DROP TAG IF EXISTS tag1;

Release:

The features described in this blog are part of the Apache Hive 4.0.0 release.

Note: The docker images are meant only for dev. usage.

Written by Ayush Saxena