Apache Hive 4.x With Apache Iceberg (Part-I)

Ayush Saxena
3 min readJul 18, 2023

--

Apache Hive 4.x + Apache Iceberg 1.3.0

Introduction to Iceberg in Hive:

Iceberg is an open table format designed for large analytic workloads. It uses snapshots to store modifications to tables and provides several niche features like Time Travel, Partition/Schema Evolution and many others. Hive 4.x provides a way to use the iceberg table format out of the box like a normal hive table.

Here we will be looking at some basic stuff towards getting started with Iceberg in Hive & some latest stuff being introduced for iceberg in Apache 4.x release.

Creating an Iceberg table:

An iceberg table can be created like normal Hive tables by just specifying the “Stored by Iceberg” clause in the queries

  • Using the create statement:
create external table test_iceberg(id int, value string) stored by iceberg;
  • Using CTAS:
create external table tbl_ice stored by iceberg tblproperties ('format-version'='2') as
select * from source;
  • Using CTLT:
create external table iceberg_ctlt like acid_table stored by iceberg;

The source of CTAS/CTLT table can be either an external or a managed table.

Migrating an Existing Table to Iceberg:

Existing Hive external tables can be easily converted into iceberg tables by using the Alter queries with “Convert To Iceberg” clause

ALTER TABLE tableName CONVERT TO ICEBERG;

As of now only external tables can be converted to iceberg tables directly.

Ingesting data to Iceberg tables

Data can be ingested to iceberg tables like normal hive tables, Hive allows almost all ways of inserting data to iceberg tables like, Insert queries/Insert-Overwrite and now from Hive-4.x the LOAD queries as well.

  • Using the Load queries:

Iceberg LOAD queries for tables which haven’t undergone Schema/Partition evolution, hive directly uses the Iceberg Append API which prevents rewrite of the data files.

Example:

LOAD DATA LOCAL INPATH '/data/files/doctors.avro' OVERWRITE INTO TABLE ice_avro;
  • Load with partition clause
LOAD DATA LOCAL INPATH '/data/files/parquet_partition/pcol=100' INTO TABLE
ice_parquet_partitioned PARTITION (pcol='100');

If the config “hive.load.data.use.native.api” is set to “false”, hive will rather than using the Append api, will launch a Tez job to do the LOAD.

Reading Iceberg tables:

Hive supports all kinds of read queries on the iceberg table like any other hive managed tables. The iceberg table can be queried like a normal hive table. like:

SELECT * FROM iceberg_table where id=10;

Hive supports both Vectorized & Non-Vectorized modes of reading iceberg tables.

Snapshot Management:

Hive provides several ways to play with the iceberg snapshots

  • Set current snapshot:
ALTER TABLE iceberg_table EXECUTE SET_CURRENT_SNAPSHOT(<snapshotId>);

The above sets the current snapshot of the table to the snapshot id provided.

  • Rollback Snapshot:
ALTER TABLE tbl_ice EXECUTE ROLLBACK(<snapshotId>)

The rollback commands, rollsback the table to a prior snapshot state of the table. (The Rollback commands was added prior to SET_CURRENT_SNAPSHOT, which can be used to achieve same results)

  • Expire Snapshots:

Hive provides several ways to expire a snapshot:

  • Specifying a list of snapshots to expire:
ALTER TABLE iceberg_table EXECUTE EXPIRE_SNAPSHOTS('<comma seperated list of snapshot ids>')
  • Expiring Snapshots older than a particular timestamp
ALTER TABLE iceberg_table EXECUTE EXPIRE_SNAPSHOTS('<timestamp>');

Time Travel:

Iceberg tables can be queried to get the state of a the table from past. Time travel queries can be performed by either using a snapshot id or a timestamp

  • Using timestamp:
SELECT * FROM tableName 
FOR SYSTEM_TIME AS OF '<TIMESTAMP>';
  • Using snapshot id:
SELECT * FROM tableName FOR SYSTEM_VERSION AS OF <Snapshot Id>

Coming in Next Sections?

  • Materialized Views & Iceberg
  • Branches & Tags (Published)
  • More advanced ways of playing with snapshots
  • Getting started with Hive-Iceberg with Docker
  • Metadata Tables & Operations around them
  • Specifics around Iceberg V2 tables
  • Performance Comparisons with Iceberg Tables.

Release:

The blog mentions features being part of Apache Hive-4.0.0 release

— — — — — — — — — — — — — — END — — — — — — — — — — — — — — — — — — — -

--

--

Ayush Saxena

ASF Member || Apache Hadoop, Apache Hive, Apache Tez. Apache Ozone & Apache Incubator PMC Member & Committer || Staff Software Engineer @Cloudera