Understanding Iceberg Table Metadata

Dated: 30-Jan-2023

Companies are choosing data lakes to store their data for analytics and insights. Any challenges they encounter with data lakes are often related to performance and ACID-compliant transactions when multiple consumers are trying to work on the same data set. However, using a table format such as Apache Iceberg in the data lake architecture can resolve a lot of these issues.

In this blog post we will look closely at how Apache Iceberg’s metadata enables fast performance for analytical queries over large data sets.

What Are Iceberg Tables?

Apache Iceberg is an open table format for data lake architectures with large analytic data sets. It was originally developed at Netflix and later donated to the Apache Software Foundation. Apache Iceberg is fast becoming the de facto solution for managing data in data lakes. Some of the great features that Iceberg Tables offer are:

  • Agnostic to processing engines — Multiple compute engines such as Spark, Flink, Trino, and Snowflake can co-exist and work on the same data set in a consistent and transactional manner.
  • Consistent results — Iceberg’s support for transactions and safe concurrency addresses long-standing correctness problems that we see in other table formats such as Hive.
  • In-place table evolution — Iceberg provides several features customers want from data lake tables, including:
    — Schema evolution
    — Hidden partitioning
    — Time travel
    — Partition evolution
  • Better performance — Iceberg avoids expensive directory listings when planning queries, which is a major issue with other table formats like Hive.

Apache Iceberg Table Metadata and Data Layers

Because a Hive table’s state is reconstructed by listing files, changes are not atomic and can produce inaccurate results, and listing operations are slow for large-scale analytics. Apache Iceberg overcomes these problems by maintaining its own metadata layer. As you can see in the diagram below, Iceberg persists data in two layers: the metadata layer, and the data layer, which holds the actual data in formats such as Apache Parquet, Avro, or ORC. Iceberg’s file pruning intelligence comes from the metadata layer: it first tracks which files and folders can be eliminated, and then scans the data file stats to decide whether a given file needs to be read for a specific query.

Metadata layer:

  • Metadata file — The metadata file stores information about a table schema, partition information, and the snapshot details for the table.
  • Manifest list file — Stores information about all the manifest files that are part of a snapshot and acts as an index over them, along with additional details such as how many data files were added or deleted, and the partition boundaries.
  • Manifest file — Stores the list of data files (in Parquet/ORC/Avro format), along with the column-level metrics and stats used for file pruning.

Data Layer:

Data files: Files in Parquet, ORC, or Avro format that store the actual data for the Iceberg Table.

Deep Dive into Iceberg Metadata

Creating Iceberg Tables

Before we dig into the metadata layer of the Iceberg Table, let’s create one first. For this demonstration we are using an Azure Data Lake Storage Gen2 storage account to persist both the metadata and the table data, and a Spark cluster as the compute to create and work on the Iceberg Table. We are partitioning the table by month using the column ORDERED_DATE.

CREATE DATABASE IcebergDB;

CREATE TABLE IcebergDB.InvoiceDetails (
  INVOICE_NUMBER bigint,
  CUSTOMER_NAME varchar(25),
  CUSTOMER_ADDRESS varchar(40),
  CLERK_NAME varchar(15),
  LINE_NUMBER int,
  PART_NAME varchar(55),
  BRAND varchar(10),
  QUANTITY int,
  PRICE decimal(12,2),
  DISCOUNT decimal(12,4),
  TAX decimal(12,4),
  TOTAL_PRICE decimal(12,3),
  ORDERED_DATE date
) USING iceberg
PARTITIONED BY (months(ORDERED_DATE));
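Under the hood, Iceberg’s months() transform maps each date to the number of whole months elapsed since the Unix epoch (1970-01-01); human-readable YYYY-MM folder names are derived from that value. A minimal Python sketch of the transform (the helper names here are ours, not Iceberg’s):

```python
from datetime import date

def months_transform(d: date) -> int:
    """Iceberg months() partition transform: whole months since 1970-01-01."""
    return (d.year - 1970) * 12 + (d.month - 1)

def partition_folder(d: date) -> str:
    """Human-readable path segment for a month partition, e.g. '1992-06'."""
    return f"{d.year:04d}-{d.month:02d}"

# ORDERED_DATE = 1992-06-25 maps to partition value 269, folder "1992-06"
print(months_transform(date(1992, 6, 25)), partition_folder(date(1992, 6, 25)))
```

Because the partition value is derived from the column, queries filter on ORDERED_DATE directly and never need to know about this transform; that is the basis of Iceberg’s hidden partitioning.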

Iceberg uses a tree structure to store a snapshot of the table for every DML/DDL operation performed on it. As we can see from the metadata folder, multiple metadata.json files have been created. Any time you perform a DML or DDL operation, a new metadata file is created.

The first thing we need to determine is the latest metadata.json file. There are multiple ways to accomplish this. You could look at the increasing version numbers (for example, v1.metadata.json, v2.metadata.json) to identify the latest one, or you could open the version-hint.text file and find the number that points to the latest metadata.json file. Below is the snippet from version-hint.text:

Image by Author: Metadata json files
Image by Author: Content from Version-hint file

As you can see in the example above, the latest number is 2 and so the latest version of the metadata file will be v2.metadata.json.
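For catalogs that maintain a version hint (such as the Hadoop catalog used here), resolving the latest metadata file can be done programmatically. A minimal sketch in Python; the function name is ours, and we assume the file content is just the version number, as shown above:

```python
def latest_metadata_file(version_hint: str) -> str:
    """Map the contents of version-hint.text to the latest metadata file name."""
    version = int(version_hint.strip())
    return f"v{version}.metadata.json"

# version-hint.text contains "2", so the latest file is v2.metadata.json
print(latest_metadata_file("2\n"))
```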

Now let’s insert a single record into the Iceberg Table and examine the metadata that is created.

INSERT INTO IcebergDB.InvoiceDetails
SELECT 59999846, 'Customer#001083696', '9atzserwefwdf,', 'Clerk#000009978', 1,
  'wer brown magenta black', 'Brand#55', 25, 990.01, 0.002, 0.001, 52091.36,
  DATE '1992-06-25';

Metadata JSON File

Now we will go through all the files that are part of the metadata layer, starting with the v2.metadata.json file. As you can see in the image below, this file holds the schema of the table, the current snapshot id, and the location of the manifest list file. This information lets us identify the latest snapshot id and the latest manifest list file for the table’s snapshot. Note that the metadata file generated in the customer-managed storage only has the details about the latest snapshot id.

Metadata.json file

Image by Author: Metadata json content

This file holds the details of the current snapshot and its associated manifest list. Beyond the current snapshot, it also stores snapshot history, which enables features such as time travel. Along with the table schema, it stores the partition details and the id of the column used for partitioning the table, which supports partition evolution. We will now generate a new metadata.json file by inserting more records into the table. Here we are using a TPC-H sample database to insert five months of data, partitioned by month, into our Iceberg Table.

INSERT INTO IcebergDB.InvoiceDetails
SELECT o_orderkey AS invoicenumber,
       c_name AS customername,
       c_address AS customeraddress,
       o_clerk AS clerkname,
       l_linenumber AS linenumber,
       p_name AS partname,
       p_brand AS brand,
       l_quantity AS quantity,
       p_retailprice AS price,
       l_discount AS discount,
       l_tax AS tax,
       o_totalprice AS total_price,
       o_orderdate AS ordered_date
FROM tpch.orders o
LEFT JOIN tpch.lineitem l ON o_orderkey = l_orderkey
LEFT JOIN tpch.part p ON p_partkey = l_partkey
LEFT JOIN tpch.customer c ON o.o_custkey = c.c_custkey
LEFT JOIN tpch.nation n ON c_nationkey = n_nationkey
WHERE o_orderdate BETWEEN '1992-01-01' AND '1992-05-31'
ORDER BY o_orderdate;

Below is the snapshot information from the latest metadata.json file:

Image by Author: Snapshot information in Metadata json file

You can see the list of new files that were added (12 files) as well as the new partitions created as part of the transaction. In addition, the metadata.json file maintains the snapshot history.

Now that we understand the metadata JSON file, let’s look more closely at the manifest list to better understand how Iceberg uses this file for directory pruning based on the predicate clause of a select query. You can simply take the current-snapshot-id from the latest metadata.json file and look up the name and path of the manifest list file. The manifest list file name looks like snap-(current-snapshot-id)*.avro, as shown in the screenshot below:

Image by Author: Manifest List Location
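This lookup can be sketched programmatically. The field names below (current-snapshot-id, snapshots, snapshot-id, manifest-list) come from the Iceberg table spec, but the snapshot id and storage path are made up for illustration:

```python
import json

# Illustrative metadata.json content; real files carry many more fields.
metadata = json.loads("""
{
  "current-snapshot-id": 5938119081793406032,
  "snapshots": [
    {"snapshot-id": 5938119081793406032,
     "manifest-list": "abfss://lake@demo.dfs.core.windows.net/IcebergDB/InvoiceDetails/metadata/snap-5938119081793406032-1-example.avro"}
  ]
}
""")

def manifest_list_path(meta: dict) -> str:
    """Return the manifest list location of the current snapshot."""
    current = meta["current-snapshot-id"]
    for snap in meta["snapshots"]:
        if snap["snapshot-id"] == current:
            return snap["manifest-list"]
    raise KeyError("current snapshot not found")

print(manifest_list_path(metadata))
```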

Manifest List File

As we have already discussed, the manifest list contains the list of manifest files that are part of a snapshot, along with some additional details, such as how many data files were added or deleted. Let’s see that in action. All manifest list and manifest files are in Avro format. Below are the details you will see in the manifest list file. The manifest list includes metadata that can be used to avoid scanning all of a snapshot’s manifest files when planning a query.

Image by Author: Manifest List Details

Some of the required fields in the manifest list file are:

  • manifest_path — Location of the manifest file
  • manifest_length — Length of the manifest file in bytes
  • partition_spec_id — ID of the partition spec used to write the manifest; must be listed in table metadata
  • added_snapshot_id — ID of the snapshot in which the manifest file was added

You don’t see the actual column values here because the lower and upper bounds of the partition values are serialized to bytes using Iceberg’s single-value serialization.
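Those serialized bounds can be decoded if you know the column type. Per the Iceberg spec’s single-value serialization, int and date values are stored as 4-byte little-endian integers, with dates counted as days since 1970-01-01. A minimal sketch (the decoder names are ours):

```python
import struct
from datetime import date, timedelta

def decode_int_bound(raw: bytes) -> int:
    """Decode a 4-byte little-endian int bound (Iceberg single-value serialization)."""
    return struct.unpack("<i", raw)[0]

def decode_date_bound(raw: bytes) -> date:
    """Dates are stored as an int count of days since 1970-01-01."""
    return date(1970, 1, 1) + timedelta(days=decode_int_bound(raw))

# Example: round-trip a lower bound for ORDERED_DATE = 1992-06-25
days = (date(1992, 6, 25) - date(1970, 1, 1)).days
raw = struct.pack("<i", days)
print(decode_date_bound(raw))  # 1992-06-25
```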

In the next section we will talk about the manifest file; you can find its location in the manifest_path field.

Manifest File

This file contains the column-level stats for the Iceberg Table. As you can see in the snippet below, Iceberg metadata has captured various column stats, such as value_counts, null_value_counts, and lower and upper bound values. This information is very helpful during the query planning phase; it helps identify which Parquet files need to be skipped and which need to be read.

Image by Author: Metadata File Details

If you further expand the value counts, you will find the number of values recorded for each of these columns.

Image by Author: Value count information
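These stats drive the skip-or-read decision during planning. A simplified sketch of the check, assuming the lower/upper bounds have already been decoded for a column (real planning also considers null counts, deletes, and more):

```python
from datetime import date

def can_skip_file(lower, upper, pred_lo, pred_hi) -> bool:
    """Skip a data file if the predicate range [pred_lo, pred_hi]
    cannot overlap the file's [lower, upper] value range."""
    return pred_hi < lower or pred_lo > upper

# Hypothetical file covering ORDERED_DATE values 1992-03-01 .. 1992-03-31
lower, upper = date(1992, 3, 1), date(1992, 3, 31)
print(can_skip_file(lower, upper, date(1992, 1, 1), date(1992, 2, 15)))  # True: no overlap, skip
print(can_skip_file(lower, upper, date(1992, 3, 15), date(1992, 4, 1)))  # False: overlaps, must read
```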

Now let’s insert more records into the Iceberg Table and look at both the data and metadata that is generated.

INSERT INTO IcebergDB.InvoiceDetails
SELECT o_orderkey AS invoicenumber,
       c_name AS customername,
       c_address AS customeraddress,
       o_clerk AS clerkname,
       l_linenumber AS linenumber,
       p_name AS partname,
       p_brand AS brand,
       l_quantity AS quantity,
       p_retailprice AS price,
       l_discount AS discount,
       l_tax AS tax,
       o_totalprice AS total_price,
       o_orderdate AS ordered_date
FROM tpch.orders o
LEFT JOIN tpch.lineitem l ON o_orderkey = l_orderkey
LEFT JOIN tpch.part p ON p_partkey = l_partkey
LEFT JOIN tpch.customer c ON o.o_custkey = c.c_custkey
LEFT JOIN tpch.nation n ON c_nationkey = n_nationkey
WHERE o_orderdate BETWEEN '1993-01-01' AND '1993-06-30'
ORDER BY o_orderdate;

When you execute the above command, you will find that a new metadata.json file is generated.

Image by Author: New Snapshot

Let’s look into the manifest list file and the manifest file to understand the column-level stats that are created after inserting more records in the table. You can see in the screenshot below that the manifest list tracks the count of records that were inserted into the table, along with the number of data files created.

Image by Author: Manifest List Details

Looking more closely at the manifest file, we see entries for the 14 data files created as part of the transaction. The snippet below is from one of those entries; it records how many records are in the data file as well as the number of values in each column. These stats help Iceberg skip a data file (file pruning) when the queried value does not fall between the file’s lower and upper bound values.

Image by Author: Metadata File column stats for each parquet file

In the section below, we will look at how Spark uses the table metadata to optimize queries: record counts, lower/upper bound values, and the predicate clause of the select statement drive partition pruning (skipping partitioned folders), while the column statistics for the Parquet files identify which data files can be skipped when fetching the records for a query.

Using Metadata during Query Execution

Since we have partitioned our table on ORDERED_DATE, a folder is created in the external storage for every month, in YYYY-MM format, as shown below, and multiple Parquet files are created in those folders.

Image by Author: Iceberg Table data partitioned by month

Below is a snippet of the Parquet files created for the month 1992-02:

Image by Author: Parquet files from single partition

Let’s run some select queries to understand how Iceberg uses the manifest list to perform partition pruning for select queries with a predicate clause. In total there are 127 Parquet files in 12 folders, covering the periods 1992-01-01 to 1992-06-30 and 1993-01-01 to 1993-06-30. First, let’s run a select query without any predicate clause and look at the query plan.

SELECT CUSTOMER_NAME,
       BRAND,
       SUM(QUANTITY) AS TOTAL_QUANTITY,
       SUM(TOTAL_PRICE) AS TOTAL_PRICE
FROM IcebergDB.InvoiceDetails
GROUP BY CUSTOMER_NAME, BRAND
ORDER BY TOTAL_PRICE DESC;

Below is the query plan for the Spark job with no filters in the select query:

Image by Author: Query plan without predicate clause

As you can see, this query scanned all the files; no pruning occurred at either the directory or the file level. Now let’s add a filter on the date column, which the table is partitioned on by month, and look at the query plan again:

SELECT CUSTOMER_NAME,
       BRAND,
       SUM(QUANTITY) AS TOTAL_QUANTITY,
       SUM(TOTAL_PRICE) AS TOTAL_PRICE
FROM IcebergDB.InvoiceDetails
WHERE ORDERED_DATE BETWEEN '1992-01-01' AND '1992-02-15'
GROUP BY CUSTOMER_NAME, BRAND
ORDER BY TOTAL_PRICE DESC;

Below is the query plan for the Spark job running the query with a predicate clause on ORDERED_DATE.

Image by Author: Query plan with predicate clause

As you can see from the query plan above, the query read only 4 of the 127 Parquet files, all in the first two partitions (1992-01 and 1992-02). In this scenario, directory-level pruning first skips the folders that are not required based on the predicate clause, and file-level pruning then occurs based on the file-level statistics.
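The two pruning levels can be sketched together: first drop whole month partitions whose partition value falls outside the predicate, then apply the file-level bound checks within the surviving partitions. The file names and stats below are made up for illustration:

```python
from datetime import date

def months_since_epoch(d: date) -> int:
    """Iceberg-style months() partition value: whole months since 1970-01-01."""
    return (d.year - 1970) * 12 + (d.month - 1)

# Hypothetical layout: partition month -> [(file, lower bound, upper bound), ...]
partitions = {
    months_since_epoch(date(1992, 1, 1)): [("f1.parquet", date(1992, 1, 1), date(1992, 1, 31))],
    months_since_epoch(date(1992, 2, 1)): [("f2.parquet", date(1992, 2, 1), date(1992, 2, 15)),
                                           ("f3.parquet", date(1992, 2, 16), date(1992, 2, 29))],
    months_since_epoch(date(1992, 3, 1)): [("f4.parquet", date(1992, 3, 1), date(1992, 3, 31))],
}

def plan_scan(lo: date, hi: date):
    """Return the data files a query with ORDERED_DATE BETWEEN lo AND hi must read."""
    keep = []
    for month, files in partitions.items():
        # Level 1: directory pruning on the partition value.
        if month < months_since_epoch(lo) or month > months_since_epoch(hi):
            continue
        # Level 2: file pruning on the column's lower/upper bounds.
        for name, f_lo, f_hi in files:
            if f_hi < lo or f_lo > hi:
                continue
            keep.append(name)
    return keep

# March is removed by directory pruning; f3 is removed by file pruning.
print(plan_scan(date(1992, 1, 1), date(1992, 2, 15)))  # ['f1.parquet', 'f2.parquet']
```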

With Iceberg Tables, you can expect query compilation (planning) times to be much lower than query run times. In the next post we will discuss Iceberg’s hidden partitions and partition evolution. As you read this series, we hope you will be truly amazed by some of the great features Apache Iceberg Tables have to offer.

Conclusion

In this blog post we have seen what makes up Iceberg Table metadata, and gained an understanding of how the table stats in the manifest files are used while queries execute at runtime. Iceberg is a great choice when the requirement is to keep data in an open format, as its interoperability lets multiple compute engines work on the same data set with great performance.

If you’d like to learn more about Iceberg, the Apache Iceberg documentation and table spec are good places to start.
