Published in Geek Culture

As cool as Iceberg

source: https://iceberg.apache.org/

Where does the table format fit

Apache Iceberg is a layer on top of file formats such as Parquet or ORC. The data files sit in storage, and the compute engine reads and writes them through the table format rather than directly.

Introduction

Apache Iceberg is an open table format for huge analytic datasets. Iceberg adds tables to compute engines including Spark, Trino, PrestoDB, Flink and Hive using a high-performance table format that works just like a SQL table.

Background

Apache Iceberg was created at Netflix and later open sourced and donated to the Apache Software Foundation.

Goals of Iceberg

Iceberg tries to address many of the pain points of the current generation of big data platforms.

  1. Support for in-place file formats. How can I get the best performance out of the data I already have, without changing the pipeline or the file format?
  2. Schema evolution. Schemas change over time. How can we support schema evolution without breaking downstream applications?
  3. Atomic operations in big data. Achieving ACID compliance is a big ask in the big data world.
  4. Getting rid of directory and file listing. Every query requires listing the files and then discarding some of them based on the footer (Parquet).
  5. Altering partitioning columns over time, while still supporting both the previous and the new partitioning columns.
  6. Isolation between reads and writes.
  7. Implicit partitioning. I expect the engine to take care of my partitioning columns.
  8. Versioning and time travel.

Iceberg Answers

1. Support to in place file formats

As already discussed, Iceberg is not a file format but a table format, so it works on top of the files you already have.

2. Schema evolution

Iceberg supports ADD, DROP, RENAME, UPDATE and REORDER of columns.
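One way to see why such changes need not break downstream readers is that Iceberg tracks each column by a unique field ID rather than by name or position. The sketch below is a toy model of that idea in plain Python, not Iceberg's API: `Schema`, `add`, `rename`, `drop` and `read_row` are hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Schema:
    """Toy schema: maps a stable field ID to the current column name."""
    columns: dict = field(default_factory=dict)  # field_id -> name
    next_id: int = 1

    def add(self, name):
        # A new column always gets a fresh ID, never a reused one.
        field_id = self.next_id
        self.columns[field_id] = name
        self.next_id += 1
        return field_id

    def rename(self, old, new):
        # Only the name changes; the ID stored with the data stays valid.
        for field_id, name in self.columns.items():
            if name == old:
                self.columns[field_id] = new
                return
        raise KeyError(old)

    def drop(self, name):
        for field_id, col in list(self.columns.items()):
            if col == name:
                del self.columns[field_id]
                return
        raise KeyError(name)


def read_row(schema, stored):
    """Project a stored row (keyed by field ID) onto the current schema."""
    return {name: stored.get(fid) for fid, name in schema.columns.items()}


schema = Schema()
schema.add("id")
schema.add("name")
old_row = {1: 42, 2: "alice"}        # written before any schema change

schema.rename("name", "full_name")   # RENAME: old data is still found by ID
schema.add("country")                # ADD: old rows read the new column as None
schema.drop("id")                    # DROP: the column disappears from reads
row = read_row(schema, old_row)
```

Because IDs are never reused, adding a new column called `id` later would not resurrect the values of the dropped one.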

3. Atomic Operations

Writes always run in isolation, without impacting concurrent reads or the current schema in the metadata.
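A rough way to picture this: a writer prepares new table metadata off to the side and then atomically swaps a single "current metadata" pointer, retrying if someone else committed first. The following is a minimal sketch of that optimistic, compare-and-swap style commit, under the assumption of an in-memory catalog. `Catalog`, `commit` and `append` are hypothetical names, not Iceberg classes.

```python
import threading


class Catalog:
    """Toy catalog holding one pointer to the table's current metadata."""

    def __init__(self):
        self._lock = threading.Lock()
        # (version name, immutable tuple of data files)
        self.current = ("v0", ())

    def load(self):
        return self.current

    def commit(self, expected, new):
        # Compare-and-swap: succeed only if nobody has committed
        # since `expected` was loaded.
        with self._lock:
            if self.current is not expected:
                return False
            self.current = new
            return True


def append(catalog, version, data_file, max_retries=5):
    """Optimistic write: build new metadata, then try to swap the pointer."""
    for _ in range(max_retries):
        base = catalog.load()
        new = (version, base[1] + (data_file,))
        if catalog.commit(base, new):
            return new
    raise RuntimeError("too many concurrent commits")


catalog = Catalog()
reader_view = catalog.load()          # a reader pins the version it started with
append(catalog, "v1", "a.parquet")
append(catalog, "v2", "b.parquet")
stale_commit_ok = catalog.commit(reader_view, ("vX", ()))  # based on old state
```

The reader that loaded `v0` keeps seeing `v0` no matter how many commits follow, and a commit based on stale state is rejected instead of silently overwriting newer data.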

4. To get rid of directory and file listing

With a traditional layout, the engine has to list the directories of a partitioned table, list all the files within each partition, and then use the file footers to exclude the files that do not match the filter clause.
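Iceberg avoids this by keeping the list of data files, together with per-file column statistics, in its own metadata, so a query can be planned without listing directories or opening footers. Here is a minimal sketch of that pruning step, assuming a single integer column `id` with min/max bounds per file. `ManifestEntry` and `plan_files` are hypothetical names for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestEntry:
    """Toy manifest entry: a data file path plus min/max stats for `id`."""
    path: str
    min_id: int
    max_id: int


def plan_files(manifest, lo, hi):
    """Return files whose [min_id, max_id] range can overlap lo <= id <= hi."""
    return [e.path for e in manifest if e.max_id >= lo and e.min_id <= hi]


manifest = [
    ManifestEntry("data/f1.parquet", 1, 100),
    ManifestEntry("data/f2.parquet", 101, 200),
    ManifestEntry("data/f3.parquet", 201, 300),
]

# Equivalent of WHERE id BETWEEN 150 AND 220:
# only metadata is consulted, no directory is listed.
selected = plan_files(manifest, 150, 220)
```

Only `f2` and `f3` are handed to the engine; `f1` is skipped without ever being touched.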

5. Altering the partitioning columns

The architecture we defined years ago will not be able to support all of our future use cases, so the partitioning has to be able to change along with them.

6. Isolation between read and write

Iceberg maintains snapshots of the files that change as time progresses. This allows reads and writes to run in parallel but in isolation.

7. Implicit partitioning

Defining the partitioning up front shouldn't be mandatory, as my requirements will change over time. Letting the engine work out the partitioning at run time can improve query performance significantly.
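The idea can be sketched as follows: the partition value is derived from a regular column through a transform (here, the day of a timestamp), and the same transform is applied to the query filter so that whole partitions are skipped. This is a toy model in plain Python, assuming a table stored as a dict of partitions. `day_transform`, `write` and `scan` are hypothetical names, not Iceberg functions.

```python
from datetime import datetime


def day_transform(ts):
    """Partition transform: derive the partition value from the source column."""
    return ts.date()


def write(table, row):
    # The writer supplies only event_ts; the partition value is derived.
    table.setdefault(day_transform(row["event_ts"]), []).append(row)


def scan(table, start, end):
    """Filter on event_ts, pruning partitions via the same transform."""
    first, last = day_transform(start), day_transform(end)
    rows, scanned = [], []
    for part, part_rows in table.items():
        if first <= part <= last:
            scanned.append(part)
            rows.extend(r for r in part_rows if start <= r["event_ts"] <= end)
    return rows, scanned


table = {}
write(table, {"id": 1, "event_ts": datetime(2022, 1, 1, 9, 0)})
write(table, {"id": 2, "event_ts": datetime(2022, 1, 2, 10, 0)})
write(table, {"id": 3, "event_ts": datetime(2022, 1, 3, 11, 0)})

# The query mentions only event_ts, never a partition column.
rows, scanned = scan(
    table, datetime(2022, 1, 2, 0, 0), datetime(2022, 1, 2, 23, 59)
)
```

Neither the writer nor the query names a partition column, yet only one of the three partitions is scanned.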

8. Time travel

As we have already discussed, Iceberg supports versioning. This makes time travel possible whenever you need to go back to a previous version.
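Conceptually, time travel is a lookup in the snapshot history: find the latest snapshot committed at or before the requested time and read the files it references. The sketch below models that with a sorted list and a binary search. `SnapshotLog`, `commit` and `as_of` are hypothetical names, and the timestamps are made-up millisecond values.

```python
import bisect


class SnapshotLog:
    """Toy snapshot history: each commit records (timestamp_ms, id, files)."""

    def __init__(self):
        self._entries = []  # appended in commit order, so timestamps ascend

    def commit(self, timestamp_ms, files):
        snapshot_id = len(self._entries) + 1
        self._entries.append((timestamp_ms, snapshot_id, tuple(files)))
        return snapshot_id

    def as_of(self, timestamp_ms):
        """Files of the latest snapshot committed at or before timestamp_ms."""
        stamps = [e[0] for e in self._entries]
        idx = bisect.bisect_right(stamps, timestamp_ms) - 1
        if idx < 0:
            raise LookupError("no snapshot at or before that time")
        return self._entries[idx][2]


log = SnapshotLog()
log.commit(1_000, ["a.parquet"])
log.commit(2_000, ["a.parquet", "b.parquet"])
log.commit(3_000, ["b.parquet", "c.parquet"])   # a.parquet was replaced

files_then = log.as_of(2_500)   # state of the table "as of" t=2500
files_now = log.as_of(9_999)
```

Old snapshots stay readable only as long as they and their data files are retained.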

Metadata Architecture

Demo Time

Let’s create a dummy table, write it as Iceberg, and list the folder path.

Some tips for using Iceberg

Increasing number of snapshots: over time you keep adding new snapshots, and the metadata files keep growing along with them.
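The usual remedy is to expire old snapshots periodically. The logic can be sketched as below: drop snapshots older than a cutoff, always keep the most recent ones, and delete only the data files that no retained snapshot still references. This is a toy model, not Iceberg's maintenance API. `expire_snapshots`, `orphaned_files` and the parameters `older_than_ms` and `retain_last` are names assumed here for illustration.

```python
def expire_snapshots(snapshots, older_than_ms, retain_last=1):
    """Split snapshots into (kept, expired).

    A snapshot is expired only if it is older than the cutoff AND is not
    among the `retain_last` most recent ones.
    """
    ordered = sorted(snapshots, key=lambda s: s["ts"])
    protected = ordered[-retain_last:] if retain_last > 0 else []
    kept, expired = [], []
    for snap in ordered:
        if snap["ts"] < older_than_ms and snap not in protected:
            expired.append(snap)
        else:
            kept.append(snap)
    return kept, expired


def orphaned_files(kept, expired):
    """Files referenced only by expired snapshots are safe to delete."""
    live = {f for s in kept for f in s["files"]}
    dead = {f for s in expired for f in s["files"]}
    return sorted(dead - live)


snapshots = [
    {"id": 1, "ts": 1_000, "files": ["a.parquet"]},
    {"id": 2, "ts": 2_000, "files": ["a.parquet", "b.parquet"]},
    {"id": 3, "ts": 3_000, "files": ["b.parquet", "c.parquet"]},
]
kept, expired = expire_snapshots(snapshots, older_than_ms=2_500, retain_last=1)
to_delete = orphaned_files(kept, expired)
```

Note the trade-off: once a snapshot is expired, time travel to it is no longer possible.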

A new tech publication by Start it up (https://medium.com/swlh).

Ajith Shetty

Bigdata Engineer with a love for BigData, Analytics, Cloud and Infrastructure. Want to talk more? Ping me on LinkedIn: https://www.linkedin.com/in/ajshetty28/