Version control your data lake house using Apache Iceberg and Project Nessie

Oliver Rise Thomsen
8 min read · Sep 19, 2022


Combined Logos of Apache Iceberg and Project Nessie

Imagine how a git-like experience can improve the data lake house

Introduction

In the modern data world, the data lake is becoming a key part of every data team. The data lake provides the data team with an ocean of flexibility, which is awesome but should be handled with care so that it does not become a data swamp. A key reason this happens so easily is that data has state, which makes it very hard to track and version control. A data team therefore needs a strict set of policies to govern it and establish good DataOps, which is hard.

At the company I work at, we have recently seen a huge increase in demand for the data warehouse combo of Snowflake and dbt (Data Build Tool). A significant factor in this adoption is that Snowflake has Zero-Copy Cloning and dbt can version control your data models. This means that developers can easily develop their data models in a separate environment while still working on production data. This is very valuable.

These innovations are slowly coming to the data lake as well, which means that developing data lake houses is becoming easier. The first part of this is the development of table formats like Delta Lake, Hudi and Iceberg, which are seeing wide adoption. The second part is data version control, which enables git-like version control on the data lake. Examples of this are LakeFS and Project Nessie.

In the rest of this blog, we will explore how Apache Iceberg and Project Nessie complement each other to give us a git-like experience on the data lake.

Apache Iceberg is a modern table format for the data lake built for high performance, reliability and simplicity. It keeps track of the underlying data files in a table and provides an efficient way to scan them. It supports full SQL syntax, schema evolution, time travel and table optimization. This makes it possible to enjoy many of the advantages of a tool like Snowflake on a data lake, which can then be consumed by any tool, e.g. Spark, Flink, Trino or Dremio.

Project Nessie is a transactional catalog for data lakes that provides git-like version control for data and can be used with most popular compute engines. It relies on Apache Iceberg or Delta Lake to keep track of the underlying files and builds on top of that to make branching of data possible. It does this by replacing the catalog, e.g. Hive, that stores the tables.

Combining these two technologies makes it possible to time travel on your data, do full schema evolution, and have a git-like experience with your data, which can transform the way development happens on the data lake.

The git-like version control makes it possible to create branches on the data that can be merged into the main branch again. This can be used in many ways but the three main use cases are listed here:

  • Development — Whenever a new table or a new iteration of one needs to be created, a development branch is created where the development and testing are done. When finished, it is merged into the main branch, just like code.
  • ETL/ELT — The load jobs can be placed on a separate branch, meaning that the whole load job runs separately from the main branch. This ensures that the whole load is complete before it appears in the main table.
  • Analysis — When an analyst wants to explore the data, a new branch can be made just for this. This ensures that they can do their analysis in a safe environment.

The rest of this blog will highlight some of these features with Apache Spark, Apache Iceberg and Project Nessie.

Setup

I have extended the docker-spark-iceberg setup from Tabular to use as a test environment. It is a docker-compose setup. The default PostgreSQL catalog is replaced with Project Nessie, and the Nessie Spark SQL extensions are installed, which enable Nessie SQL commands.
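For reference, a minimal sketch of what the Spark session configuration could look like in such a setup. The catalog name (nessie), the Nessie URI and the warehouse path are assumptions that have to match your own docker-compose environment:

```python
from pyspark.sql import SparkSession

# Sketch of a Spark session wired up for Iceberg + Nessie.
# Catalog name, URI and warehouse path are assumptions for this demo setup.
spark = (
    SparkSession.builder.appName("iceberg-nessie-demo")
    # Enable the Iceberg SQL extensions plus the Nessie SQL extensions
    # (CREATE BRANCH, USE REFERENCE, MERGE BRANCH, ...)
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
        "org.projectnessie.spark.extensions.NessieSparkSessionExtensions",
    )
    # Register a catalog called "nessie" backed by the Nessie server
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/api/v1")
    .config("spark.sql.catalog.nessie.ref", "main")
    .config("spark.sql.catalog.nessie.warehouse", "s3a://warehouse/")
    .getOrCreate()
)
```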

Please let me know if you want to see a post or a repository on it in the future.

Development

The first thing we will dive into is how we can use this functionality for development. We will start by setting up the development branch.

List of branches in the Nessie catalog
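A minimal sketch of how the development branch can be created and inspected with the Nessie SQL extensions (the branch name dev and the catalog name nessie are just the ones used in this demo):

```python
# Create a development branch from main and switch the session to it
spark.sql("CREATE BRANCH IF NOT EXISTS dev IN nessie FROM main")
spark.sql("USE REFERENCE dev IN nessie")

# List all branches and tags known to the Nessie catalog
spark.sql("LIST REFERENCES IN nessie").show()
```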

We can now move on to the development itself. In this demo case, we will just import a single table into our lake house. For this, we will use the famous NYC taxi data. The data is loaded with PySpark and registered as a temporary view called taxis_dev. We can now create the Iceberg table with partitioning.

Create table
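Roughly, the load and table creation could look like the sketch below; the Parquet file path and the nyc namespace are illustrative, and the partitioning column comes from the NYC yellow taxi schema:

```python
# Load the NYC taxi data with PySpark and expose it as a temporary view
taxis = spark.read.parquet("/home/iceberg/data/yellow_tripdata_2021-04.parquet")
taxis.createOrReplaceTempView("taxis_dev")

# Create a partitioned Iceberg table on the dev branch from the view
spark.sql("CREATE NAMESPACE IF NOT EXISTS nessie.nyc")
spark.sql("""
    CREATE TABLE nessie.nyc.taxis
    USING iceberg
    PARTITIONED BY (days(tpep_pickup_datetime))
    AS SELECT * FROM taxis_dev
""")
```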

We can now take a look at our newly created table. In addition, we can describe it and see some of the underlying table information e.g. partitioning and physical location.

Sample of the data
Table Description
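Both can be done with plain Spark SQL, something along these lines:

```python
# Peek at the data and at the table metadata (partitioning, physical location, ...)
spark.sql("SELECT * FROM nessie.nyc.taxis LIMIT 5").show()
spark.sql("DESCRIBE EXTENDED nessie.nyc.taxis").show(truncate=False)
```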

After loading the data into a table in the development branch, we can use the commands below to verify that our table only exists in the development branch.
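A simple way to check this is to switch the session back to main and list the tables, assuming the same catalog and namespace names as above:

```python
# On the dev branch the table is visible ...
spark.sql("USE REFERENCE dev IN nessie")
spark.sql("SHOW TABLES IN nessie.nyc").show()

# ... but on main nothing has been committed yet, so the table (and namespace)
# does not show up there
spark.sql("USE REFERENCE main IN nessie")
spark.sql("SHOW TABLES IN nessie").show()

# Switch back to the dev branch before merging
spark.sql("USE REFERENCE dev IN nessie")
```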

We are now satisfied with our work on the dev branch and can proceed to merge it into main.
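The merge is a single Nessie command:

```python
# Merge the dev branch into main and switch the session back to main
spark.sql("MERGE BRANCH dev INTO main IN nessie")
spark.sql("USE REFERENCE main IN nessie")
```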

With the small development example shown above, we can imagine how the workflow could be eased a lot by introducing data branching.

ETL

Another great way of utilizing the branching capabilities is to use it for ETL, ELT or just batch processing. An ETL branch can be created where the data load will take place. When this job is successful, the ETL branch can be merged into main. This ensures that if any errors appear in the load, they will not affect anything on the main branch. This operation is very similar to the development workflow performed above; data will just be inserted into the table instead of the table being created. It is therefore left out of the post here, but it can be found in the complementary notebook.

Analytics

The branching can also be used for analysis. It makes it possible to create a separate environment where the analysis can be performed and where the data won't change, as it is frozen at the point in time when the analysis is performed. This means that there is no need to create a separate data extract and make it available to the analyst via email, a shared folder or the like. Just let them enjoy the lake house. An example of this can also be found in the complementary notebook.

Rename, Update and Delete

Lastly, we can use it for refactoring tables in the warehouse. Apache Iceberg has great capabilities for this and combined with Project Nessie we can ensure that this happens in a separate environment. We start by creating and switching to a new branch.
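Again assuming the Nessie SQL extensions, the new branch could be created like this (the branch name refactor is just an example):

```python
# Create a branch for the refactoring work and switch to it
spark.sql("CREATE BRANCH IF NOT EXISTS refactor IN nessie FROM main")
spark.sql("USE REFERENCE refactor IN nessie")
```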

We now perform some changes to the table. We will rename some columns, drop a column and delete some data.
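A sketch of such changes on the taxi table; the column names are taken from the NYC yellow taxi schema and the delete condition is just an example:

```python
# Rename columns, drop a column and delete some rows on the refactor branch
spark.sql("ALTER TABLE nessie.nyc.taxis RENAME COLUMN fare_amount TO fare")
spark.sql("ALTER TABLE nessie.nyc.taxis RENAME COLUMN trip_distance TO distance")
spark.sql("ALTER TABLE nessie.nyc.taxis DROP COLUMN airport_fee")
spark.sql("DELETE FROM nessie.nyc.taxis WHERE fare < 0")
```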

We can now ensure that our table looks like it should.

Lastly, we can merge the new changes into the main branch where they will then take effect.

Extras

Lastly, we will take a look at some extra functions from Iceberg and Nessie that could be useful.

We can take a look at the table history which shows when a snapshot is created and which snapshot is its predecessor.

Table History
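The history is exposed as an Iceberg metadata table:

```python
# Each row is a snapshot with a reference to its parent, making the lineage visible
spark.sql("SELECT * FROM nessie.nyc.taxis.history").show(truncate=False)
```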

We can take a look at the snapshot metadata.

Snapshots
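This also lives in an Iceberg metadata table:

```python
# Snapshot metadata: when it was committed, which operation produced it, summary stats, ...
spark.sql("""
    SELECT snapshot_id, committed_at, operation, summary
    FROM nessie.nyc.taxis.snapshots
""").show(truncate=False)
```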

We can revert to a snapshot either by its id or a specific timestamp. This can be extremely useful as a fail-safe or for auditing.

Table Rollback
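Both variants are Iceberg procedures; the snapshot id and the timestamp below are placeholders:

```python
# Roll back to a specific snapshot id ...
spark.sql("CALL nessie.system.rollback_to_snapshot('nyc.taxis', 1234567890123456789)")

# ... or to the table state at a specific point in time
spark.sql("CALL nessie.system.rollback_to_timestamp('nyc.taxis', TIMESTAMP '2022-09-19 10:00:00')")
```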

We can rewrite the data files of our table for maintenance.

Table Rewrite
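This is another Iceberg procedure:

```python
# Compact small data files into larger ones for better scan performance
spark.sql("CALL nessie.system.rewrite_data_files(table => 'nyc.taxis')").show()
```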

We can see the log of the table, which is very useful for figuring out what has happened on it.
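One way to do this is via Nessie's commit log (shown here as a sketch; the notebook may use a different command):

```python
# Show the Nessie commit log for the current reference
spark.sql("SHOW LOG IN nessie").show(truncate=False)
```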

We can list the partitions, which is useful for checking whether we are partitioning over a column that makes sense, by looking at the row count in each partition.
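The partitions are exposed as yet another metadata table:

```python
# Row and file counts per partition — handy for sanity-checking the partitioning scheme
spark.sql("""
    SELECT partition, record_count, file_count
    FROM nessie.nyc.taxis.partitions
""").show()
```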

Lastly, we can remove old snapshots for maintenance and privacy/GDPR. This has to be done in two operations: first, the snapshots are removed from the catalog, and secondly, the unreferenced files are removed physically from storage. In the first operation, the second argument is a timestamp that tells Iceberg that snapshots older than this should be removed, and the third argument tells Iceberg the minimum number of snapshots to retain.
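A sketch of the two operations, assuming the Iceberg expire_snapshots and remove_orphan_files procedures are what is used here; the timestamp and retention count are placeholders:

```python
# 1) Expire snapshots older than the timestamp, but always keep at least 5
spark.sql("""
    CALL nessie.system.expire_snapshots(
        table => 'nyc.taxis',
        older_than => TIMESTAMP '2022-09-01 00:00:00',
        retain_last => 5)
""")

# 2) Physically remove files that are no longer referenced by any table metadata
spark.sql("CALL nessie.system.remove_orphan_files(table => 'nyc.taxis')").show()
```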

Sum Up/Future

To sum up, I have really enjoyed my first-hand experience with the combination of these two projects. I have not tried them in production yet, but I hope that I get to do that in the future, as I can really see the potential for improved DataOps.

I am a huge fan of dbt, and I can definitely see how it could be a cool combination to use. dbt offers great version control of your data models, and Nessie/Iceberg could provide the flexibility of the Snowflake/dbt combo on the data lake, e.g. by giving each developer their own schema to work within.

Now that a git-like experience has been introduced, you may ask: what about merge conflicts? These will occur just like in git if there is a conflict. You will need to resolve them, and this is how it should be.

You may also ask what happens if you try to connect to the data with a tool that does not support Nessie. In this case, you will just be shown the main branch, so you should not worry about your data getting messed up.

I hope that you have enjoyed this overview of Apache Iceberg and Project Nessie. I have the highest respect for the developers of both projects.

Disclaimer: I have only tried both projects briefly and have not used them in production, so I am by no means an expert on the subject. I have no affiliation with either project.

Sources/Inspiration

Distributed Transactions on the Data Lake w Project Nessie

Tabular — Docker, Spark, and Iceberg: The Fastest Way to Try Iceberg!

Nessie Binder Demos



Oliver Rise Thomsen

A master's student with a big interest in Data Engineering and Data Science. Working @Intellishore