Learning Apache Iceberg — looking at append, update and delete operations

Marin Aglić
12 min read · Feb 5, 2024


Bridging the gap between what I knew and what I wanted to learn. This is the fourth in a series of articles about Apache Iceberg. In the previous article, we connected Iceberg to Minio S3. In this one, we’ll take (another) look at what happens when performing append, delete, and update operations. You can find the links to the previous articles: first, second, third.

Image generated by ChatGPT-4

Introduction

The questions presented in the first article basically set up this series of articles on Apache Iceberg. At the beginning, I was quite confused about what Apache Iceberg is. However, I was still interested in the technology and wanted to learn more. So, I posted some questions I had about it:

  1. How does Iceberg fit into Spark? Is it a framework on top of Spark? Something else?
  2. Can it run remotely from Spark? Like another service that can communicate over a port?
  3. How is the data ingested into these so-called Iceberg tables? What happens, then?
  4. What would it mean to modify the data in these tables?
  5. Can I use the standalone cluster that I already had prepared?

In the first article, I wrote that Iceberg is a package for Spark. But there are nicer descriptions (this example is from Dremio [1]):

- A table format specification
- A set of APIs and libraries for engines to interact with tables following that specification

The second article (found here):

  1. shows how to store the catalog to Postgres;
  2. provides an in-depth look at how the metadata is stored in metadata, manifest list, and manifest files.

For the first two stories, both the data and metadata were stored on the file system.

The third article (found here) shows how to connect Iceberg to Minio S3 and store the data and metadata files in a bucket.

In this article, I hope to answer the remaining questions through examples of creating a table, and adding, deleting and updating the data in the table.

We’ll start by adding a new notebook to see what happens step by step. I’m also using the same source code as I used in the previous stories. It can be found here.

NOTE: this story actually started as the second part of a much larger one, which I decided to split in two. The first part covered connecting Iceberg to Minio S3. This second part looks into appending, deleting, and updating data.

The Jupyter notebook used

In this section, we’ll present the notebook used for this story. The notebook covers creating the table and adding, updating, and deleting data. The operations presented should help answer questions three and four.

Here are the main contents of the notebook:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("Jupyter").getOrCreate()

data = [
    ("James", "", "Smith", "36636", "M", 3000),
    ("Michael", "Rose", "", "40288", "M", 4000),
    ("Robert", "", "Williams", "42114", "M", 4000),
    ("Maria", "Anne", "Jones", "39192", "F", 4000),
    ("Jen", "Mary", "Brown", "", "F", -1),
]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True),
])

df = spark.createDataFrame(data=data, schema=schema)
df.writeTo("db.minio").createOrReplace()

new_data = [("New James", "", "Smith", "36646", "M", 50)]
df = spark.createDataFrame(data=new_data, schema=schema)
df.writeTo("db.minio").append()

res = spark.sql("SELECT * FROM db.minio")
res.show()

spark.sql("DELETE FROM db.minio WHERE salary < 0")
table = spark.table("db.minio")
table.show()

spark.sql("UPDATE db.minio SET salary = 220 WHERE id == '36646'")
table.show()
spark.stop()

The third question

The third question — How is the data ingested into these so-called Iceberg tables? What happens, then?

We answer this question through two operations:

  1. creating a table
  2. adding a row to the table.

Creating a table

Iceberg stores the data as Parquet files. We get the table once we execute this line:

df.writeTo("db.minio").createOrReplace()

This line creates two data files and three metadata files. The five rows initialised with:

data = [
    ("James", "", "Smith", "36636", "M", 3000),
    ("Michael", "Rose", "", "40288", "M", 4000),
    ("Robert", "", "Williams", "42114", "M", 4000),
    ("Maria", "Anne", "Jones", "39192", "F", 4000),
    ("Jen", "Mary", "Brown", "", "F", -1),
]

are split between the two data files. One data file contains three rows:

{"firstname":"Robert","middlename":"","lastname":"Williams","id":"42114","gender":"M","salary":4000}
{"firstname":"Maria","middlename":"Anne","lastname":"Jones","id":"39192","gender":"F","salary":4000}
{"firstname":"Jen","middlename":"Mary","lastname":"Brown","id":"","gender":"F","salary":-1}

while the other contains the remaining two:

{"firstname":"James","middlename":"","lastname":"Smith","id":"36636","gender":"M","salary":3000}
{"firstname":"Michael","middlename":"Rose","lastname":"","id":"40288","gender":"M","salary":4000}

As for the metadata files — there are three files:

  1. metadata json file (in my case the name is: 00000-f5b4c31a-cf6e-4723-a086-0c66c6a102f4.metadata.json)
  2. manifest list avro file (starts with snap)
  3. manifest file.

The metadata files in the Minio S3 bucket

The metadata file references the manifest list file, which in turn references the collection of manifest files.

The metadata file holds the data for a collection of snapshots, and also points to the current snapshot id:

"current-snapshot-id" : 7426647800932772370,
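Per the Iceberg spec, each snapshot entry in the metadata file carries a manifest-list path, so resolving the current manifest list is a simple lookup. Here is a minimal Python sketch using plain dicts that mimic the parsed JSON; it is illustrative only, not the actual Iceberg reader code:

```python
# Sketch of how a reader resolves the current snapshot's manifest list
# from a parsed metadata.json. The dict mirrors the files in this story.
metadata = {
    "current-snapshot-id": 7426647800932772370,
    "snapshots": [
        {
            "snapshot-id": 7426647800932772370,
            "manifest-list": "s3://iceberg-data/db/minio/metadata/"
                             "snap-7426647800932772370-1-b10658da-8308-41c7-9209-217725fa8660.avro",
        },
    ],
}

def current_manifest_list(meta: dict) -> str:
    """Return the manifest-list path of the current snapshot."""
    current_id = meta["current-snapshot-id"]
    for snapshot in meta["snapshots"]:
        if snapshot["snapshot-id"] == current_id:
            return snapshot["manifest-list"]
    raise KeyError(f"snapshot {current_id} not found in metadata")

print(current_manifest_list(metadata))  # the snap-<snapshot-id>-... path
```

This is also why the manifest list file name starts with the snapshot id, as we see next.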

If you take a look at your manifest list file (starts with snap), you should see (to the best of my knowledge) that its name follows the pattern snap-<snapshot-id>-<etc...>. In my case, the name of the manifest list file is:

snap-7426647800932772370-1-b10658da-8308-41c7-9209-217725fa8660.json

In the manifest list file, there should be an array of dictionaries with key-value pairs. At this moment, I have only one such dictionary with a reference to the manifest file:

[
  {
    "manifest_path": "s3://iceberg-data/db/minio/metadata/b10658da-8308-41c7-9209-217725fa8660-m0.avro",
    ...
  }
]

As for the manifest file (b10658da-8308-41c7-9209-217725fa8660-m0.avro), it contains an array of dictionaries, each of which references a data file. Here are (most of) the contents:

[
  {
    "status": 1,
    "snapshot_id": 7426647800932772370,
    "sequence_number": null,
    "file_sequence_number": null,
    "data_file": {
      "content": 0,
      "file_path": "s3://iceberg-data/db/minio/data/00000-0-9a99aa40-5d08-4d7e-850a-d307d23f5c0f-00001.parquet",
      "file_format": "PARQUET",
      "partition": {},
      "record_count": 2,
      "file_size_in_bytes": 1641,
      "column_sizes": [omitted],
      "value_counts": [omitted],
      "null_value_counts": [omitted],
      "nan_value_counts": [],
      "lower_bounds": [omitted],
      "upper_bounds": [omitted],
      "key_metadata": null,
      "split_offsets": [omitted],
      "equality_ids": null,
      "sort_order_id": 0
    }
  },
  {
    "status": 1,
    "snapshot_id": 7426647800932772370,
    "sequence_number": null,
    "file_sequence_number": null,
    "data_file": {
      "content": 0,
      "file_path": "s3://iceberg-data/db/minio/data/00001-1-9a99aa40-5d08-4d7e-850a-d307d23f5c0f-00001.parquet",
      "file_format": "PARQUET",
      "partition": {},
      "record_count": 3,
      "file_size_in_bytes": 1724,
      "column_sizes": [omitted],
      "value_counts": [omitted],
      "null_value_counts": [omitted],
      "nan_value_counts": [],
      "lower_bounds": [omitted],
      "upper_bounds": [omitted],
      "key_metadata": null,
      "split_offsets": [omitted],
      "equality_ids": null,
      "sort_order_id": 0
    }
  }
]

Notice that the manifest file also keeps track of the snapshot id in each dictionary, as well as the number of records in and size of each data file.
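Because every manifest entry carries per-file statistics like record_count, a reader can derive table-level numbers from metadata alone, without opening a single Parquet file. A toy sketch over the two entries above (plain dicts standing in for the parsed Avro):

```python
# Per-file statistics live in the manifest entries, so row and byte counts
# can be computed purely from metadata. Values copied from the manifest
# contents shown above.
manifest_entries = [
    {"status": 1, "data_file": {"record_count": 2, "file_size_in_bytes": 1641}},
    {"status": 1, "data_file": {"record_count": 3, "file_size_in_bytes": 1724}},
]

total_rows = sum(e["data_file"]["record_count"] for e in manifest_entries)
total_bytes = sum(e["data_file"]["file_size_in_bytes"] for e in manifest_entries)
print(total_rows, total_bytes)  # 5 3365
```

These same statistics (together with the lower_bounds and upper_bounds fields) are what lets engines skip whole data files when planning a scan.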

Adding a row to a table

Now, let’s try appending a single row (this part of the notebook):

new_data = [("New James","","Smith","36646", "M", 50)]
df = spark.createDataFrame(data=new_data, schema=schema)
df.writeTo("db.minio").append()

What happens? We get some new metadata files, and a new data file.

The new data file will contain the new row:

{"firstname":"New James","middlename":"","lastname":"Smith","id":"36646","gender":"M","salary":50}

The data files in the Minio S3 bucket after adding a new row

As for the metadata files, we get three new metadata files (as can be seen in the image).

The metadata files after adding a new row to the table

First, we get a new metadata file. The new metadata file has the whole snapshot history and a reference to the current snapshot:

"current-snapshot-id" : 2032559078621466157,
...
"snapshot-log" : [ {
"timestamp-ms" : 1706516546185,
"snapshot-id" : 7426647800932772370
}, {
"timestamp-ms" : 1706518166494,
"snapshot-id" : 2032559078621466157
} ],
...
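The snapshot-log is an ordered history, and the current snapshot id should match its newest entry. A small sketch over the values above (plain dicts mimicking the parsed JSON; illustrative only):

```python
# The snapshot-log grows with every commit; the current-snapshot-id field
# should point at its most recent entry. Values copied from the metadata
# file shown above.
snapshot_log = [
    {"timestamp-ms": 1706516546185, "snapshot-id": 7426647800932772370},
    {"timestamp-ms": 1706518166494, "snapshot-id": 2032559078621466157},
]
current_snapshot_id = 2032559078621466157

latest = max(snapshot_log, key=lambda entry: entry["timestamp-ms"])
assert latest["snapshot-id"] == current_snapshot_id
print(latest)
```

The older snapshot ids kept in this log are what makes features like time travel possible, since each one still resolves to a complete manifest list.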

The new manifest list file name is:

snap-2032559078621466157-1-274b9f78-7835-494a-88e7-c4a7fbc87659.json

with two dictionaries now pointing to the manifest files:

[
  {
    "manifest_path": "s3://iceberg-data/db/minio/metadata/274b9f78-7835-494a-88e7-c4a7fbc87659-m0.avro",
    ...
  },
  {
    "manifest_path": "s3://iceberg-data/db/minio/metadata/b10658da-8308-41c7-9209-217725fa8660-m0.avro",
    ...
  }
]

The new manifest file contains a reference to the new data file, as can be seen:

[
  {
    "status": 1,
    "snapshot_id": 2032559078621466157,
    "sequence_number": null,
    "file_sequence_number": null,
    "data_file": {
      "content": 0,
      "file_path": "s3://iceberg-data/db/minio/data/00007-9-012e468c-5234-40d8-8ce4-d7ec420a068f-00001.parquet",
      "file_format": "PARQUET",
      "partition": {},
      "record_count": 1,
      "file_size_in_bytes": 1636,
      ...
    }
  }
]

Conclusion — adding a row

So, when we added the new row, we got:

  1. a new data file,
  2. a new metadata file,
  3. a new manifest list file pointing to the two manifest files (one old and one new),
  4. a new manifest file pointing to the new data file.
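The bookkeeping above can be sketched in a few lines: an append never rewrites existing files; it simply produces a new manifest list that keeps every old manifest and adds one new manifest for the new data file. The file names below are the ones from this story, and the function is a simplification of what Iceberg actually writes:

```python
# Sketch of append bookkeeping: the new manifest list reuses all previous
# manifests and prepends one new manifest for the newly written data file.
old_manifest_list = ["b10658da-8308-41c7-9209-217725fa8660-m0.avro"]
new_manifest = "274b9f78-7835-494a-88e7-c4a7fbc87659-m0.avro"

def append_commit(manifest_list, added_manifest):
    """Return the next snapshot's manifest list; the old list is untouched."""
    return [added_manifest] + list(manifest_list)

new_manifest_list = append_commit(old_manifest_list, new_manifest)
print(new_manifest_list)
```

Note that the old manifest list object is left as-is, which is exactly what makes the previous snapshot still readable.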

I also covered the metadata files in more detail in the previous story here.

Answering the fourth question — deleting and updating a row

The fourth question — What would it mean to modify the data in these tables?

Deleting a row

Let’s first look at what it means to delete a row. We have the following lines in the notebook:

spark.sql("DELETE FROM db.minio WHERE salary < 0")
table = spark.table("db.minio")
table.show()

The result (of table.show()) is the following table:

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|   id|gender|salary|
+---------+----------+--------+-----+------+------+
|New James|          |   Smith|36646|     M|    50|
|   Robert|          |Williams|42114|     M|  4000|
|    Maria|      Anne|   Jones|39192|     F|  4000|
|    James|          |   Smith|36636|     M|  3000|
|  Michael|      Rose|        |40288|     M|  4000|
+---------+----------+--------+-----+------+------+

We, again, got some new data and metadata files. Here are the data files in the bucket after running the delete:

Data files in Minio S3 storage bucket after running a delete

Let’s introduce some variables to reference the data files:

  1. Data File A — 00001-1-9a99aa40-5d08-4d7e-850a-d307d23f5c0f-00001.parquet
  2. Data File B — 00000-0-9a99aa40-5d08-4d7e-850a-d307d23f5c0f-00001.parquet
  3. Data File C — 00007-9-012e468c-5234-40d8-8ce4-d7ec420a068f-00001.parquet
  4. Data File D — 00000-14-ab2a6c1e-23c1-4aa1-8d9a-09a9d7d97238-00001.parquet

The new data file D has the following contents:

{"firstname":"Robert","middlename":"","lastname":"Williams","id":"42114","gender":"M","salary":4000}
{"firstname":"Maria","middlename":"Anne","lastname":"Jones","id":"39192","gender":"F","salary":4000}

Data files A and B were created when we created the table at the beginning of the notebook. Data file C was added when we appended a new row, and data file D was created after running the Delete query.

So, the contents of data file D are the same as the contents of data file A but without the row that we deleted.
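This is copy-on-write behaviour, and it can be simulated in a few lines: the engine rewrites the affected data file without the matching rows, while the original file is left untouched on storage. A sketch using the rows of data file A (trimmed to the columns that matter for the predicate):

```python
# Copy-on-write delete, simulated: produce a new data file D that holds
# every row of data file A except those matching the DELETE predicate.
# Rows copied (abridged) from data file A above.
data_file_a = [
    {"firstname": "Robert", "salary": 4000},
    {"firstname": "Maria", "salary": 4000},
    {"firstname": "Jen", "salary": -1},
]

# DELETE FROM db.minio WHERE salary < 0  ->  keep the rows NOT matching
data_file_d = [row for row in data_file_a if not row["salary"] < 0]
print([row["firstname"] for row in data_file_d])  # ['Robert', 'Maria']
```

Data file B is not rewritten at all, because none of its rows match the predicate.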

Here are the new metadata files:

New metadata files after delete

This time, we got four new metadata files. The new metadata json file has new values for the current snapshot id and snapshot log fields. The old metadata json files are referenced under the metadata-log key.

The new manifest list file (starting with snap), now references three manifest files:

[
  {
    "manifest_path": "s3://iceberg-data/db/minio/metadata/d4223bc8-a7bd-46b3-80f3-7434fb900e74-m1.avro",
    ...
  },
  {
    "manifest_path": "s3://iceberg-data/db/minio/metadata/274b9f78-7835-494a-88e7-c4a7fbc87659-m0.avro",
    ...
  },
  {
    "manifest_path": "s3://iceberg-data/db/minio/metadata/d4223bc8-a7bd-46b3-80f3-7434fb900e74-m0.avro",
    ...
  }
]

Two of these three manifest files are new ones.

There is one manifest file that we are no longer referencing:

b10658da-8308-41c7-9209-217725fa8660-m0.avro

This is the first manifest file created, the one which pointed to data files A and B.

The two new manifest files have almost the same name, with one ending with m0 and the other m1.

The contents of the new manifest file ending with m0 are almost the same as those of the no-longer-referenced manifest file. It is as if the new manifest file replaced the old one…

Let’s look at some of the contents:

[
  {
    "status": 0,
    "snapshot_id": 7426647800932772370,
    "sequence_number": 1,
    "file_sequence_number": 1,
    "data_file": {
      "content": 0,
      "file_path": "s3://iceberg-data/db/minio/data/00000-0-9a99aa40-5d08-4d7e-850a-d307d23f5c0f-00001.parquet",
      "file_format": "PARQUET",
      "partition": {},
      "record_count": 2,
      ...
    }
  },
  {
    "status": 2,
    "snapshot_id": 7825605915503001692,
    "sequence_number": 1,
    "file_sequence_number": 1,
    "data_file": {
      "content": 0,
      "file_path": "s3://iceberg-data/db/minio/data/00001-1-9a99aa40-5d08-4d7e-850a-d307d23f5c0f-00001.parquet",
      "file_format": "PARQUET",
      "partition": {},
      "record_count": 3,
      ...
    }
  }
]

As we can see, this manifest file still references data files A and B. However, the statuses have changed: for the first dictionary, the status was set to 0, while for the second it was set to 2. The second dictionary’s snapshot id was also updated.

Here is a brief explanation of the status key from the docs [2]:

When a file is added to the dataset, its manifest entry should store the snapshot ID in which the file was added and set status to 1 (added).

When a file is replaced or deleted from the dataset, its manifest entry fields store the snapshot ID in which the file was deleted and status 2 (deleted). The file may be deleted from the file system when the snapshot in which it was deleted is garbage collected, assuming that older snapshots have also been garbage collected [*].

Status 0 indicates that the data file already existed, which it did. Status 2 tells us that the data file is deleted from the manifest and can be garbage collected (under certain conditions). The snapshot id was updated to reference the snapshot in which the file was deleted.

Hence, data file A will no longer be used.

Additionally, the sequence_number and file_sequence_number fields were also updated for both dictionaries. For more details, see [2].
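Putting the status codes to work: when planning a scan, a reader keeps only the entries whose status is not 2 (deleted). A toy sketch, where the short file names stand in for this story's data files B, A, and D:

```python
# Manifest entry statuses per the Iceberg spec: 0 = existing, 1 = added,
# 2 = deleted. A scan reads only files whose entries are not deleted.
# The short paths below are stand-ins for data files B, A, and D.
entries = [
    {"status": 0, "data_file": {"file_path": "data/file-B.parquet"}},
    {"status": 2, "data_file": {"file_path": "data/file-A.parquet"}},
    {"status": 1, "data_file": {"file_path": "data/file-D.parquet"}},
]

def live_files(manifest_entries):
    """Files a scan should read: every entry not marked deleted (status 2)."""
    return [
        entry["data_file"]["file_path"]
        for entry in manifest_entries
        if entry["status"] != 2
    ]

print(live_files(entries))  # ['data/file-B.parquet', 'data/file-D.parquet']
```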

The other manifest file (ending with m1) references the new data file D as can be seen:

[
  {
    "status": 1,
    "snapshot_id": 7825605915503001692,
    "sequence_number": null,
    "file_sequence_number": null,
    "data_file": {
      "content": 0,
      "file_path": "s3://iceberg-data/db/minio/data/00000-14-ab2a6c1e-23c1-4aa1-8d9a-09a9d7d97238-00001.parquet",
      "file_format": "PARQUET",
      "partition": {},
      "record_count": 2,
      "file_size_in_bytes": 1656,
      ...
    }
  }
]

Status 1 means that data file D was added.

Here is the list of the data files and how they are referenced:

  1. Data file A — referenced by manifest d4223bc8-a7bd-46b3-80f3-7434fb900e74-m0.json with status deleted, AND
    b10658da-8308-41c7-9209-217725fa8660-m0.json with status added;
  2. Data file B — referenced by manifest d4223bc8-a7bd-46b3-80f3-7434fb900e74-m0.json (same as data file A), AND
    b10658da-8308-41c7-9209-217725fa8660-m0.json with status added;
  3. Data file C — referenced by manifest 274b9f78-7835-494a-88e7-c4a7fbc87659-m0.json
  4. Data file D — referenced by manifest d4223bc8-a7bd-46b3-80f3-7434fb900e74-m1.json

Note that the manifest file b10658da-8308-41c7-9209-217725fa8660-m0.avro is not referenced by the newest manifest list file.

Conclusion — deleting a row

Deleting a row basically did the following:

  1. created a new file with the data omitting the deleted row;
  2. created two manifest files that basically replace the old manifest file — one references the new data file, the second one updates the status of the old data file;
  3. the new manifest list file omits the manifest file that referenced the original data file (the one that includes the deleted row).

Updating a row

Finally, let’s look at what it means to update a row. For this part, we execute the following lines in the notebook:

spark.sql("UPDATE db.minio SET salary = 220 WHERE id == '36646'")
table.show()

We’ll see the following table:

+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|   id|gender|salary|
+---------+----------+--------+-----+------+------+
|New James|          |   Smith|36646|     M|   220|
|    James|          |   Smith|36636|     M|  3000|
|  Michael|      Rose|        |40288|     M|  4000|
|   Robert|          |Williams|42114|     M|  4000|
|    Maria|      Anne|   Jones|39192|     F|  4000|
+---------+----------+--------+-----+------+------+

After the update, we get one new data file (E) and four new metadata files.

The data files stored in the Minio S3 bucket

The contents of the new file are the row that we changed (updated):

{"firstname":"New James","middlename":"","lastname":"Smith","id":"36646","gender":"M","salary":220}

Therefore, the data file E is an updated version of data file C.

And here are the metadata files:

The metadata files in the bucket after update

Again, we get a new metadata file that:

  1. references the current snapshot id via current-snapshot-id,
  2. has an updated snapshot-log,
  3. and references the old metadata files via the metadata-log key.

Just like with the delete query, the update produced a new manifest list that references two new manifest files. The old manifest file, 274b9f78-7835-494a-88e7-c4a7fbc87659-m0.json, which referenced data file C, is removed from the manifest list.

Just like with the delete query, one of the new manifest files (ending with m0) references data file C, but sets the status to deleted (2). The other manifest file (ending with m1) references the new data file E.

Here is the list of how the data files are referenced:

  1. Data file A — referenced by manifest d4223bc8-a7bd-46b3-80f3-7434fb900e74-m0.json with status deleted
  2. Data file B — referenced by manifest d4223bc8-a7bd-46b3-80f3-7434fb900e74-m0.json (same as data file A)
  3. Data file C — referenced by manifest c5934717-68b5-4262-8218-a2de395ba51e-m0.json with status deleted (status 2), AND
    manifest 274b9f78-7835-494a-88e7-c4a7fbc87659-m0.json with status added
  4. Data file D — referenced by manifest d4223bc8-a7bd-46b3-80f3-7434fb900e74-m1.json
  5. Data file E — referenced by manifest c5934717-68b5-4262-8218-a2de395ba51e-m1.json

Note that we now have two manifest files that are no longer referenced by the newest (current) manifest list:

  1. b10658da-8308-41c7-9209-217725fa8660-m0.json (references data files A and B);
  2. 274b9f78-7835-494a-88e7-c4a7fbc87659-m0.json (references data file C).

Conclusion — updating a row

With regard to the points of interest discussed in this story, the delete and update operations have a similar impact on the creation of data and metadata files. All of the old data and metadata files were kept. The new manifest list replaced one old manifest file with two new ones, one of which marks the old data file as deleted.

Note that, while not discussed in this story, there may be other changes that Apache Iceberg makes when appending, deleting, or updating data rows.
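One way to see this "nothing is deleted yet" behaviour is to diff the full set of manifest files on storage against those referenced by the current manifest list. A sketch using the manifest file names from this story (the unreferenced files remain reachable from older snapshots until those snapshots are expired):

```python
# After the update, several manifest files still exist on storage but are
# no longer referenced by the *current* manifest list. Names taken from
# the manifests discussed in this story.
all_manifests = {
    "b10658da-8308-41c7-9209-217725fa8660-m0.avro",
    "274b9f78-7835-494a-88e7-c4a7fbc87659-m0.avro",
    "d4223bc8-a7bd-46b3-80f3-7434fb900e74-m0.avro",
    "d4223bc8-a7bd-46b3-80f3-7434fb900e74-m1.avro",
    "c5934717-68b5-4262-8218-a2de395ba51e-m0.avro",
    "c5934717-68b5-4262-8218-a2de395ba51e-m1.avro",
}
current_manifest_list = {
    "d4223bc8-a7bd-46b3-80f3-7434fb900e74-m0.avro",
    "d4223bc8-a7bd-46b3-80f3-7434fb900e74-m1.avro",
    "c5934717-68b5-4262-8218-a2de395ba51e-m0.avro",
    "c5934717-68b5-4262-8218-a2de395ba51e-m1.avro",
}

# Unreferenced by the current snapshot, but still needed by older ones
unreferenced = sorted(all_manifests - current_manifest_list)
print(unreferenced)
```

This matches the two manifest files we listed above as no longer referenced. They are only physically removed once the snapshots that reference them are expired.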

The fifth (final) question

For the fifth question — yeah, I guess I was able to use a previously prepared standalone cluster (with some modifications). I used it to demonstrate Apache Iceberg in four stories (including this one).

Summary

I found Apache Iceberg interesting as a technology. As I mentioned in the previous story, I found it interesting how Iceberg organises the table data and metadata under the hood.

In this story, we looked at the metadata and data files created when running create, append, delete and update operations.

Hope you found it useful.

The code is available on GitHub here. It also includes the metadata and data files that I discussed in the story. The files are located under the medium-data folder.

Additionally, I couldn’t open the Avro files in Visual Studio Code or PyCharm. Therefore, I used the script print_avro_contents.py to transform them into JSON format. I also include these JSON files.

References

  1. https://www.dremio.com/resources/guides/apache-iceberg-an-architectural-look-under-the-covers/
  2. Iceberg spec — https://iceberg.apache.org/spec/

Marin Aglić

Working as a Software Engineer. Interested in Data Engineering. Mostly working with Airflow, Python, Celery, BigQuery. Ex PhD student.