Apache Iceberg Table Maintenance using PySpark

Thomas Lawless
29 min read · Jun 19, 2024



Apache Iceberg has emerged as a powerful table format for managing large analytical datasets. Its features like schema evolution, time travel, and partition evolution make it a go-to choice for data engineers and scientists. However, maintaining these tables efficiently is crucial to ensure optimal performance and reliability. In this blog post, we’ll explore how to perform table maintenance tasks using PySpark and Apache Iceberg.

Note: The code examples below use a simple but powerful local development environment outlined in a previous article.

Why Table Maintenance Matters

Regular table maintenance is essential to keep data operations efficient. Without proper maintenance, tables can suffer from issues like:

  • Fragmented Data: Small files can accumulate, leading to inefficiencies.
  • Stale Metadata: Outdated metadata can slow down query performance.
  • Orphan Files: Unreferenced files can consume storage unnecessarily.

Apache Iceberg provides several utilities to address these issues, and PySpark offers the tools to automate and manage these tasks.

The Small File Problem

The small file problem in Iceberg occurs when a table is written to frequently, resulting in a large number of small files. This fragmentation can significantly degrade query performance because querying many small files requires more metadata operations and increases I/O overhead. Each file read incurs a certain fixed cost, and when files are small, this overhead becomes disproportionately large compared to the amount of data read. Additionally, a large number of small files can strain the underlying storage system, leading to inefficiencies and potential performance bottlenecks.
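To make the fixed-cost intuition concrete, here is a back-of-envelope calculation; the 5 ms per-file overhead is an assumed illustrative number, not a measured one:

# Back-of-envelope: per-file overhead dominates when files are tiny.
per_file_overhead_ms = 5  # assumed open/seek/metadata cost per file
files_fragmented, files_compacted = 100_000, 10
print(per_file_overhead_ms * files_fragmented / 1000)  # 500.0 seconds of pure overhead
print(per_file_overhead_ms * files_compacted / 1000)   # 0.05 seconds for the same data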

This problem is exacerbated in environments where data is ingested continuously or in near real-time, as each update generates new files. Over time, these files accumulate, making it increasingly difficult to manage and query the data efficiently. To mitigate the small file problem, Apache Iceberg provides mechanisms such as compaction, which consolidates small files into larger ones. This process helps reduce the number of files and improves read performance by minimizing the metadata and I/O overhead. Properly managing the frequency and size of data updates, along with regular maintenance tasks like compaction, is essential to maintaining optimal performance and efficiency in an Iceberg table.

Setting up the Environment

Let’s start by initializing our PySpark session and creating an Iceberg table.

from pyspark.sql import SparkSession

# Initialize Spark session with Iceberg configurations
spark = SparkSession.builder \
    .appName("IcebergLocalDevelopment") \
    .config('spark.jars.packages', 'org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2') \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.local", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.local.type", "hadoop") \
    .config("spark.sql.catalog.local.warehouse", "spark-warehouse/iceberg") \
    .getOrCreate()

# Create an Iceberg table
spark.sql("""
CREATE TABLE local.fragmentation.data_points (
    id INT,
    name STRING,
    value INT
) USING iceberg""")
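Aside: write-time file sizing can also be tuned per table. The variant below is hypothetical (the data_points_sized name is mine); write.target-file-size-bytes is a standard Iceberg table property, though it only caps file size at write time and does not prevent the many tiny files produced by frequent small commits:

# Hypothetical variant: the same schema with an explicit target data file size.
spark.sql("""
CREATE TABLE local.fragmentation.data_points_sized (
    id INT,
    name STRING,
    value INT
) USING iceberg
TBLPROPERTIES ('write.target-file-size-bytes'='134217728')""")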

With our table created, let’s get a baseline of the snapshots, manifests, and data files managed by Iceberg (hint: there should be none because we haven’t populated the table with data).

spark.sql("SELECT * FROM local.fragmentation.data_points.snapshots").show(truncate=False)
+------------+-----------+---------+---------+-------------+-------+
|committed_at|snapshot_id|parent_id|operation|manifest_list|summary|
+------------+-----------+---------+---------+-------------+-------+
+------------+-----------+---------+---------+-------------+-------+

spark.sql("SELECT * FROM local.fragmentation.data_points.manifests").show(truncate=False)
+-------+----+------+-----------------+-----------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+
|content|path|length|partition_spec_id|added_snapshot_id|added_data_files_count|existing_data_files_count|deleted_data_files_count|added_delete_files_count|existing_delete_files_count|deleted_delete_files_count|partition_summaries|
+-------+----+------+-----------------+-----------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+
+-------+----+------+-----------------+-----------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+

spark.sql("SELECT * FROM local.fragmentation.data_points.files").show(truncate=False)
+-------+---------+-----------+-------+------------+------------------+------------+------------+-----------------+----------------+------------+------------+------------+-------------+------------+-------------+----------------+
|content|file_path|file_format|spec_id|record_count|file_size_in_bytes|column_sizes|value_counts|null_value_counts|nan_value_counts|lower_bounds|upper_bounds|key_metadata|split_offsets|equality_ids|sort_order_id|readable_metrics|
+-------+---------+-----------+-------+------------+------------------+------------+------------+-----------------+----------------+------------+------------+------------+-------------+------------+-------------+----------------+
+-------+---------+-----------+-------+------------+------------------+------------+------------+-----------------+----------------+------------+------------+------------+-------------+------------+-------------+----------------+

Inserting Data

Let’s add some data to our table. We are going to use INSERT statements and just a few rows to create an environment where the small file problem exists.

# First data insert
spark.sql("""
INSERT INTO local.fragmentation.data_points VALUES
(1, 'metric_1', 5),
(2, 'metric_2', 10),
(3, 'metric_1', 5),
(4, 'metric_2', 10),
(5, 'metric_1', 5)
""")

Let’s get an updated view of the snapshots, manifest files, and data files managed by Iceberg for our table.

spark.sql("SELECT * FROM local.fragmentation.data_points.snapshots").show(truncate=False)
+-----------------------+-------------------+---------+---------+-------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|committed_at |snapshot_id |parent_id|operation|manifest_list |summary |
+-----------------------+-------------------+---------+---------+-------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|2024-06-18 12:37:25.354|5691300473162964711|NULL |append |spark-warehouse/iceberg/fragmentation/data_points/metadata/snap-5691300473162964711-1-900a94e9-7cd4-4b54-adcf-ab81302797ef.avro|{spark.app.id -> local-1718728023648, added-data-files -> 5, added-records -> 5, added-files-size -> 4324, changed-partition-count -> 1, total-records -> 5, total-files-size -> 4324, total-data-files -> 5, total-delete-files -> 0, total-position-deletes -> 0, total-equality-deletes -> 0}|
+-----------------------+-------------------+---------+---------+-------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

spark.sql("SELECT * FROM local.fragmentation.data_points.manifests").show(truncate=False)
+-------+-------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+
|content|path |length|partition_spec_id|added_snapshot_id |added_data_files_count|existing_data_files_count|deleted_data_files_count|added_delete_files_count|existing_delete_files_count|deleted_delete_files_count|partition_summaries|
+-------+-------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+
|0 |spark-warehouse/iceberg/fragmentation/data_points/metadata/900a94e9-7cd4-4b54-adcf-ab81302797ef-m0.avro|6817 |0 |5691300473162964711|5 |0 |0 |0 |0 |0 |[] |
+-------+-------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+

spark.sql("SELECT * FROM local.fragmentation.data_points.files").show(truncate=False)
+-------+-------------------------------------------------------------------------------------------------------------------+-----------+-------+------------+------------------+---------------------------+------------------------+------------------------+----------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------+-------------+------------+-------------+----------------------------------------------------------------------------------------+
|content|file_path |file_format|spec_id|record_count|file_size_in_bytes|column_sizes |value_counts |null_value_counts |nan_value_counts|lower_bounds |upper_bounds |key_metadata|split_offsets|equality_ids|sort_order_id|readable_metrics |
+-------+-------------------------------------------------------------------------------------------------------------------+-----------+-------+------------+------------------+---------------------------+------------------------+------------------------+----------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------+-------------+------------+-------------+----------------------------------------------------------------------------------------+
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00000-4-3414896e-7b7f-43fa-a2f4-88acf403a9e4-0-00001.parquet|PARQUET |0 |1 |865 |{1 -> 36, 2 -> 44, 3 -> 36}|{1 -> 1, 2 -> 1, 3 -> 1}|{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [01 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [05 00 00 00]}|{1 -> [01 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [05 00 00 00]}|NULL |[4] |NULL |0 |{{36, 1, 0, NULL, 1, 1}, {44, 1, 0, NULL, metric_1, metric_1}, {36, 1, 0, NULL, 5, 5}} |
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00001-5-3414896e-7b7f-43fa-a2f4-88acf403a9e4-0-00001.parquet|PARQUET |0 |1 |865 |{1 -> 36, 2 -> 44, 3 -> 36}|{1 -> 1, 2 -> 1, 3 -> 1}|{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [02 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [0A 00 00 00]}|{1 -> [02 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [0A 00 00 00]}|NULL |[4] |NULL |0 |{{36, 1, 0, NULL, 2, 2}, {44, 1, 0, NULL, metric_2, metric_2}, {36, 1, 0, NULL, 10, 10}}|
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00002-6-3414896e-7b7f-43fa-a2f4-88acf403a9e4-0-00001.parquet|PARQUET |0 |1 |864 |{1 -> 35, 2 -> 44, 3 -> 36}|{1 -> 1, 2 -> 1, 3 -> 1}|{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [03 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [05 00 00 00]}|{1 -> [03 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [05 00 00 00]}|NULL |[4] |NULL |0 |{{35, 1, 0, NULL, 3, 3}, {44, 1, 0, NULL, metric_1, metric_1}, {36, 1, 0, NULL, 5, 5}} |
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00003-7-3414896e-7b7f-43fa-a2f4-88acf403a9e4-0-00001.parquet|PARQUET |0 |1 |865 |{1 -> 36, 2 -> 44, 3 -> 36}|{1 -> 1, 2 -> 1, 3 -> 1}|{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [04 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [0A 00 00 00]}|{1 -> [04 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [0A 00 00 00]}|NULL |[4] |NULL |0 |{{36, 1, 0, NULL, 4, 4}, {44, 1, 0, NULL, metric_2, metric_2}, {36, 1, 0, NULL, 10, 10}}|
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00004-8-3414896e-7b7f-43fa-a2f4-88acf403a9e4-0-00001.parquet|PARQUET |0 |1 |865 |{1 -> 36, 2 -> 44, 3 -> 36}|{1 -> 1, 2 -> 1, 3 -> 1}|{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [05 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [05 00 00 00]}|{1 -> [05 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [05 00 00 00]}|NULL |[4] |NULL |0 |{{36, 1, 0, NULL, 5, 5}, {44, 1, 0, NULL, metric_1, metric_1}, {36, 1, 0, NULL, 5, 5}} |
+-------+-------------------------------------------------------------------------------------------------------------------+-----------+-------+------------+------------------+---------------------------+------------------------+------------------------+----------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------+-------------+------------+-------------+----------------------------------------------------------------------------------------+

The results show Iceberg is currently managing:

  • One snapshot, representing the append operation that resulted from inserting data into the table,
  • One manifest file, containing metadata about the data files in the table, and
  • Five data files, each containing a single row (a quick way to verify these counts follows below).
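Rather than scanning the metadata tables by eye, a count query against the same files metadata table gives the headline number. This is a convenience sketch of my own, not part of the original walkthrough:

# Convenience sketch: count the data files currently backing the table.
file_count = spark.sql(
    "SELECT COUNT(*) AS cnt FROM local.fragmentation.data_points.files"
).collect()[0]["cnt"]
print(f"Data files: {file_count}")  # expect 5 at this point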

Inserting More Data

Let’s insert more data to see how Iceberg handles a second write action.

# Second data insert
spark.sql("""
INSERT INTO local.fragmentation.data_points VALUES
(6, 'metric_2', 5),
(7, 'metric_1', 10),
(8, 'metric_2', 5),
(9, 'metric_1', 10),
(10, 'metric_2', 5)
""")

Let’s get an updated view of the snapshots, manifests, and data files managed by Iceberg for our table.

spark.sql("SELECT * FROM local.fragmentation.data_points.snapshots").show(truncate=False)
+-----------------------+-------------------+-------------------+---------+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|committed_at |snapshot_id |parent_id |operation|manifest_list |summary |
+-----------------------+-------------------+-------------------+---------+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|2024-06-18 12:37:25.354|5691300473162964711|NULL |append |spark-warehouse/iceberg/fragmentation/data_points/metadata/snap-5691300473162964711-1-900a94e9-7cd4-4b54-adcf-ab81302797ef.avro|{spark.app.id -> local-1718728023648, added-data-files -> 5, added-records -> 5, added-files-size -> 4324, changed-partition-count -> 1, total-records -> 5, total-files-size -> 4324, total-data-files -> 5, total-delete-files -> 0, total-position-deletes -> 0, total-equality-deletes -> 0} |
|2024-06-18 12:47:31.789|3928157242535611215|5691300473162964711|append |spark-warehouse/iceberg/fragmentation/data_points/metadata/snap-3928157242535611215-1-55caa6b0-d832-4368-917e-569eca2ddc17.avro|{spark.app.id -> local-1718728023648, added-data-files -> 5, added-records -> 5, added-files-size -> 4325, changed-partition-count -> 1, total-records -> 10, total-files-size -> 8649, total-data-files -> 10, total-delete-files -> 0, total-position-deletes -> 0, total-equality-deletes -> 0}|
+-----------------------+-------------------+-------------------+---------+-------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

spark.sql("SELECT * FROM local.fragmentation.data_points.manifests").show(truncate=False)
+-------+-------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+
|content|path |length|partition_spec_id|added_snapshot_id |added_data_files_count|existing_data_files_count|deleted_data_files_count|added_delete_files_count|existing_delete_files_count|deleted_delete_files_count|partition_summaries|
+-------+-------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+
|0 |spark-warehouse/iceberg/fragmentation/data_points/metadata/55caa6b0-d832-4368-917e-569eca2ddc17-m0.avro|6815 |0 |3928157242535611215|5 |0 |0 |0 |0 |0 |[] |
|0 |spark-warehouse/iceberg/fragmentation/data_points/metadata/900a94e9-7cd4-4b54-adcf-ab81302797ef-m0.avro|6817 |0 |5691300473162964711|5 |0 |0 |0 |0 |0 |[] |
+-------+-------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+

spark.sql("SELECT * FROM local.fragmentation.data_points.files").show(truncate=False)
+-------+--------------------------------------------------------------------------------------------------------------------+-----------+-------+------------+------------------+---------------------------+------------------------+------------------------+----------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------+-------------+------------+-------------+----------------------------------------------------------------------------------------+
|content|file_path |file_format|spec_id|record_count|file_size_in_bytes|column_sizes |value_counts |null_value_counts |nan_value_counts|lower_bounds |upper_bounds |key_metadata|split_offsets|equality_ids|sort_order_id|readable_metrics |
+-------+--------------------------------------------------------------------------------------------------------------------+-----------+-------+------------+------------------+---------------------------+------------------------+------------------------+----------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------+-------------+------------+-------------+----------------------------------------------------------------------------------------+
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00000-12-29f4487c-b0df-426b-b860-47d17e1b6506-0-00001.parquet|PARQUET |0 |1 |865 |{1 -> 36, 2 -> 44, 3 -> 36}|{1 -> 1, 2 -> 1, 3 -> 1}|{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [06 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [05 00 00 00]}|{1 -> [06 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [05 00 00 00]}|NULL |[4] |NULL |0 |{{36, 1, 0, NULL, 6, 6}, {44, 1, 0, NULL, metric_2, metric_2}, {36, 1, 0, NULL, 5, 5}} |
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00001-13-29f4487c-b0df-426b-b860-47d17e1b6506-0-00001.parquet|PARQUET |0 |1 |865 |{1 -> 36, 2 -> 44, 3 -> 36}|{1 -> 1, 2 -> 1, 3 -> 1}|{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [07 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [0A 00 00 00]}|{1 -> [07 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [0A 00 00 00]}|NULL |[4] |NULL |0 |{{36, 1, 0, NULL, 7, 7}, {44, 1, 0, NULL, metric_1, metric_1}, {36, 1, 0, NULL, 10, 10}}|
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00002-14-29f4487c-b0df-426b-b860-47d17e1b6506-0-00001.parquet|PARQUET |0 |1 |865 |{1 -> 36, 2 -> 44, 3 -> 36}|{1 -> 1, 2 -> 1, 3 -> 1}|{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [08 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [05 00 00 00]}|{1 -> [08 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [05 00 00 00]}|NULL |[4] |NULL |0 |{{36, 1, 0, NULL, 8, 8}, {44, 1, 0, NULL, metric_2, metric_2}, {36, 1, 0, NULL, 5, 5}} |
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00003-15-29f4487c-b0df-426b-b860-47d17e1b6506-0-00001.parquet|PARQUET |0 |1 |865 |{1 -> 36, 2 -> 44, 3 -> 36}|{1 -> 1, 2 -> 1, 3 -> 1}|{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [09 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [0A 00 00 00]}|{1 -> [09 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [0A 00 00 00]}|NULL |[4] |NULL |0 |{{36, 1, 0, NULL, 9, 9}, {44, 1, 0, NULL, metric_1, metric_1}, {36, 1, 0, NULL, 10, 10}}|
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00004-16-29f4487c-b0df-426b-b860-47d17e1b6506-0-00001.parquet|PARQUET |0 |1 |865 |{1 -> 36, 2 -> 44, 3 -> 36}|{1 -> 1, 2 -> 1, 3 -> 1}|{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [0A 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [05 00 00 00]}|{1 -> [0A 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [05 00 00 00]}|NULL |[4] |NULL |0 |{{36, 1, 0, NULL, 10, 10}, {44, 1, 0, NULL, metric_2, metric_2}, {36, 1, 0, NULL, 5, 5}}|
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00000-4-3414896e-7b7f-43fa-a2f4-88acf403a9e4-0-00001.parquet |PARQUET |0 |1 |865 |{1 -> 36, 2 -> 44, 3 -> 36}|{1 -> 1, 2 -> 1, 3 -> 1}|{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [01 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [05 00 00 00]}|{1 -> [01 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [05 00 00 00]}|NULL |[4] |NULL |0 |{{36, 1, 0, NULL, 1, 1}, {44, 1, 0, NULL, metric_1, metric_1}, {36, 1, 0, NULL, 5, 5}} |
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00001-5-3414896e-7b7f-43fa-a2f4-88acf403a9e4-0-00001.parquet |PARQUET |0 |1 |865 |{1 -> 36, 2 -> 44, 3 -> 36}|{1 -> 1, 2 -> 1, 3 -> 1}|{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [02 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [0A 00 00 00]}|{1 -> [02 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [0A 00 00 00]}|NULL |[4] |NULL |0 |{{36, 1, 0, NULL, 2, 2}, {44, 1, 0, NULL, metric_2, metric_2}, {36, 1, 0, NULL, 10, 10}}|
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00002-6-3414896e-7b7f-43fa-a2f4-88acf403a9e4-0-00001.parquet |PARQUET |0 |1 |864 |{1 -> 35, 2 -> 44, 3 -> 36}|{1 -> 1, 2 -> 1, 3 -> 1}|{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [03 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [05 00 00 00]}|{1 -> [03 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [05 00 00 00]}|NULL |[4] |NULL |0 |{{35, 1, 0, NULL, 3, 3}, {44, 1, 0, NULL, metric_1, metric_1}, {36, 1, 0, NULL, 5, 5}} |
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00003-7-3414896e-7b7f-43fa-a2f4-88acf403a9e4-0-00001.parquet |PARQUET |0 |1 |865 |{1 -> 36, 2 -> 44, 3 -> 36}|{1 -> 1, 2 -> 1, 3 -> 1}|{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [04 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [0A 00 00 00]}|{1 -> [04 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [0A 00 00 00]}|NULL |[4] |NULL |0 |{{36, 1, 0, NULL, 4, 4}, {44, 1, 0, NULL, metric_2, metric_2}, {36, 1, 0, NULL, 10, 10}}|
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00004-8-3414896e-7b7f-43fa-a2f4-88acf403a9e4-0-00001.parquet |PARQUET |0 |1 |865 |{1 -> 36, 2 -> 44, 3 -> 36}|{1 -> 1, 2 -> 1, 3 -> 1}|{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [05 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [05 00 00 00]}|{1 -> [05 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [05 00 00 00]}|NULL |[4] |NULL |0 |{{36, 1, 0, NULL, 5, 5}, {44, 1, 0, NULL, metric_1, metric_1}, {36, 1, 0, NULL, 5, 5}} |
+-------+--------------------------------------------------------------------------------------------------------------------+-----------+-------+------------+------------------+---------------------------+------------------------+------------------------+----------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------+-------------+------------+-------------+----------------------------------------------------------------------------------------+

The results show Iceberg is now managing:

  • A second snapshot, representing the append operation from inserting the additional data,
  • A second manifest file, containing metadata about the new data files, and
  • Five more data files, each containing a single row.

Compacting Data

The example above is simple, but it demonstrates how quickly small data files can accumulate in an Iceberg table that is written to frequently. The solution to this issue is data compaction: the process of combining smaller files into larger files to improve query performance.

We can execute compaction on our table using the example below.

spark.sql("CALL local.system.rewrite_data_files('local.fragmentation.data_points')").show(truncate=False)
+--------------------------+----------------------+---------------------+-----------------------+
|rewritten_data_files_count|added_data_files_count|rewritten_bytes_count|failed_data_files_count|
+--------------------------+----------------------+---------------------+-----------------------+
|10                        |1                     |8649                 |0                      |
+--------------------------+----------------------+---------------------+-----------------------+

Take note of the DataFrame displayed as the result of executing data compaction:

  • Ten data files were rewritten, and
  • One new data file was created.
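The call above relies on the procedure's defaults. rewrite_data_files also accepts named arguments for the rewrite strategy and its options; the sketch below shows what a tuned invocation could look like. The strategy and option names ('binpack', 'target-file-size-bytes', 'min-input-files') come from Iceberg's documented procedure parameters, but verify them against the Iceberg version you run:

# Hedged sketch: compaction with an explicit strategy and options.
spark.sql("""
CALL local.system.rewrite_data_files(
    table => 'local.fragmentation.data_points',
    strategy => 'binpack',
    options => map(
        'target-file-size-bytes', '134217728',
        'min-input-files', '2'
    )
)
""").show(truncate=False)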

Let’s get an updated view of the snapshots, manifests, and data files managed by Iceberg for our table.

spark.sql("SELECT * FROM local.fragmentation.data_points.snapshots").show(truncate=False)
+-----------------------+-------------------+-------------------+---------+-------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|committed_at |snapshot_id |parent_id |operation|manifest_list |summary |
+-----------------------+-------------------+-------------------+---------+-------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|2024-06-18 12:37:25.354|5691300473162964711|NULL |append |spark-warehouse/iceberg/fragmentation/data_points/metadata/snap-5691300473162964711-1-900a94e9-7cd4-4b54-adcf-ab81302797ef.avro|{spark.app.id -> local-1718728023648, added-data-files -> 5, added-records -> 5, added-files-size -> 4324, changed-partition-count -> 1, total-records -> 5, total-files-size -> 4324, total-data-files -> 5, total-delete-files -> 0, total-position-deletes -> 0, total-equality-deletes -> 0} |
|2024-06-18 12:47:31.789|3928157242535611215|5691300473162964711|append |spark-warehouse/iceberg/fragmentation/data_points/metadata/snap-3928157242535611215-1-55caa6b0-d832-4368-917e-569eca2ddc17.avro|{spark.app.id -> local-1718728023648, added-data-files -> 5, added-records -> 5, added-files-size -> 4325, changed-partition-count -> 1, total-records -> 10, total-files-size -> 8649, total-data-files -> 10, total-delete-files -> 0, total-position-deletes -> 0, total-equality-deletes -> 0} |
|2024-06-18 12:53:45.424|4138898901876475658|3928157242535611215|replace |spark-warehouse/iceberg/fragmentation/data_points/metadata/snap-4138898901876475658-1-ef0593c1-115c-4fce-9885-e11eef5f24b1.avro|{added-data-files -> 1, deleted-data-files -> 10, added-records -> 10, deleted-records -> 10, added-files-size -> 1009, removed-files-size -> 8649, changed-partition-count -> 1, total-records -> 10, total-files-size -> 1009, total-data-files -> 1, total-delete-files -> 0, total-position-deletes -> 0, total-equality-deletes -> 0}|
+-----------------------+-------------------+-------------------+---------+-------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

spark.sql("SELECT * FROM local.fragmentation.data_points.manifests").show(truncate=False)
+-------+-------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+
|content|path |length|partition_spec_id|added_snapshot_id |added_data_files_count|existing_data_files_count|deleted_data_files_count|added_delete_files_count|existing_delete_files_count|deleted_delete_files_count|partition_summaries|
+-------+-------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+
|0 |spark-warehouse/iceberg/fragmentation/data_points/metadata/ef0593c1-115c-4fce-9885-e11eef5f24b1-m2.avro|6763 |0 |4138898901876475658|1 |0 |0 |0 |0 |0 |[] |
|0 |spark-warehouse/iceberg/fragmentation/data_points/metadata/ef0593c1-115c-4fce-9885-e11eef5f24b1-m1.avro|6816 |0 |4138898901876475658|0 |0 |5 |0 |0 |0 |[] |
|0 |spark-warehouse/iceberg/fragmentation/data_points/metadata/ef0593c1-115c-4fce-9885-e11eef5f24b1-m0.avro|6819 |0 |4138898901876475658|0 |0 |5 |0 |0 |0 |[] |
+-------+-------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+

spark.sql("SELECT * FROM local.fragmentation.data_points.all_data_files").show(truncate=False)
+-------+--------------------------------------------------------------------------------------------------------------------+-----------+-------+------------+------------------+---------------------------+---------------------------+------------------------+----------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------+-------------+------------+-------------+-------------------------------------------------------------------------------------------+
|content|file_path |file_format|spec_id|record_count|file_size_in_bytes|column_sizes |value_counts |null_value_counts |nan_value_counts|lower_bounds |upper_bounds |key_metadata|split_offsets|equality_ids|sort_order_id|readable_metrics |
+-------+--------------------------------------------------------------------------------------------------------------------+-----------+-------+------------+------------------+---------------------------+---------------------------+------------------------+----------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------+-------------+------------+-------------+-------------------------------------------------------------------------------------------+
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00000-20-3d51480d-2c9c-46dd-881c-62172d888f79-0-00001.parquet|PARQUET |0 |10 |1009 |{1 -> 78, 2 -> 89, 3 -> 78}|{1 -> 10, 2 -> 10, 3 -> 10}|{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [01 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [05 00 00 00]}|{1 -> [0A 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [0A 00 00 00]}|NULL |[4] |NULL |0 |{{78, 10, 0, NULL, 1, 10}, {89, 10, 0, NULL, metric_1, metric_2}, {78, 10, 0, NULL, 5, 10}}|
+-------+--------------------------------------------------------------------------------------------------------------------+-----------+-------+------------+------------------+---------------------------+---------------------------+------------------------+----------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------+-------------+------------+-------------+-------------------------------------------------------------------------------------------+

The results show Iceberg is now managing:

  • A third snapshot, representing the replace operation performed by data compaction,
  • A third manifest file, containing metadata about the resulting data file, but
  • Only one data file, containing all 10 rows.

These results show data compaction was successful, with our original 10 data files compacted down to one. Now when queries are executed against our table, only one data file will be accessed, reducing I/O overhead.

Cleaning up Metadata

The previous section demonstrated the impact of frequent writes to an Iceberg table and the effect data compaction has in consolidating a large number of small files into fewer, larger files. It also demonstrated how the number of manifest files increases with each write operation to the table.

Stale metadata in an Iceberg table can significantly impact the efficiency of data operations. Metadata in Iceberg includes information about table schema, partitioning, file locations, and snapshots, which are crucial for optimizing query performance. Stale metadata can cause inefficiencies in query planning and execution, resulting in suboptimal use of resources and increased query times. Regular metadata maintenance operations are essential to keep the metadata up-to-date, ensuring optimal performance. Without timely updates, the overall performance of the Iceberg table can degrade, hindering its effectiveness in handling large-scale data analytics.

Let’s start with the current state of the snapshots, manifests, and data files managed by Iceberg for our table.

spark.sql("SELECT * FROM local.fragmentation.data_points.snapshots").show(truncate=False)
+-----------------------+-------------------+-------------------+---------+-------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|committed_at |snapshot_id |parent_id |operation|manifest_list |summary |
+-----------------------+-------------------+-------------------+---------+-------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|2024-06-18 12:37:25.354|5691300473162964711|NULL |append |spark-warehouse/iceberg/fragmentation/data_points/metadata/snap-5691300473162964711-1-900a94e9-7cd4-4b54-adcf-ab81302797ef.avro|{spark.app.id -> local-1718728023648, added-data-files -> 5, added-records -> 5, added-files-size -> 4324, changed-partition-count -> 1, total-records -> 5, total-files-size -> 4324, total-data-files -> 5, total-delete-files -> 0, total-position-deletes -> 0, total-equality-deletes -> 0} |
|2024-06-18 12:47:31.789|3928157242535611215|5691300473162964711|append |spark-warehouse/iceberg/fragmentation/data_points/metadata/snap-3928157242535611215-1-55caa6b0-d832-4368-917e-569eca2ddc17.avro|{spark.app.id -> local-1718728023648, added-data-files -> 5, added-records -> 5, added-files-size -> 4325, changed-partition-count -> 1, total-records -> 10, total-files-size -> 8649, total-data-files -> 10, total-delete-files -> 0, total-position-deletes -> 0, total-equality-deletes -> 0} |
|2024-06-18 12:53:45.424|4138898901876475658|3928157242535611215|replace |spark-warehouse/iceberg/fragmentation/data_points/metadata/snap-4138898901876475658-1-ef0593c1-115c-4fce-9885-e11eef5f24b1.avro|{added-data-files -> 1, deleted-data-files -> 10, added-records -> 10, deleted-records -> 10, added-files-size -> 1009, removed-files-size -> 8649, changed-partition-count -> 1, total-records -> 10, total-files-size -> 1009, total-data-files -> 1, total-delete-files -> 0, total-position-deletes -> 0, total-equality-deletes -> 0}|
+-----------------------+-------------------+-------------------+---------+-------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

spark.sql("SELECT * FROM local.fragmentation.data_points.manifests").show(truncate=False)
+-------+-------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+
|content|path |length|partition_spec_id|added_snapshot_id |added_data_files_count|existing_data_files_count|deleted_data_files_count|added_delete_files_count|existing_delete_files_count|deleted_delete_files_count|partition_summaries|
+-------+-------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+
|0 |spark-warehouse/iceberg/fragmentation/data_points/metadata/ef0593c1-115c-4fce-9885-e11eef5f24b1-m2.avro|6763 |0 |4138898901876475658|1 |0 |0 |0 |0 |0 |[] |
|0 |spark-warehouse/iceberg/fragmentation/data_points/metadata/ef0593c1-115c-4fce-9885-e11eef5f24b1-m1.avro|6816 |0 |4138898901876475658|0 |0 |5 |0 |0 |0 |[] |
|0 |spark-warehouse/iceberg/fragmentation/data_points/metadata/ef0593c1-115c-4fce-9885-e11eef5f24b1-m0.avro|6819 |0 |4138898901876475658|0 |0 |5 |0 |0 |0 |[] |
+-------+-------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+

spark.sql("SELECT * FROM local.fragmentation.data_points.all_data_files").show(truncate=False)
+-------+--------------------------------------------------------------------------------------------------------------------+-----------+-------+------------+------------------+---------------------------+---------------------------+------------------------+----------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------+-------------+------------+-------------+-------------------------------------------------------------------------------------------+
|content|file_path |file_format|spec_id|record_count|file_size_in_bytes|column_sizes |value_counts |null_value_counts |nan_value_counts|lower_bounds |upper_bounds |key_metadata|split_offsets|equality_ids|sort_order_id|readable_metrics |
+-------+--------------------------------------------------------------------------------------------------------------------+-----------+-------+------------+------------------+---------------------------+---------------------------+------------------------+----------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------+-------------+------------+-------------+-------------------------------------------------------------------------------------------+
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00000-20-3d51480d-2c9c-46dd-881c-62172d888f79-0-00001.parquet|PARQUET |0 |10 |1009 |{1 -> 78, 2 -> 89, 3 -> 78}|{1 -> 10, 2 -> 10, 3 -> 10}|{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [01 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [05 00 00 00]}|{1 -> [0A 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [0A 00 00 00]}|NULL |[4] |NULL |0 |{{78, 10, 0, NULL, 1, 10}, {89, 10, 0, NULL, metric_1, metric_2}, {78, 10, 0, NULL, 5, 10}}|
+-------+--------------------------------------------------------------------------------------------------------------------+-----------+-------+------------+------------------+---------------------------+---------------------------+------------------------+----------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------+-------------+------------+-------------+-------------------------------------------------------------------------------------------+

Rewriting Manifests

We can rewrite our table’s manifests using the example below.

spark.sql("CALL local.system.rewrite_manifests('local.fragmentation.data_points')").show(truncate=False)
+-------------------------+---------------------+
|rewritten_manifests_count|added_manifests_count|
+-------------------------+---------------------+
|3                        |1                    |
+-------------------------+---------------------+

Again, take note of the DataFrame displayed as the result of executing a manifest rewrite:

  • Three manifest files were rewritten, and
  • One new manifest file was created.
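As with compaction, rewrite_manifests accepts named arguments. On large tables the procedure caches manifest entries in Spark by default; use_caching is the documented switch for that behavior. A hedged sketch, again worth verifying against your Iceberg version:

# Hedged sketch: rewrite manifests with Spark-side caching disabled.
spark.sql("""
CALL local.system.rewrite_manifests(
    table => 'local.fragmentation.data_points',
    use_caching => false
)
""").show(truncate=False)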

Let’s get an updated view of the snapshots, manifests, and data files managed by Iceberg for our table.

spark.sql("SELECT * FROM local.fragmentation.data_points.snapshots").show(truncate=False)
+-----------------------+-------------------+-------------------+---------+-------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|committed_at |snapshot_id |parent_id |operation|manifest_list |summary |
+-----------------------+-------------------+-------------------+---------+-------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|2024-06-18 12:37:25.354|5691300473162964711|NULL |append |spark-warehouse/iceberg/fragmentation/data_points/metadata/snap-5691300473162964711-1-900a94e9-7cd4-4b54-adcf-ab81302797ef.avro|{spark.app.id -> local-1718728023648, added-data-files -> 5, added-records -> 5, added-files-size -> 4324, changed-partition-count -> 1, total-records -> 5, total-files-size -> 4324, total-data-files -> 5, total-delete-files -> 0, total-position-deletes -> 0, total-equality-deletes -> 0} |
|2024-06-18 12:47:31.789|3928157242535611215|5691300473162964711|append |spark-warehouse/iceberg/fragmentation/data_points/metadata/snap-3928157242535611215-1-55caa6b0-d832-4368-917e-569eca2ddc17.avro|{spark.app.id -> local-1718728023648, added-data-files -> 5, added-records -> 5, added-files-size -> 4325, changed-partition-count -> 1, total-records -> 10, total-files-size -> 8649, total-data-files -> 10, total-delete-files -> 0, total-position-deletes -> 0, total-equality-deletes -> 0} |
|2024-06-18 12:53:45.424|4138898901876475658|3928157242535611215|replace |spark-warehouse/iceberg/fragmentation/data_points/metadata/snap-4138898901876475658-1-ef0593c1-115c-4fce-9885-e11eef5f24b1.avro|{added-data-files -> 1, deleted-data-files -> 10, added-records -> 10, deleted-records -> 10, added-files-size -> 1009, removed-files-size -> 8649, changed-partition-count -> 1, total-records -> 10, total-files-size -> 1009, total-data-files -> 1, total-delete-files -> 0, total-position-deletes -> 0, total-equality-deletes -> 0}|
|2024-06-18 16:22:10.074|5886528569011072329|4138898901876475658|replace |spark-warehouse/iceberg/fragmentation/data_points/metadata/snap-5886528569011072329-1-3fb53bbf-ddac-4e5a-8ea5-eb6dd0f84774.avro|{manifests-created -> 1, manifests-kept -> 0, manifests-replaced -> 3, entries-processed -> 0, changed-partition-count -> 0, total-records -> 10, total-files-size -> 1009, total-data-files -> 1, total-delete-files -> 0, total-position-deletes -> 0, total-equality-deletes -> 0} |
+-----------------------+-------------------+-------------------+---------+-------------------------------------------------------------------------------------------------------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

spark.sql("SELECT * FROM local.fragmentation.data_points.manifests").show(truncate=False)
+-------+----------------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+
|content|path |length|partition_spec_id|added_snapshot_id |added_data_files_count|existing_data_files_count|deleted_data_files_count|added_delete_files_count|existing_delete_files_count|deleted_delete_files_count|partition_summaries|
+-------+----------------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+
|0 |spark-warehouse/iceberg/fragmentation/data_points/metadata/optimized-m-cb687f47-19d4-449c-a87d-c4018c73f625.avro|6764 |0 |5886528569011072329|0 |1 |0 |0 |0 |0 |[] |
+-------+----------------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+

spark.sql("SELECT * FROM local.fragmentation.data_points.files").show(truncate=False)
+-------+--------------------------------------------------------------------------------------------------------------------+-----------+-------+------------+------------------+---------------------------+---------------------------+------------------------+----------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------+-------------+------------+-------------+-------------------------------------------------------------------------------------------+
|content|file_path |file_format|spec_id|record_count|file_size_in_bytes|column_sizes |value_counts |null_value_counts |nan_value_counts|lower_bounds |upper_bounds |key_metadata|split_offsets|equality_ids|sort_order_id|readable_metrics |
+-------+--------------------------------------------------------------------------------------------------------------------+-----------+-------+------------+------------------+---------------------------+---------------------------+------------------------+----------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------+-------------+------------+-------------+-------------------------------------------------------------------------------------------+
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00000-20-3d51480d-2c9c-46dd-881c-62172d888f79-0-00001.parquet|PARQUET |0 |10 |1009 |{1 -> 78, 2 -> 89, 3 -> 78}|{1 -> 10, 2 -> 10, 3 -> 10}|{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [01 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [05 00 00 00]}|{1 -> [0A 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [0A 00 00 00]}|NULL |[4] |NULL |0 |{{78, 10, 0, NULL, 1, 10}, {89, 10, 0, NULL, metric_1, metric_2}, {78, 10, 0, NULL, 5, 10}}|
+-------+--------------------------------------------------------------------------------------------------------------------+-----------+-------+------------+------------------+---------------------------+---------------------------+------------------------+----------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------+-------------+------------+-------------+-------------------------------------------------------------------------------------------+

The results show Iceberg is now managing:

  • A fourth snapshot, representing the replace operation performed by the manifest rewrite,
  • Only one manifest file, versus the three that existed before the rewrite, and
  • One data file containing 10 rows.

The results show the manifest rewrite was successful: our table now uses a single manifest file instead of three, so queries against the table only need to read one manifest during planning.
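In practice, compaction and manifest rewrites are usually run together on a schedule rather than ad hoc. Below is a minimal sketch of such a routine, assuming the session and catalog configured earlier; the function name and structure are mine, not part of Iceberg's API:

# Minimal maintenance routine sketch: compact data files, then rewrite manifests.
def run_table_maintenance(spark, catalog: str, table: str) -> None:
    # Consolidate small data files into larger ones.
    compacted = spark.sql(
        f"CALL {catalog}.system.rewrite_data_files('{table}')"
    ).collect()[0]
    print(f"Rewrote {compacted['rewritten_data_files_count']} data files")

    # Consolidate the manifest files that track those data files.
    manifests = spark.sql(
        f"CALL {catalog}.system.rewrite_manifests('{table}')"
    ).collect()[0]
    print(f"Rewrote {manifests['rewritten_manifests_count']} manifest files")

# Usage, matching the table used throughout this post:
run_table_maintenance(spark, "local", "local.fragmentation.data_points")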

Were the Data Files and Manifests Deleted?

The previous sections demonstrated how to reduce the number of files accessed when Iceberg plans and executes a query against a table using data compaction and manifest rewrites. What happened to the data files and manifests which existed before these procedures were executed? Were they deleted?

Let’s get a view of the total number of data files and manifests being tracked by Iceberg for our table.

spark.sql("SELECT * FROM local.fragmentation.data_points.all_data_files").show(truncate=False)
+-------+--------------------------------------------------------------------------------------------------------------------+-----------+-------+------------+------------------+---------------------------+---------------------------+------------------------+----------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------+-------------+------------+-------------+-------------------------------------------------------------------------------------------+
|content|file_path |file_format|spec_id|record_count|file_size_in_bytes|column_sizes |value_counts |null_value_counts |nan_value_counts|lower_bounds |upper_bounds |key_metadata|split_offsets|equality_ids|sort_order_id|readable_metrics |
+-------+--------------------------------------------------------------------------------------------------------------------+-----------+-------+------------+------------------+---------------------------+---------------------------+------------------------+----------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------+-------------+------------+-------------+-------------------------------------------------------------------------------------------+
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00000-20-3d51480d-2c9c-46dd-881c-62172d888f79-0-00001.parquet|PARQUET |0 |10 |1009 |{1 -> 78, 2 -> 89, 3 -> 78}|{1 -> 10, 2 -> 10, 3 -> 10}|{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [01 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [05 00 00 00]}|{1 -> [0A 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [0A 00 00 00]}|NULL |[4] |NULL |0 |{{78, 10, 0, NULL, 1, 10}, {89, 10, 0, NULL, metric_1, metric_2}, {78, 10, 0, NULL, 5, 10}}|
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00000-20-3d51480d-2c9c-46dd-881c-62172d888f79-0-00001.parquet|PARQUET |0 |10 |1009 |{1 -> 78, 2 -> 89, 3 -> 78}|{1 -> 10, 2 -> 10, 3 -> 10}|{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [01 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [05 00 00 00]}|{1 -> [0A 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [0A 00 00 00]}|NULL |[4] |NULL |0 |{{78, 10, 0, NULL, 1, 10}, {89, 10, 0, NULL, metric_1, metric_2}, {78, 10, 0, NULL, 5, 10}}|
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00000-4-3414896e-7b7f-43fa-a2f4-88acf403a9e4-0-00001.parquet |PARQUET |0 |1 |865 |{1 -> 36, 2 -> 44, 3 -> 36}|{1 -> 1, 2 -> 1, 3 -> 1} |{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [01 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [05 00 00 00]}|{1 -> [01 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [05 00 00 00]}|NULL |[4] |NULL |0 |{{36, 1, 0, NULL, 1, 1}, {44, 1, 0, NULL, metric_1, metric_1}, {36, 1, 0, NULL, 5, 5}} |
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00001-5-3414896e-7b7f-43fa-a2f4-88acf403a9e4-0-00001.parquet |PARQUET |0 |1 |865 |{1 -> 36, 2 -> 44, 3 -> 36}|{1 -> 1, 2 -> 1, 3 -> 1} |{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [02 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [0A 00 00 00]}|{1 -> [02 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [0A 00 00 00]}|NULL |[4] |NULL |0 |{{36, 1, 0, NULL, 2, 2}, {44, 1, 0, NULL, metric_2, metric_2}, {36, 1, 0, NULL, 10, 10}} |
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00002-6-3414896e-7b7f-43fa-a2f4-88acf403a9e4-0-00001.parquet |PARQUET |0 |1 |864 |{1 -> 35, 2 -> 44, 3 -> 36}|{1 -> 1, 2 -> 1, 3 -> 1} |{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [03 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [05 00 00 00]}|{1 -> [03 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [05 00 00 00]}|NULL |[4] |NULL |0 |{{35, 1, 0, NULL, 3, 3}, {44, 1, 0, NULL, metric_1, metric_1}, {36, 1, 0, NULL, 5, 5}} |
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00003-7-3414896e-7b7f-43fa-a2f4-88acf403a9e4-0-00001.parquet |PARQUET |0 |1 |865 |{1 -> 36, 2 -> 44, 3 -> 36}|{1 -> 1, 2 -> 1, 3 -> 1} |{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [04 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [0A 00 00 00]}|{1 -> [04 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [0A 00 00 00]}|NULL |[4] |NULL |0 |{{36, 1, 0, NULL, 4, 4}, {44, 1, 0, NULL, metric_2, metric_2}, {36, 1, 0, NULL, 10, 10}} |
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00004-8-3414896e-7b7f-43fa-a2f4-88acf403a9e4-0-00001.parquet |PARQUET |0 |1 |865 |{1 -> 36, 2 -> 44, 3 -> 36}|{1 -> 1, 2 -> 1, 3 -> 1} |{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [05 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [05 00 00 00]}|{1 -> [05 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [05 00 00 00]}|NULL |[4] |NULL |0 |{{36, 1, 0, NULL, 5, 5}, {44, 1, 0, NULL, metric_1, metric_1}, {36, 1, 0, NULL, 5, 5}} |
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00000-12-29f4487c-b0df-426b-b860-47d17e1b6506-0-00001.parquet|PARQUET |0 |1 |865 |{1 -> 36, 2 -> 44, 3 -> 36}|{1 -> 1, 2 -> 1, 3 -> 1} |{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [06 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [05 00 00 00]}|{1 -> [06 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [05 00 00 00]}|NULL |[4] |NULL |0 |{{36, 1, 0, NULL, 6, 6}, {44, 1, 0, NULL, metric_2, metric_2}, {36, 1, 0, NULL, 5, 5}} |
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00001-13-29f4487c-b0df-426b-b860-47d17e1b6506-0-00001.parquet|PARQUET |0 |1 |865 |{1 -> 36, 2 -> 44, 3 -> 36}|{1 -> 1, 2 -> 1, 3 -> 1} |{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [07 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [0A 00 00 00]}|{1 -> [07 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [0A 00 00 00]}|NULL |[4] |NULL |0 |{{36, 1, 0, NULL, 7, 7}, {44, 1, 0, NULL, metric_1, metric_1}, {36, 1, 0, NULL, 10, 10}} |
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00002-14-29f4487c-b0df-426b-b860-47d17e1b6506-0-00001.parquet|PARQUET |0 |1 |865 |{1 -> 36, 2 -> 44, 3 -> 36}|{1 -> 1, 2 -> 1, 3 -> 1} |{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [08 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [05 00 00 00]}|{1 -> [08 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [05 00 00 00]}|NULL |[4] |NULL |0 |{{36, 1, 0, NULL, 8, 8}, {44, 1, 0, NULL, metric_2, metric_2}, {36, 1, 0, NULL, 5, 5}} |
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00003-15-29f4487c-b0df-426b-b860-47d17e1b6506-0-00001.parquet|PARQUET |0 |1 |865 |{1 -> 36, 2 -> 44, 3 -> 36}|{1 -> 1, 2 -> 1, 3 -> 1} |{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [09 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [0A 00 00 00]}|{1 -> [09 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [0A 00 00 00]}|NULL |[4] |NULL |0 |{{36, 1, 0, NULL, 9, 9}, {44, 1, 0, NULL, metric_1, metric_1}, {36, 1, 0, NULL, 10, 10}} |
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00004-16-29f4487c-b0df-426b-b860-47d17e1b6506-0-00001.parquet|PARQUET |0 |1 |865 |{1 -> 36, 2 -> 44, 3 -> 36}|{1 -> 1, 2 -> 1, 3 -> 1} |{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [0A 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [05 00 00 00]}|{1 -> [0A 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [05 00 00 00]}|NULL |[4] |NULL |0 |{{36, 1, 0, NULL, 10, 10}, {44, 1, 0, NULL, metric_2, metric_2}, {36, 1, 0, NULL, 5, 5}} |
+-------+--------------------------------------------------------------------------------------------------------------------+-----------+-------+------------+------------------+---------------------------+---------------------------+------------------------+----------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------+-------------+------------+-------------+-------------------------------------------------------------------------------------------+

spark.sql("SELECT * FROM local.fragmentation.data_points.all_manifests").show(truncate=False)+-------+----------------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+---------------------+
|content|path |length|partition_spec_id|added_snapshot_id |added_data_files_count|existing_data_files_count|deleted_data_files_count|added_delete_files_count|existing_delete_files_count|deleted_delete_files_count|partition_summaries|reference_snapshot_id|
+-------+----------------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+---------------------+
|0 |spark-warehouse/iceberg/fragmentation/data_points/metadata/900a94e9-7cd4-4b54-adcf-ab81302797ef-m0.avro |6817 |0 |5691300473162964711|5 |0 |0 |0 |0 |0 |[] |5691300473162964711 |
|0 |spark-warehouse/iceberg/fragmentation/data_points/metadata/55caa6b0-d832-4368-917e-569eca2ddc17-m0.avro |6815 |0 |3928157242535611215|5 |0 |0 |0 |0 |0 |[] |3928157242535611215 |
|0 |spark-warehouse/iceberg/fragmentation/data_points/metadata/900a94e9-7cd4-4b54-adcf-ab81302797ef-m0.avro |6817 |0 |5691300473162964711|5 |0 |0 |0 |0 |0 |[] |3928157242535611215 |
|0 |spark-warehouse/iceberg/fragmentation/data_points/metadata/ef0593c1-115c-4fce-9885-e11eef5f24b1-m2.avro |6763 |0 |4138898901876475658|1 |0 |0 |0 |0 |0 |[] |4138898901876475658 |
|0 |spark-warehouse/iceberg/fragmentation/data_points/metadata/ef0593c1-115c-4fce-9885-e11eef5f24b1-m1.avro |6816 |0 |4138898901876475658|0 |0 |5 |0 |0 |0 |[] |4138898901876475658 |
|0 |spark-warehouse/iceberg/fragmentation/data_points/metadata/ef0593c1-115c-4fce-9885-e11eef5f24b1-m0.avro |6819 |0 |4138898901876475658|0 |0 |5 |0 |0 |0 |[] |4138898901876475658 |
|0 |spark-warehouse/iceberg/fragmentation/data_points/metadata/optimized-m-cb687f47-19d4-449c-a87d-c4018c73f625.avro|6764 |0 |5886528569011072329|0 |1 |0 |0 |0 |0 |[] |5886528569011072329 |
+-------+----------------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+---------------------+

The results show Iceberg is still tracking:

  • Twelve data file entries (the compacted file appears twice because it is referenced by more than one snapshot), and
  • Seven manifest files.

While these files are no longer needed to plan and execute queries against the current table state, they still consume storage.
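If you only need the counts rather than the full listings, the metadata tables can be aggregated like any other table. A small convenience query (the column aliases below are my own):

spark.sql("SELECT count(*) AS data_file_entries FROM local.fragmentation.data_points.all_data_files").show()
spark.sql("SELECT count(*) AS manifest_entries FROM local.fragmentation.data_points.all_manifests").show()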

Expiring Snapshots

Iceberg continues to track the data and manifest files because they are referenced in the table’s snapshots. Iceberg table snapshots serve the purpose of capturing and preserving the state of a table at specific points in time. These snapshots are fundamental for ensuring data consistency and enabling advanced data management features such as time travel and incremental data processing. By maintaining a record of each snapshot, Iceberg allows users to query historical versions of the data, which is invaluable for auditing, debugging, and understanding changes over time. Snapshots also facilitate safe schema evolution and data updates by isolating changes until they are fully committed. This isolation helps prevent data corruption and ensures that queries always run against a consistent view of the data. Overall, snapshots provide a robust mechanism for managing data lifecycle and integrity in an Iceberg table.
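As a quick illustration of what snapshots enable, any snapshot can be queried directly using Spark's time travel syntax. The snapshot ID below is taken from the earlier manifest output and the timestamp is illustrative; substitute values from your own snapshots metadata table:

# Time travel by snapshot ID (substitute an ID from your snapshots table).
spark.sql("SELECT * FROM local.fragmentation.data_points VERSION AS OF 4138898901876475658").show()

# Time travel to the table state as of a point in time.
spark.sql("SELECT * FROM local.fragmentation.data_points TIMESTAMP AS OF '2024-06-18 16:30:00'").show()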

Care, thought, and planning should go into your snapshot management strategy, because expiring a snapshot permanently removes the ability to time travel to it. As an example, let’s expire all but the most recent snapshot in our Iceberg table.

spark.sql("CALL local.system.expire_snapshots('local.fragmentation.data_points', TIMESTAMP '2099-01-01 00:00:00.000', 1)").show(truncate=False)
+------------------------+-----------------------------------+-----------------------------------+----------------------------+----------------------------+------------------------------+
|deleted_data_files_count|deleted_position_delete_files_count|deleted_equality_delete_files_count|deleted_manifest_files_count|deleted_manifest_lists_count|deleted_statistics_files_count|
+------------------------+-----------------------------------+-----------------------------------+----------------------------+----------------------------+------------------------------+
|10 |0 |0 |5 |3 |0 |
+------------------------+-----------------------------------+-----------------------------------+----------------------------+----------------------------+------------------------------+

Again, take note of the DataFrame returned by the expire_snapshots procedure:

  • Ten data files were deleted, and
  • Eight manifest-related files were deleted (five manifests and three manifest lists).
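For reference, the positional arguments in the call above map to the procedure's table, older_than, and retain_last parameters. In practice, a named-argument call with a realistic retention window is easier to read. Here is a sketch; the older_than and retain_last values are illustrative and should match your own retention policy:

spark.sql("""
CALL local.system.expire_snapshots(
  table => 'local.fragmentation.data_points',
  older_than => TIMESTAMP '2024-06-11 00:00:00.000', -- expire snapshots older than this
  retain_last => 5 -- but always keep at least the last 5 snapshots
)
""").show(truncate=False)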

Let’s get an updated view of the snapshots, manifests, and data files managed by Iceberg for our table.

spark.sql("SELECT * FROM local.fragmentation.data_points.snapshots").show(truncate=False)
+-----------------------+-------------------+-------------------+---------+-------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|committed_at |snapshot_id |parent_id |operation|manifest_list |summary |
+-----------------------+-------------------+-------------------+---------+-------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|2024-06-18 16:22:10.074|5886528569011072329|4138898901876475658|replace |spark-warehouse/iceberg/fragmentation/data_points/metadata/snap-5886528569011072329-1-3fb53bbf-ddac-4e5a-8ea5-eb6dd0f84774.avro|{manifests-created -> 1, manifests-kept -> 0, manifests-replaced -> 3, entries-processed -> 0, changed-partition-count -> 0, total-records -> 10, total-files-size -> 1009, total-data-files -> 1, total-delete-files -> 0, total-position-deletes -> 0, total-equality-deletes -> 0}|
+-----------------------+-------------------+-------------------+---------+-------------------------------------------------------------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

spark.sql("SELECT * FROM local.fragmentation.data_points.manifests").show(truncate=False)
+-------+----------------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+
|content|path |length|partition_spec_id|added_snapshot_id |added_data_files_count|existing_data_files_count|deleted_data_files_count|added_delete_files_count|existing_delete_files_count|deleted_delete_files_count|partition_summaries|
+-------+----------------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+
|0 |spark-warehouse/iceberg/fragmentation/data_points/metadata/optimized-m-cb687f47-19d4-449c-a87d-c4018c73f625.avro|6764 |0 |5886528569011072329|0 |1 |0 |0 |0 |0 |[] |
+-------+----------------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+

spark.sql("SELECT * FROM local.fragmentation.data_points.files").show(truncate=False)
+-------+--------------------------------------------------------------------------------------------------------------------+-----------+-------+------------+------------------+---------------------------+---------------------------+------------------------+----------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------+-------------+------------+-------------+-------------------------------------------------------------------------------------------+
|content|file_path |file_format|spec_id|record_count|file_size_in_bytes|column_sizes |value_counts |null_value_counts |nan_value_counts|lower_bounds |upper_bounds |key_metadata|split_offsets|equality_ids|sort_order_id|readable_metrics |
+-------+--------------------------------------------------------------------------------------------------------------------+-----------+-------+------------+------------------+---------------------------+---------------------------+------------------------+----------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------+-------------+------------+-------------+-------------------------------------------------------------------------------------------+
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00000-20-3d51480d-2c9c-46dd-881c-62172d888f79-0-00001.parquet|PARQUET |0 |10 |1009 |{1 -> 78, 2 -> 89, 3 -> 78}|{1 -> 10, 2 -> 10, 3 -> 10}|{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [01 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [05 00 00 00]}|{1 -> [0A 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [0A 00 00 00]}|NULL |[4] |NULL |0 |{{78, 10, 0, NULL, 1, 10}, {89, 10, 0, NULL, metric_1, metric_2}, {78, 10, 0, NULL, 5, 10}}|
+-------+--------------------------------------------------------------------------------------------------------------------+-----------+-------+------------+------------------+---------------------------+---------------------------+------------------------+----------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------+-------------+------------+-------------+-------------------------------------------------------------------------------------------+

The results show our table now has one snapshot, one manifest, and one data file.

Let’s get a view of the total number of data files and manifests tracked by our Iceberg table after expiring the snapshots.

spark.sql("SELECT * FROM local.fragmentation.data_points.all_data_files").show(truncate=False)
+-------+--------------------------------------------------------------------------------------------------------------------+-----------+-------+------------+------------------+---------------------------+---------------------------+------------------------+----------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------+-------------+------------+-------------+-------------------------------------------------------------------------------------------+
|content|file_path |file_format|spec_id|record_count|file_size_in_bytes|column_sizes |value_counts |null_value_counts |nan_value_counts|lower_bounds |upper_bounds |key_metadata|split_offsets|equality_ids|sort_order_id|readable_metrics |
+-------+--------------------------------------------------------------------------------------------------------------------+-----------+-------+------------+------------------+---------------------------+---------------------------+------------------------+----------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------+-------------+------------+-------------+-------------------------------------------------------------------------------------------+
|0 |spark-warehouse/iceberg/fragmentation/data_points/data/00000-20-3d51480d-2c9c-46dd-881c-62172d888f79-0-00001.parquet|PARQUET |0 |10 |1009 |{1 -> 78, 2 -> 89, 3 -> 78}|{1 -> 10, 2 -> 10, 3 -> 10}|{1 -> 0, 2 -> 0, 3 -> 0}|{} |{1 -> [01 00 00 00], 2 -> [6D 65 74 72 69 63 5F 31], 3 -> [05 00 00 00]}|{1 -> [0A 00 00 00], 2 -> [6D 65 74 72 69 63 5F 32], 3 -> [0A 00 00 00]}|NULL |[4] |NULL |0 |{{78, 10, 0, NULL, 1, 10}, {89, 10, 0, NULL, metric_1, metric_2}, {78, 10, 0, NULL, 5, 10}}|
+-------+--------------------------------------------------------------------------------------------------------------------+-----------+-------+------------+------------------+---------------------------+---------------------------+------------------------+----------------+------------------------------------------------------------------------+------------------------------------------------------------------------+------------+-------------+------------+-------------+-------------------------------------------------------------------------------------------+

spark.sql("SELECT * FROM local.fragmentation.data_points.all_manifests").show(truncate=False)
+-------+----------------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+---------------------+
|content|path |length|partition_spec_id|added_snapshot_id |added_data_files_count|existing_data_files_count|deleted_data_files_count|added_delete_files_count|existing_delete_files_count|deleted_delete_files_count|partition_summaries|reference_snapshot_id|
+-------+----------------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+---------------------+
|0 |spark-warehouse/iceberg/fragmentation/data_points/metadata/optimized-m-cb687f47-19d4-449c-a87d-c4018c73f625.avro|6764 |0 |5886528569011072329|0 |1 |0 |0 |0 |0 |[] |5886528569011072329 |
+-------+----------------------------------------------------------------------------------------------------------------+------+-----------------+-------------------+----------------------+-------------------------+------------------------+------------------------+---------------------------+--------------------------+-------------------+---------------------+

These results show that expiring snapshots also deletes the data files and manifests that were referenced only by the expired snapshots.

Removing Orphaned Files

From time to time, failed or concurrent operations on an Iceberg table can result in orphaned files: files that exist in storage but are not referenced by the table's metadata.

Let’s try to remove orphaned files from our table.

spark.sql("CALL local.system.remove_orphan_files('local.fragmentation.data_points')").show(truncate=False)
+--------------------+
|orphan_file_location|
+--------------------+
+--------------------+

The results show our table contained no orphaned files; if it had, they would have been deleted, reclaiming the storage they consumed.
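Note that, by default, the procedure only considers files older than three days, which protects files still being written by in-flight jobs. When running it against a production table for the first time, a dry run with named arguments is a prudent first step (a sketch; the timestamp below is illustrative):

spark.sql("""
CALL local.system.remove_orphan_files(
  table => 'local.fragmentation.data_points',
  older_than => TIMESTAMP '2024-06-15 00:00:00.000', -- only consider files older than this
  dry_run => true -- list candidate files without deleting them
)
""").show(truncate=False)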

Conclusion

Efficient table maintenance is vital for ensuring the performance and reliability of your data infrastructure. Apache Iceberg, combined with PySpark, offers powerful tools to manage and optimize your datasets. Regularly performing tasks like compaction, expiring old snapshots, and removing orphan files will help keep your tables in top shape, ensuring fast queries and efficient storage usage.

The examples in this article use the most basic form of each procedure, but every procedure provides configuration options that tailor how it impacts the target table. Additional information can be found in the Apache Iceberg documentation.
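As a starting point for automating these tasks, the procedures covered in this article can be wrapped into a single routine and run on a schedule. Below is a minimal sketch using mostly default options; the retention value is illustrative and should be tuned per table:

def maintain_table(spark, table):
    # Compact small data files into larger ones.
    spark.sql(f"CALL local.system.rewrite_data_files(table => '{table}')")
    # Consolidate small manifest files.
    spark.sql(f"CALL local.system.rewrite_manifests(table => '{table}')")
    # Expire old snapshots, always retaining at least the last five.
    spark.sql(f"CALL local.system.expire_snapshots(table => '{table}', retain_last => 5)")
    # Delete files no longer referenced by any table metadata.
    spark.sql(f"CALL local.system.remove_orphan_files(table => '{table}')")

maintain_table(spark, 'local.fragmentation.data_points')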

Thomas Lawless

Distinguished Engineer, IBM CIO Data, AI, and Automation Platform