Mastering Delta Lake Optimizations: OPTIMIZE, Z-ORDER and VACUUM

Fernando De Nitto
Data Reply IT | DataTech
14 min read · Sep 28, 2023

Introduction

In the vast landscape of big data, where the volume of information seems to expand infinitely, Delta Lake stands as a beacon of reliability and consistency. Born from the need to maintain data integrity in data lakes, Delta Lake brings ACID transactions and reliability to the world of big data processing.

Yet, as big data continues to grow exponentially, so too do the challenges surrounding storage costs, parallelism, and operational efficiency.

In this blog post, we embark on a journey into the heart of Delta Lake optimization techniques — Optimize, Z-Order, and Vacuum. These techniques are not just optimization tools; they are the architects of efficiency in the big data realm. They possess the power to sculpt your Delta Lake into a lean, mean data processing machine, slashing storage costs, enhancing parallelism, and reducing the operational load on your infrastructure.

They offer not just solutions but gateways to a future where your Delta Lake operates seamlessly and cost-effectively in the realm of big data.

So, let’s embark on this exploration of OPTIMIZE, Z-ORDER, and VACUUM, and discover how they can reshape the way you handle big data.

Before starting…

Before talking about optimizations, it is worth recalling the main components of a Delta Lake at a high level. Although optimizations work at a low level by manipulating data files, their results are always reflected at a higher level. Understanding what a Delta Lake is made of, in terms of components, allows us to verify that the optimizations are actually working.

Contents of a Delta Table from Delta.io Blog

A Delta Table contains two crucial components: Parquet files and Delta Logs.

Parquet files represent the actual data stored within the table, organized in a highly efficient columnar format. This format ensures data compression and encoding optimizations, making it ideal for fast and cost-effective data storage and retrieval.

On the other hand, Delta Logs are transaction logs that record every operation performed on the data, such as inserts, updates, and deletes. These logs provide a detailed history of changes made to the table, ensuring data versioning, auditability, and data lineage. The combination of Parquet files and Delta Logs within a Delta Table offers a robust and reliable platform for managing, querying, and maintaining large-scale data, enabling features like data rollback, time travel queries, and data consistency in a Delta Lake environment.
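
For example, once a Delta table exists, you can inspect the history recorded in its Delta log through the DeltaTable API. This is a minimal sketch: the path is illustrative and an existing SparkSession named spark is assumed.

from delta.tables import DeltaTable

# A minimal sketch: list the operations recorded in the Delta log.
# The path is illustrative; `spark` is an existing SparkSession.
delta_table = DeltaTable.forPath(spark, "/path/to/delta-table")
delta_table.history().select("version", "timestamp", "operation").show(truncate=False)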

Here’s an example of a simple Delta Lake partitioned by year, month, and day:

-- Create a Delta Lake table with partitioning
CREATE TABLE delta_table (
  name STRING,
  age INT,
  year INT,
  month INT,
  day INT
)
USING DELTA
PARTITIONED BY (year, month, day)
LOCATION '/path/to/delta-lake/';

-- Insert data into the Delta Lake table
INSERT INTO delta_table
VALUES ('Alice', 25, 2023, 9, 1),
('Bob', 30, 2023, 9, 1),
('Charlie', 22, 2023, 9, 2);

Here is its folder structure, containing the Delta log and the Parquet data files:

delta-lake/
├── _delta_log/
│ ├── 00000000000000000000.json
│ ├── 00000000000000000001.json
│ ├── ...
│ ├── 000000000000000000XX.json

├── year=2023/
│ ├── month=01/
│ │ ├── day=01/
│ │ │ ├── data.parquet
│ │ │ ├── ...
│ │ │ ├── data.parquet
│ │ │
│ │ ├── day=02/
│ │ │ ├── data.parquet
│ │ │ ├── ...
│ │ │ ├── data.parquet
│ │ │
│ │ ├── ...
│ │
│ ├── month=02/
│ ├── ...

In this folder structure, we have:

  1. _delta_log: This directory contains the Delta Lake transaction logs (*.json files) that track all the changes made to the Delta Lake table.
  2. year=2023: This is the top-level directory for partitioning by year. Each subdirectory inside this directory represents a year, e.g., year=2023, and contains data for that specific year.
  3. month=01, month=02, etc.: Inside each year's directory, there are subdirectories for partitioning by month. Each subdirectory represents a month and contains data for that specific month within the corresponding year.
  4. day=01, day=02, etc.: Inside each month's directory, there are subdirectories for partitioning by day. Each subdirectory represents a day and contains data for that specific day within the corresponding month.
  5. data.parquet: These are Parquet files containing the actual data for each day. They are partitioned by year, month, and day.

Partitioning is a crucial technique in data storage systems like Delta Lake that helps improve query performance, data organization, and overall data management efficiency. Choose your partition columns carefully, though: the official Delta Lake documentation recommends partitioning on low-cardinality columns that you frequently filter on, since over-partitioning produces many small files and can actually hurt performance.
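
As a quick illustration, a query that filters on the partition columns allows the engine to prune entire directories instead of scanning the whole table. This is a minimal sketch using the table created above.

# A minimal sketch of partition pruning: filtering on the partition columns
# means Spark only reads the matching year=/month=/day= directories.
df = spark.read.format("delta").load("/path/to/delta-lake/")
df.where("year = 2023 AND month = 9 AND day = 1").show()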

Understanding Delta Lake Optimization Techniques

It’s time to look at the three optimization techniques: OPTIMIZE, Z-ORDER and VACUUM.
Each technique is first described at a high level, to understand what it is used for, how it works and what effects it produces.
Then, examples taken from the official documentation are introduced; I always suggest reading it to deepen the topic!

OPTIMIZE

OPTIMIZE Operation on Delta Table from Delta.io Blog

The Optimize operation in Delta Lake is like a digital janitor for your data. It works by cleaning up and reorganizing your data files, making them more efficient for query processing.

Here’s how it works:

File Compaction: Over time, as data is added, updated, and deleted in your Delta Lake, it can lead to fragmentation, resulting in a multitude of small data files. These tiny files can hinder query performance, as the query engine has to scan many files. Optimize comes to the rescue by compacting these small files into larger, more manageable ones.

Metadata Cleanup: Optimize doesn’t just handle data files; it also takes care of your metadata. It removes unnecessary entries in the metadata, ensuring that only relevant information is retained. This helps in keeping your Delta Lake’s metadata lean and efficient.

Statistics Update: Optimize also updates statistics about the data, which can be crucial for query optimization. By having accurate statistics, the query optimizer can make more informed decisions about how to execute queries efficiently.

In essence, the Optimize operation is your ally in maintaining a well-organized and efficient Delta Lake. It reduces storage overhead, improves query performance, and ensures your metadata remains tidy.

Let’s see an example to better understand how it works, starting with a simple Delta table, not partitioned and composed of four small Parquet files:

tmp/table1
├── _delta_log
│ └── 00000000000000000000.json
├── part-00000-4dc9742a-7c76-4527-a2c6-d7904f56d05d-c000.snappy.parquet
├── part-00001-18179b66-48f0-4d47-8f21-762678a7df66-c000.snappy.parquet
├── part-00002-03f9116e-189e-4e55-bfe7-d501fffe4ced-c000.snappy.parquet
└── part-00003-81cad732-eeaf-4708-9111-aa2e8136e304-c000.snappy.parquet

Now we can apply the OPTIMIZE operation simply by writing this code snippet:

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "tmp/table1")

delta_table.optimize().executeCompaction()

To see the effect of this optimization we should vacuum the delta table (we will see this in detail later) with this code:

spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
delta_table.vacuum(0)

If we take a look inside the Delta table folder, we can see how the OPTIMIZE operation worked:

tmp/table1
├── _delta_log
│ ├── 00000000000000000000.json
│ └── 00000000000000000001.json
└── part-00000-2b164d47-0f85-47e1-99dc-d136c784baaa-c000.snappy.parquet

In the end, we obtained a new Delta table with all the data in a single Parquet file, replacing the previous four files.

Notice that the _delta_log folder now contains a new file that records the operations performed by OPTIMIZE:

{
  "add": {
    "path": "part-00000-2b164d47-0f85-47e1-99dc-d136c784baaa-c000.snappy.parquet",
    "partitionValues": {},
    "size": 504,
    "modificationTime": 1671468211798,
    "dataChange": false,
    "stats": "{\"numRecords\":5,\"minValues\":{\"id\":0},\"maxValues\":{\"id\":4},\"nullCount\":{\"id\":0}}"
  }
}
{
  "remove": {
    "path": "part-00003-81cad732-eeaf-4708-9111-aa2e8136e304-c000.snappy.parquet",
    "deletionTimestamp": 1671468210066,
    "dataChange": false,
    "extendedFileMetadata": true,
    "partitionValues": {},
    "size": 478
  }
}
The other small files are also removed…

The OPTIMIZE command tombstoned all of the small data files and added a larger file with all the compacted data. Notice that dataChange is set to false for all the files that are added and removed.

A compaction operation does not add any new data to the Delta table. It just rearranges existing data from small files into larger files. This is useful when you have downstream jobs that watch your Delta table and incrementally update when new files arrive. The dataChange flag differentiates between data that’s been rearranged into larger files for compaction purposes and brand-new data that’s been ingested into your Delta table.
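
For example, a streaming job that incrementally reads the table is not re-triggered by compacted files, precisely because they are committed with dataChange set to false. This is a minimal sketch; the checkpoint path is illustrative.

# A minimal sketch: an incremental consumer of the Delta table.
# Commits produced by OPTIMIZE carry dataChange=false, so the rearranged
# files are not emitted again to this stream.
stream = (
    spark.readStream.format("delta")
    .load("tmp/table1")
    .writeStream.format("console")
    .option("checkpointLocation", "tmp/table1_checkpoint")
    .start()
)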

Please notice that in this simple example, the Delta Table is not partitioned.

Running OPTIMIZE on a partitioned Delta table compacts the table partition by partition. If you want to optimize only specific partitions, you can add a where clause to the optimize builder, as shown below.
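
For example, assuming the table partitioned by year, month and day from the first example, a single partition can be compacted with a sketch like this (the where predicate may only reference partition columns):

from delta.tables import DeltaTable

# A sketch: compact only one partition of a partitioned Delta table.
partitioned_table = DeltaTable.forPath(spark, "/path/to/delta-lake/")
partitioned_table.optimize().where("year = 2023 AND month = 9").executeCompaction()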

For further details about OPTIMIZE operation, you can read the official documentation.

Z-ORDER

Z-Order is a data organization technique that plays a crucial role in optimizing query performance. It revolves around the idea of physically grouping related data together. When you apply Z-Order to a Delta Lake table, you’re essentially encoding the values of one or more columns into a single key, which then determines the physical layout of the data on storage. This strategic organization minimizes the amount of data that has to be scanned during query execution.

Here’s how Z-Order works: Imagine you have a sales dataset, and you apply Z-Order on the “product_id” column. Delta Lake then sorts and stores all rows with the same “product_id” together, making it ideal for queries that filter or aggregate data by product. When you query for sales data related to a specific product, Delta Lake can efficiently retrieve only the relevant data, greatly improving query performance.

Z-ORDER Operation on Delta Table from Delta.io Blog

In practical terms, the Z-ORDER optimization carries out two operations:

  • It orders the data (within the Parquet files) according to the chosen column(s)
  • It populates the Delta log with per-file statistics, in particular the minimum and maximum values of those columns

In this way, at query time the engine only needs to consult the Delta log to decide which files to read, skipping files that cannot contain matching data.
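
To make this concrete, here is a purely illustrative sketch that reads the latest transaction log entry of the table from the previous example and prints the min/max statistics stored for each added Parquet file (assuming statistics were collected):

import glob
import json

# Purely illustrative: print the per-file min/max statistics stored in the
# most recent Delta transaction log entry. These are the values the engine
# consults to decide which files can be skipped for a given filter.
log_files = sorted(glob.glob("tmp/table1/_delta_log/*.json"))
with open(log_files[-1]) as f:
    for line in f:
        action = json.loads(line)
        if "add" in action and action["add"].get("stats"):
            stats = json.loads(action["add"]["stats"])
            print(action["add"]["path"], stats["minValues"], stats["maxValues"])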

In summary, Z-Order is your ally when dealing with large datasets and analytical workloads. By reducing unnecessary data scanning, it ensures your queries run faster and more efficiently, providing a significant boost to your Delta Lake’s overall performance.

Suppose you have a one billion-row data set with nine columns as follows:

+-----+-----+------------+---+---+------+---+---+---------+
| id1| id2| id3|id4|id5| id6| v1| v2| v3|
+-----+-----+------------+---+---+------+---+---+---------+
|id016|id046|id0000109363| 88| 13|146094| 4| 6|18.837686|
|id039|id087|id0000466766| 14| 30|111330| 4| 14|46.797328|
|id095|id078|id0000584803| 56| 92|213320| 1| 9|63.464315|
+-----+-----+------------+---+---+------+---+---+---------+

You’d like to run the following query: select id1, sum(v1) as v1 from the_table where id1 = 'id016' group by id1.

The Delta table is initially stored in 395 files, and the rows with id1 = 'id016' are dispersed throughout all the files, so none of the files can be skipped when the query is run.

(
    spark.read.format("delta")
    .option("versionAsOf", "0")
    .load(delta_path)
    .createOrReplaceTempView("x0")
)

spark.sql(
    "select id1, sum(v1) as v1 from x0 where id1 = 'id016' group by id1"
).collect()

It takes 4.51 seconds to run the query on the Delta table.

Now let’s apply the OPTIMIZE operation and compact all the files:

delta_table = DeltaTable.forPath(spark, delta_path)
delta_table.optimize().executeCompaction()

This produces a new version of the Delta table, with the data compacted into a smaller number of Parquet files.

We can apply the same query:

(
    spark.read.format("delta")
    .option("versionAsOf", "1")
    .load(delta_path)
    .createOrReplaceTempView("x1")
)

spark.sql(
    "select id1, sum(v1) as v1 from x1 where id1 = 'id016' group by id1"
).collect()

The query now only takes 4.33 seconds to run.

Let’s now Z Order the data by id1, so the rows with id1 = 'id016' are grouped together and not spread across all the files.

(
    DeltaTable.forPath(spark, delta_path)
    .optimize()
    .executeZOrderBy("id1")
)

Note that executeZOrderBy inherits the behaviour of the OPTIMIZE operation: it first orders all the data according to the specified Z-order column and then compacts the files.
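
executeZOrderBy also accepts multiple columns, and it can be combined with a partition filter just like plain compaction. A quick sketch (the second column name is only illustrative):

# A sketch: Z-order by more than one column.
delta_table.optimize().executeZOrderBy("id1", "id2")

# On a partitioned table, the operation can also be restricted to some partitions:
# delta_table.optimize().where("year = 2023").executeZOrderBy("id1")

Keep in mind that the more columns you Z-order by, the weaker the clustering becomes for each individual column, so pick only the columns you actually filter on.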

Now we can apply the same query to our third version of the Delta Table:

(
    spark.read.format("delta")
    .option("versionAsOf", "2")
    .load(delta_path)
    .createOrReplaceTempView("x2")
)

spark.sql(
    "select id1, sum(v1) as v1 from x2 where id1 = 'id016' group by id1"
).collect()

The query now takes 0.6 seconds to execute. Impressive!

As a result, we can see that Z-Ordering the data significantly improves query performance. This graph shows the query time by Delta table version:

Benchmark of Z-ORDER from Delta.io Blog

For further details about the Z-ORDER operation, you can read the official documentation.

VACUUM

VACUUM Operation on Delta Table from Delta.io Blog

Vacuum is a vital operation for maintaining the health of your Delta Lake. Over time, as data changes, gets updated, or is deleted, it can leave behind obsolete data files that take up valuable storage space. The Vacuum operation comes to the rescue by cleaning up these unnecessary files.

Here’s why Vacuum is essential:

  1. Storage Reclamation: By removing obsolete files, Vacuum reclaims storage space, reducing your storage costs. This is particularly crucial as data lakes grow over time and can become a significant expense.
  2. Metadata Cleanup: Vacuum not only deletes data files but also cleans up the associated metadata. This ensures that your Delta Lake’s metadata remains efficient and doesn’t get bloated with information about files that are no longer relevant.
  3. Performance Optimization: Keeping your Delta Lake lean and clutter-free through regular Vacuum operations ensures that query performance remains snappy. The query engine doesn’t waste time scanning through outdated or redundant files.

In essence, Vacuum is your maintenance tool, ensuring that your Delta Lake operates efficiently over the long term. It’s a simple yet crucial operation to keep your data lake healthy and cost-effective.

Suppose you have a delta table like this one:

df = spark.createDataFrame([("bob", 3), ("sue", 5)]).toDF("first_name", "age")
df.repartition(1).write.format("delta").saveAsTable("some_people")

spark.table("some_people").show()

+----------+---+
|first_name|age|
+----------+---+
| bob| 3|
| sue| 5|
+----------+---+

Now we append 2 more rows of data to the Delta Table:

df = spark.createDataFrame([("ingrid", 58), ("luisa", 87)]).toDF("first_name", "age")
df.repartition(1).write.format("delta").mode("append").saveAsTable("some_people")

spark.table("some_people").show()

+----------+---+
|first_name|age|
+----------+---+
| ingrid| 58|
| luisa| 87|
| bob| 3|
| sue| 5|
+----------+---+

Here are the files currently in the Delta table:

some_people
├── _delta_log
│ ├── 00000000000000000000.json
│ └── 00000000000000000001.json
├── part-00000-0e9cf175-b53d-4a1f-b132-8f71eacee991-c000.snappy.parquet
└── part-00000-9ced4666-4b26-4516-95d0-6e27bc2448e7-c000.snappy.parquet

Let’s overwrite the existing Delta table with some new data, which will tombstone the existing data. Overwrite operations mark all the existing data for removal in the transaction log (aka “tombstones all the existing files”).

df = spark.createDataFrame([("jordana", 26), ("fred", 25)]).toDF("first_name", "age")
df.repartition(1).write.format("delta").mode("overwrite").saveAsTable("some_people")

spark.table("some_people").show()

+----------+---+
|first_name|age|
+----------+---+
| jordana| 26|
| fred| 25|
+----------+---+

As recorded in the new transaction log entry, the overwrite operation has added a new file and tombstoned the existing files.

But… if we check the files inside our Delta Table directory we discover that all the files are still present in our folder:

spark-warehouse/some_people
├── _delta_log
│ ├── 00000000000000000000.json
│ ├── 00000000000000000001.json
│ └── 00000000000000000002.json
├── part-00000-0e9cf175-b53d-4a1f-b132-8f71eacee991-c000.snappy.parquet
├── part-00000-1009797a-564f-4b0c-8035-c45354018f21-c000.snappy.parquet
└── part-00000-9ced4666-4b26-4516-95d0-6e27bc2448e7-c000.snappy.parquet

The reason the files belonging to old versions of the Delta table are still present lies in a core property of Delta Lake: time travel. No file is physically deleted during insert, delete or update operations. This behaviour lets you browse different versions of the Delta table and analyze how it has changed over time (we already used this in the Z-Order examples via the versionAsOf option).
The drawback of this property is the extra storage it consumes.
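
Before vacuuming, those older versions are still fully queryable. For example, assuming a Spark/Delta version that supports the SQL time travel syntax:

# The transaction history shows which version corresponds to which operation...
spark.sql("DESCRIBE HISTORY some_people").select("version", "operation").show()

# ...and older versions remain queryable until their files are vacuumed.
spark.sql("SELECT * FROM some_people VERSION AS OF 1").show()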

This is exactly where the vacuum operation comes into play!

Let’s call the vacuum function:

# By default, Delta Lake refuses to run VACUUM with a retention period shorter
# than 7 days; disabling this safety check is required to use RETAIN 0 HOURS
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

# Call VACUUM
spark.sql("VACUUM some_people RETAIN 0 HOURS").show(truncate=False)

This command returns a DataFrame that shows the files that have been removed from storage.

  • …/part-00000-0e9cf175-b53d-4a1f-b132-8f71eacee991-c000.snappy.parquet
  • …/part-00000-9ced4666-4b26-4516-95d0-6e27bc2448e7-c000.snappy.parquet

List the files to confirm that they’ve been removed from storage:

spark-warehouse/some_people
├── _delta_log
│ ├── 00000000000000000000.json
│ ├── 00000000000000000001.json
│ └── 00000000000000000002.json
└── part-00000-1009797a-564f-4b0c-8035-c45354018f21-c000.snappy.parquet

The VACUUM operation correctly deletes the parquet files that don’t belong to the last version of the Delta Table.

Some (really) useful remarks about VACUUM

Please note that to execute the operation we set the following configuration:

spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

The official documentation discourages the use of this configuration, since the VACUUM operation is not reversible!

By default, VACUUM only deletes files that are no longer referenced by the table and are older than the 7-day retention threshold. This is an excellent trade-off: you reclaim storage while keeping the ability to time travel across your recent versions!

Using VACUUM limits your ability to time travel your data, so think carefully before running it just to save storage space!
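
If you are unsure, VACUUM also supports a DRY RUN mode that lists the files that would be deleted without actually removing anything. A quick sketch on the same table:

# List the files that would be removed, without deleting anything.
spark.sql("VACUUM some_people RETAIN 168 HOURS DRY RUN").show(truncate=False)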

For further details about the VACUUM operation, you can read the official documentation.

A real scenario: Why should I use these techniques?

When we talk about optimizations we very often think of a style exercise to make our code or infrastructure cleaner and cooler.

In the world of big data, however, this is not the case.

In most cases, the architectures are deployed in the cloud using providers such as AWS, Google Cloud or Microsoft Azure.

Most cloud models and components follow a PAY-AS-YOU-GO paradigm: basically, we pay for what we use, in terms of both processing time and storage.

Let’s take for example the case of an AWS-based infrastructure on which we implement our delta lake:

  • Parquet files are stored in S3 buckets (Storage Cost)
  • Table queries are done via Redshift or Athena (Computation cost)
  • Insert, update or delete operations are performed via AWS Glue jobs (Computation cost)
  • Operation logs are recorded via CloudWatch for system security reasons (Storage Cost, and please don’t think they are negligible!)

What are the consequences of using optimizations on the Delta Table?

The VACUUM operation lets us remove all the files that are no longer needed and no longer participate in recent versions of the Delta table, reducing storage costs on S3.

OPTIMIZE and Z-ORDER reduce the number of Parquet file reads and writes, and therefore the number of operations, cutting both processing times and the number of logs produced. In this way, we save on Redshift and Glue processing time, as well as on the log storage in CloudWatch.

Now it should be clearer why you should consider optimizing your Delta table right away :)

Challenges and Considerations

While these optimization techniques offer substantial benefits, they’re not without challenges. Optimize might incur some overhead during its execution, Z-Order requires careful column selection, and Vacuum needs to be scheduled and configured correctly. Additionally, balancing optimization with frequent data modifications requires thoughtful planning.
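
One pragmatic way to address the scheduling concern is to bundle compaction and vacuuming into a single maintenance job that your orchestrator runs on a schedule (for example, nightly). The sketch below only illustrates the idea; run_table_maintenance, its parameters and the example column name are hypothetical.

from delta.tables import DeltaTable

def run_table_maintenance(spark, table_path, zorder_cols=None):
    """Hypothetical nightly maintenance: compact (optionally Z-order) and vacuum."""
    table = DeltaTable.forPath(spark, table_path)
    if zorder_cols:
        table.optimize().executeZOrderBy(*zorder_cols)
    else:
        table.optimize().executeCompaction()
    # Keep the default 7-day retention so recent versions stay time-travellable.
    table.vacuum()

# Example call, to be scheduled from Airflow, AWS Glue, cron, ...
run_table_maintenance(spark, "/path/to/delta-table", zorder_cols=["customer_id"])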

Future Trends in Delta Lake Optimization

Looking ahead, Delta Lake optimization is trending towards more automation: features such as optimized writes and auto compaction aim to reduce the need for manual OPTIMIZE runs, and there is growing interest in using machine learning to guide data layout decisions. It is reasonable to expect Delta Lake itself to keep incorporating more built-in optimization features over time.

Conclusion

Optimizing a Delta Lake is an ongoing journey that requires a deep understanding of your data, workloads, and optimization techniques.

By leveraging Optimize, Z-Order, and Vacuum, you can unlock remarkable performance improvements, cost savings, and operational efficiencies.

In this post, we have seen how to optimize a Delta table through operations like OPTIMIZE, Z-ORDER and VACUUM. Furthermore, we discussed how an entire infrastructure can benefit from these operations in a real-world scenario.

Remember, there’s no one-size-fits-all solution; the key is to tailor your optimization strategy to your unique data landscape.

So, dive in, experiment, and let these techniques transform the way you manage and process data in your Delta Lake.

I strongly suggest reading the official documentation and blog posts on delta.io (the code in this post is taken directly from the official blog).

If you enjoy this little journey around the core implementation of a Delta Table please, CLAP YOUR HANDS 👏!
