Understanding Delta Lake: Bridging Data Lakes and Warehouses

RAKESH CHANDA
7 min read · May 16, 2024


Introduction

In the era of cloud computing, cloud object storage services such as Amazon S3 and Google Cloud Storage have revolutionized data management, offering virtually unlimited scalability, pay-as-you-go billing, durability, and reliability. Consequently, many organizations now use cloud object stores to manage large structured datasets in data lakes and data warehouses. Open-source engines such as Apache Spark and Presto can read and write to cloud object stores using open file formats like Apache Parquet and ORC, and cloud services such as Google BigQuery and Redshift Spectrum can query data in these formats directly from object storage.

However, despite these advantages, cloud object storage has limitations when applied to the Lakehouse architecture, which layers data warehouse capabilities on top of a data lake. Key challenges include achieving ACID transactions, high performance, and efficient metadata operations. To address these challenges, Databricks designed Delta Lake, an ACID table storage layer on cloud object storage, which was released to customers in 2017 and open-sourced in 2019.

Motivation Behind Delta Lake

Object Store APIs

Cloud object stores implement a key-value store with a focus on scalability. They allow users to create buckets that store multiple objects identified by string keys. Unlike traditional file systems, these stores do not support directories; the path-like structure in object keys (e.g., “/data/country=us/date=2024-04-27/object_name”) is merely a naming convention. Metadata APIs, such as S3’s LIST operation, can list objects in a bucket based on a prefix, but these operations are typically expensive and have high latency. For example, S3’s LIST operation returns up to 1,000 keys per call, and each call can take tens to hundreds of milliseconds, making it time-consuming to list large tables with millions of objects.
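
To make that cost concrete, here is a minimal sketch of prefix-based listing with boto3 (the bucket and prefix names are assumptions). Because each LIST page returns at most 1,000 keys, listing a table with millions of objects takes thousands of round trips.

```python
# Minimal sketch of prefix-based listing with boto3; the bucket and prefix
# names are hypothetical. Each LIST page returns at most 1,000 keys, so a
# table with millions of objects needs thousands of calls.
import boto3

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

count = 0
for page in paginator.paginate(
    Bucket="my-data-lake",
    Prefix="data/country=us/date=2024-04-27/",
):
    count += len(page.get("Contents", []))

print(f"{count} objects under the prefix")
```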

Consistency Properties

Most cloud object stores provide eventual consistency for operations on individual objects and no consistency guarantees across objects. This means that after an object is written or modified, other clients may not immediately see the change in LIST or read operations. Google Cloud Storage, for example, provides strong consistency for many operations but still lacks atomic operations across keys. This complicates managing tables composed of multiple objects, as updates are not immediately or atomically visible to all clients.

Performance Considerations

Achieving high throughput with object stores requires a balance of large sequential I/Os and parallelism. For reads, sequential operations typically have latencies of 5–10 ms and can achieve data transfer rates of 50–100 MB/s. To maximize throughput, applications need to read several hundred kilobytes to multiple megabytes at a time and run multiple reads in parallel. LIST operations also require significant parallelism to quickly process large sets of objects. For writes, objects must be replaced entirely, making updates to large objects costly.

Existing Approaches

Three popular methods for handling relational data in object storage include:

  1. Directories of Files: Storing tables as collections of objects using formats like Parquet, organized into partition directories (e.g., /data/country=us/date=2024-04-27/object_name). This reduces LIST operation costs but lacks atomicity across multiple objects and has poor performance for complex operations.
  2. Custom Storage Engines: These engines manage metadata in a strongly consistent service, treating object storage as a block device. This approach is used by cloud data warehouses like Snowflake and Google BigQuery but is tied to specific service providers.
  3. Metadata in Object Stores: Delta Lake’s approach stores transaction logs and metadata directly in the cloud object store, achieving serializability using a set of protocols over object store operations. Table data is stored in Parquet format, facilitating access from various software tools.

Delta Lake Storage Format

A Delta Lake table consists of data objects and a transaction log, stored in a cloud object storage directory.

Data Objects

Data is stored in Apache Parquet objects, organized into directories using Hive’s partition convention. Parquet is a widely used columnar format supporting various compression schemes and nested data types, simplifying connector development for Delta Lake.
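
As a quick illustration (the path and columns are assumptions, and the SparkSession is assumed to be configured with the delta-spark package), writing with partitionBy yields exactly the Hive-style key=value directory layout described above.

```python
# Minimal sketch of writing a partitioned Delta table with PySpark.
# The path and columns are hypothetical; the SparkSession is assumed to be
# configured with the delta-spark package.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("us", "2024-04-27", 100), ("de", "2024-04-27", 42)],
    ["country", "date", "value"],
)

# partitionBy produces Hive-style key=value directories, e.g.
# .../country=us/date=2024-04-27/part-....parquet
(df.write.format("delta")
    .mode("append")
    .partitionBy("country", "date")
    .save("s3://my-bucket/data/events"))
```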

Transaction Log

The transaction log is stored in a sub-directory (_delta_log) within the table directory. It comprises JSON objects with zero-padded, sequential numerical IDs (e.g., 000003.json), each containing the actions to apply to the table’s previous version to generate the next one. Actions include the following (a small inspection script follows the list):

  • Change metadata: Alters the table’s current metadata, including schema, partition columns, and data file formats.
  • Add or Remove Files: Modifies the table by adding or removing data objects, with optional statistics such as min/max values or null counts.
  • Protocol Evolution: Updates the Delta protocol version.
  • Add Provenance Information: Records commit information, such as which operation performed the change.
  • Update Application Transaction IDs: Lets applications record their own transaction IDs alongside a commit, which helps build end-to-end transactional (for example, exactly-once streaming) writers.
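
To make these actions concrete, the sketch below reads one commit file from a hypothetical, locally accessible table path; in practice each commit file is newline-delimited JSON, with one action per line whose top-level key names the action type.

```python
# Sketch: inspecting one commit file in a table's _delta_log directory.
# The local path is hypothetical; each line of the file is one JSON action.
import json

with open("/data/my_table/_delta_log/00000000000000000003.json") as f:
    for line in f:
        action = json.loads(line)
        # Typical top-level keys: "metaData", "add", "remove", "protocol",
        # "commitInfo", and "txn" (application transaction IDs).
        print(next(iter(action)))
```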

Log Checkpoints

Delta Lake periodically compacts the log into Parquet checkpoint files that summarize all non-redundant actions up to a specific log ID. Checkpoints improve performance because readers can load the table’s state from a single Parquet file instead of replaying many small JSON files or issuing extensive LIST operations to find data objects and metadata.

Access Protocols

Reading from Tables

Reading a Delta Lake table involves several steps (a conceptual sketch follows the list):

  1. Read the _last_checkpoint object: Identifies the most recent checkpoint.
  2. Issue a LIST operation: Finds newer JSON and Parquet log files.
  3. Reconstruct the table state: Uses the checkpoint and subsequent log records.
  4. Identify necessary data objects: Uses statistics for the read query.
  5. Read data objects in parallel: Handles eventual consistency by retrying if objects are not immediately available.
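
Roughly, steps 1-2 look like the following against S3 (the bucket and key prefix are assumptions; a real client such as delta-spark also handles missing or multi-part checkpoints, retries, and data skipping).

```python
# Conceptual sketch of the read protocol over S3; bucket and prefix are
# hypothetical, and real clients do considerably more work than this.
import json
import boto3

s3 = boto3.client("s3")
BUCKET, LOG = "my-data-lake", "my_table/_delta_log/"

# 1. Read _last_checkpoint to find the most recent checkpoint version.
last = s3.get_object(Bucket=BUCKET, Key=LOG + "_last_checkpoint")
checkpoint_version = json.loads(last["Body"].read())["version"]

# 2. LIST only for log records newer than the checkpoint.
resp = s3.list_objects_v2(
    Bucket=BUCKET,
    Prefix=LOG,
    StartAfter=LOG + f"{checkpoint_version:020d}.json",
)
newer_commits = [o["Key"] for o in resp.get("Contents", []) if o["Key"].endswith(".json")]

# 3-5. Replay the checkpoint plus these JSON records to reconstruct the set of
# live data objects, prune them using per-file statistics, then read in parallel.
```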

Write Transactions

Writing to a Delta Lake table involves:

  1. Identify a recent log record ID: Using the read protocol.
  2. Read the data at the current version.
  3. Write new data objects: In parallel, generating GUIDs for object names.
  4. Write the transaction log record: Attempting an atomic, put-if-absent write of the next log ID; if another writer has already claimed that ID, the transaction is retried or aborted.
  5. Write a new checkpoint: Optionally update the _last_checkpoint file.

To ensure atomicity, Databricks uses different methods based on storage systems, such as atomic put-if-absent operations for Google Cloud Storage and Azure Blob Store, atomic renames for HDFS, and a coordination service for Amazon S3.
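
A rough sketch of this optimistic commit step is shown below. The put_if_absent callable is hypothetical, standing in for whichever store-specific primitive applies (a conditional upload, an atomic rename, or a coordination service); real clients also check for logical conflicts before retrying.

```python
# Conceptual sketch of the optimistic commit loop. `put_if_absent` is a
# hypothetical callable standing in for the store-specific atomic primitive
# described above; it returns True only if the key did not already exist.
import json

def commit(log_prefix, actions, read_version, put_if_absent, max_attempts=10):
    """Try to write `actions` as the next log record after `read_version`."""
    record = "\n".join(json.dumps(a) for a in actions)
    version = read_version + 1
    for _ in range(max_attempts):
        key = f"{log_prefix}/{version:020d}.json"
        if put_if_absent(key, record):
            return version  # no other writer claimed this version: committed
        # Lost the race: another writer committed first. A real client re-reads
        # the new commits, checks for logical conflicts, and only then retries.
        version += 1
    raise RuntimeError("too many concurrent writers; transaction aborted")
```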

Available Isolation Levels

Delta Lake supports serializable write transactions, which commit in a serial order of increasing log record IDs. Read transactions achieve snapshot isolation by default; clients often cache the most recent log record ID they have read so that successive reads see monotonically advancing versions, and a reader can obtain full serializability by issuing a read/write transaction against the latest version.

Higher-Level Data Management Features

Time Travel and Rollbacks

Because data objects and the transaction log are immutable, Delta Lake supports querying historical snapshots of a table (time travel) and rolling back to earlier versions. Users can configure a data retention interval and read table snapshots using timestamps or commit IDs. The CLONE command allows creating copies of tables at specific versions.
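
For illustration, a minimal PySpark sketch of time travel (the table path is an assumption, and the SparkSession is assumed to be configured with delta-spark):

```python
# Sketch: reading older snapshots of a Delta table. The path is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes delta-spark is configured
path = "s3://my-bucket/data/events"

df_v5 = spark.read.format("delta").option("versionAsOf", 5).load(path)
df_old = (spark.read.format("delta")
          .option("timestampAsOf", "2024-04-27 00:00:00")
          .load(path))
```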

Efficient UPSERT, DELETE, and MERGE

Delta Lake supports transactional UPSERT, DELETE, and MERGE operations using add and remove records in the log files. These operations are executed atomically, simplifying data updates and ensuring consistency.
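
A short sketch of a transactional upsert and delete through the DeltaTable API (the table path and column names are assumptions):

```python
# Sketch of a transactional MERGE (upsert) and DELETE via the DeltaTable API.
# The table path and columns are hypothetical; delta-spark is assumed.
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()
target = DeltaTable.forPath(spark, "s3://my-bucket/data/customers")
updates = spark.createDataFrame([(1, "alice@new.example")], ["id", "email"])

(target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

target.delete("id = 42")  # also recorded atomically in the transaction log
```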

Streaming Ingest and Consumption

Delta Lake can act as a message queue, supporting streaming pipelines with features such as write compaction, exactly-once streaming writes, and efficient log tailing. Streaming producers can write small data objects for fast ingestion, which can later be compacted without affecting readers.
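
A minimal Structured Streaming sketch using a Delta table as both source and sink (paths are assumptions; delta-spark is assumed to be configured):

```python
# Sketch: a Delta table as a streaming source and sink. Paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes delta-spark is configured

events = spark.readStream.format("delta").load("s3://my-bucket/data/events")

query = (events.writeStream
         .format("delta")
         .option("checkpointLocation", "s3://my-bucket/checkpoints/events_mirror")
         .outputMode("append")
         .start("s3://my-bucket/data/events_mirror"))
```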

Data Layout Optimization

Delta Lake supports various layout optimizations (a short example follows the list), including:

  • OPTIMIZE command: Compacts small objects into larger ones.
  • Z-Ordering: Reorganizes records to achieve locality along multiple dimensions.
  • Auto Optimize: Automatically compacts newly written data.
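
As an example, compaction and Z-ordering can be triggered via SQL (the table path is an assumption; OPTIMIZE and ZORDER BY are available on Databricks and in recent open-source Delta Lake releases):

```python
# Sketch: layout optimization via SQL. The table path is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes delta-spark is configured
spark.sql("OPTIMIZE delta.`s3://my-bucket/data/events` ZORDER BY (country, date)")
```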

Caching

Because data objects and log files are immutable, clients such as Databricks clusters can safely cache them on local storage, accelerating both metadata operations and data reads.

Audit Logging

Delta Lake’s transaction logs can serve as audit logs, allowing users to view table history using the DESCRIBE HISTORY command.
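
For example, the commit history can be inspected as follows (the table path is an assumption; DeltaTable.forPath(...).history() is the programmatic equivalent):

```python
# Sketch: viewing a table's audit history via DESCRIBE HISTORY. Path is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes delta-spark is configured
(spark.sql("DESCRIBE HISTORY delta.`s3://my-bucket/data/events`")
    .select("version", "timestamp", "operation", "operationParameters")
    .show(truncate=False))
```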

Schema Evolution and Enforcement

Delta Lake supports transactional schema changes, updating physical objects as needed and maintaining a history of schema updates. Clients ensure newly written data adheres to the table’s schema.
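
A minimal sketch of schema evolution on write (the path and the new "device" column are assumptions); without the mergeSchema option, a write with a mismatched schema is rejected by schema enforcement:

```python
# Sketch: evolving the schema on write with mergeSchema. Path/columns are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumes delta-spark is configured
new_rows = spark.createDataFrame([(1, "us", "mobile")], ["id", "country", "device"])

(new_rows.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")  # adds the new "device" column to the table schema
    .save("s3://my-bucket/data/events"))
```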

Performance Observations

Databricks conducted performance experiments demonstrating Delta Lake’s advantages:

  • Impact of Many Objects or Partitions: Databricks Runtime with Delta Lake outperforms other systems on tables with large numbers of partitions or objects, because the transaction log lets it locate relevant files without expensive LIST operations.
  • Impact of Z-Ordering: Significantly reduces the number of Parquet objects read for queries.
  • TPC-DS Performance: Delta Lake shows superior performance on standard DBMS benchmarks.
  • Write Performance: Writing to Delta Lake is as efficient as writing to Parquet, with minimal overhead from statistics collection.

Conclusion

Delta Lake is a powerful tool that bridges data lakes and warehouses, offering robust ACID transactions and high performance on cloud object storage. Its design addresses the challenges of managing relational data in object stores, making it a valuable building block for modern data architectures.

Future Exploration

Considering the success and capabilities of Delta Lake, it would be interesting to explore other table formats like Iceberg or Hudi. These formats also aim to address similar challenges and could provide additional insights and improvements for data management in the cloud.

Outro

Thank you for reading this comprehensive overview of Delta Lake. Stay tuned for more insights into other table formats and data management solutions in future articles. See you soon!

