Demystifying Delta Lake

Saurabh Mishra · Published in Analytics Vidhya · Jul 29, 2021 · 11 min read

In the real data world, the majority of business problems are solved with ubiquitous relational databases, and for several reasons that is a valid choice. But with the advent of social media, machine data (IoT devices, PoS machines), and transactional data, companies started to generate and consume very large datasets. The term Big Data was coined to describe such unmanaged, massive datasets. Solving business problems with traditional business intelligence tools, and especially with databases, became a bottleneck because storing, managing, and computing over such huge datasets required a cost-effective and fast processing solution. To address this, Hadoop was created at Yahoo, building on the MapReduce concept described in a Google white paper. However, MapReduce had its own limitations: its conventional algorithms were not designed around in-memory processing (it uses intermediate disk storage for its operations). Apart from that, MapReduce and its related tools mostly solved batch data processing and offered no support for ACID properties, real-time streaming data, or a good framework for data science to extract insights from big data.

So, to solve MapReduce's problems, Apache Spark was introduced as a unified analytics engine for big data and machine learning. The fundamental computational principle of Apache Spark is in-memory processing. However, despite the initial advancement Spark brought, a few fundamental gaps remained that kept it from fully supporting modern data architectures for storing, managing, and processing big data. Consider, for example, Data Lakes, which are defined as:

  • Storage that is inexpensive and can hold a variety of raw data in its native format (XML, JSON, CSV, ORC, Parquet, etc.)
  • Storage that can maintain historical dumps of these distinct formats.
  • Historical data may require different types of data changes, which in turn demand transactional operations on the data.

The properties above, described as the flexibility of a Data Lake, can turn it into a Data Swamp if it is not organized and managed properly. The same is true for all file-based architectures into which data gets extracted from Data Lakes. A Data Lake also brings challenges that need to be solved:

  • Frequent refreshes of metadata
  • Handling schema changes and also enforcing schema
  • Handling small files (in the case of streaming data ingestion and processing)
  • Modifying existing data
  • Optimizing performance
  • Managing and re-sorting data by an index when data is spread across many files and partitioned by some columns

Keeping the above challenges in mind, and to avoid turning a precious data lake into a data swamp, Databricks launched a product named Delta Lake. It has since been open-sourced, and in this blog we are going to see how it is useful and where we can leverage it in different data applications.

Delta Lake

Delta Lake is a file-based, open-source storage format that enables organizations to build modern data products, bringing governance and structure of data into a single place and, most importantly, providing reliability on the data so that the insights an organization generates are trustworthy.

It solves the major challenges highlighted above by adopting:

  • an open-source file format (Parquet)
  • first-class support for Business Intelligence, Machine Learning, and Data Science
  • solutions to major data warehouse challenges, including

a. Reliability — Keeping the data lake and warehouse consistent

b. Data Staleness — Data in the warehouse is stale compared to the data lake as it gets fewer and infrequent updates.

Source: https://databricks.com/product/data-lakehouse

Now, after reading the definition of Delta Lake, a naive question might pop up: well, it is clear that this is some storage system that intelligently manages gaps that could not be filled before, but how exactly does data get organized and pipelined by Delta Lake?

This is achieved by following a data quality pattern from data ingestion to data consumption. As we move from ingestion to consumption, the quality of the data keeps improving. To represent this idea, Delta Lake splits the data quality process into different layers, called the bronze, silver, and gold layers.

Bronze: Keeps data in as-is form (raw form, e.g. JSON, Parquet, IoT data, XML, etc.)

Silver: A more refined view of the organization's data. It can be queried directly; the data is clean, normalized, and can be considered a single source of truth.

Gold: The aggregated data layer, reserved for different business use-cases.

Reports and analytics are built from the Gold layer, because that is where data is aggregated per business use-case.

These are indeed fancy names, but they are deserved: in each layer the data quality improves, and in the final stage we get data that can be used for analytics.
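To make the layering concrete, here is a minimal sketch of a bronze to silver to gold pipeline. It assumes spark is an existing Delta-enabled SparkSession (as in the rest of this post), and the paths, dataset, and column names (order_id, amount, order_date) are made up for illustration.

from pyspark.sql import functions as F

# Bronze: land the raw JSON as-is into a Delta table
raw_df = spark.read.json("/data/raw/orders/")
raw_df.write.format("delta").mode("append").save("/delta/bronze/orders")

# Silver: clean and normalize the bronze data into a single source of truth
bronze = spark.read.format("delta").load("/delta/bronze/orders")
silver = bronze.dropDuplicates(["order_id"]).filter(F.col("amount").isNotNull())
silver.write.format("delta").mode("overwrite").save("/delta/silver/orders")

# Gold: aggregate per business use-case, e.g. daily revenue for reporting
gold = silver.groupBy("order_date").agg(F.sum("amount").alias("daily_revenue"))
gold.write.format("delta").mode("overwrite").save("/delta/gold/daily_revenue")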

To understand how the above challenges are addressed by Delta Lake, we need to deep dive into its key elements and features.

Delta Lake Elements

Delta Lake has 4 main elements that act as the foundation on which its features are built (the features are described in the next section).

  1. Delta Files — Delta uses Parquet files, a columnar file format optimized for storing tabular data.

Question: If it uses Parquet files, then how is Delta different?

Answer: Delta does indeed use Parquet files for its storage, but the difference between a Parquet table and a Delta table is the _delta_log folder, which stores the Delta transaction log and enables the various Delta features (discussed in the next section).

2. Delta Tables — A collection of data that consists of 3 things:

a) Data files containing data

b) A Delta table registered in a Metastore which tracks the metadata.

c) A transaction log with Delta files to keep track of all transactions.

3. Delta Optimization Engine — Delta Engine accelerates data lake operations and supports a variety of workloads ranging from large-scale ETL processing to ad-hoc, interactive queries. Many of the optimizations get applied automatically.

4. Delta Lake Storage Layer — The place where the organization keeps its data cheaply in object storage; Delta is then used to access these files.

Delta Lake Features

Compatible with Apache Spark API

This feature makes it possible to switch an application's underlying storage smoothly to Delta Lake. Being compatible with the Apache Spark API also makes it easy to leverage the new features. It is especially useful for projects that were developed on Spark and need to migrate to Delta.

The flow required to move to Delta:

At the file level

Any file format -> Parquet -> Delta

At the code level

Only a little tweaking is needed, since Delta is built on the same Spark engine. Look at the snippet below:

from delta.tables import DeltaTable

# Convert an existing Parquet table to Delta in place.
# The identifier can also be path-based, e.g. "parquet.`/path/to/table`".
parquet_table = "some_parquet_table"
partitioning_scheme = "some_id int"   # schema of the partition columns, if the table is partitioned
DeltaTable.convertToDelta(spark, parquet_table, partitioning_scheme)

Doesn’t the above snippet look just like familiar Spark code?

ACID transaction

All data changes written to Delta storage are committed for durability, with a guarantee of no partial or corrupted files. This means readers always read consistent data.

To understand the ACID behavior, we need to unpack the transaction files that Delta Lake maintains to provide this property.

To check the transaction log, we can list the _delta_log folder, where all transaction-related data gets captured.

Inside the _delta_log folder, we can see two files are created, a .crc and a .json, each named with a long sequence of zeros (e.g. 00000000000000000000.json). If any operation happens on the data, another .json and .crc file with an incremented sequence number gets generated. It is important to understand what content these files capture and why they are useful.
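For instance, a quick way to list the log folder of a Delta table on a local filesystem; the path below is a made-up example, and tables on cloud object storage would be listed with the storage API instead.

import os

delta_path = "/tmp/delta/events"   # illustrative table location
print(sorted(os.listdir(os.path.join(delta_path, "_delta_log"))))
# e.g. ['00000000000000000000.crc', '00000000000000000000.json']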

The CRC (cyclic redundancy check) file helps Spark optimize its queries, as it provides key statistics about the data.

The JSON file, on the other hand, captures a lot of information. If we read the JSON file and list its columns, we see there are 4 columns named add, commitInfo, metaData, and protocol.
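A minimal sketch of reading that first log entry with Spark, using the same illustrative path as above (each line in the log file is one JSON action):

# Read the first transaction log entry and inspect its top-level fields
log_df = spark.read.json("/tmp/delta/events/_delta_log/00000000000000000000.json")
log_df.printSchema()   # shows the add, commitInfo, metaData, and protocol structs
log_df.select("add.path", "add.size", "add.stats").show(truncate=False)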

If we further expand the add column, we see content confirming the data change: modificationTime, path, size, and basic stats for the data. Expanding the metaData column similarly shows table-level information such as the table's schema and partition columns.

commitInfo — Information on commits.

protocol — Enables new features by switching the Delta Lake transaction log to the newest software protocol (reader and writer version)

Now let’s create a table from the written path and also update some records in the table.
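A minimal sketch of what that could look like, with a hypothetical table name, path, and columns:

from delta.tables import DeltaTable

# Register a table over the Delta files written earlier (illustrative location)
spark.sql("CREATE TABLE IF NOT EXISTS events USING DELTA LOCATION '/tmp/delta/events'")

# Update one record; Delta rewrites the affected data files and logs the change
dt = DeltaTable.forPath(spark, "/tmp/delta/events")
dt.update(condition="id = 100", set={"value": "'updated'"})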

The data is updated for the new id value, so internally the .json and .crc files have also changed to capture this data change. Let’s re-read the _delta_log folder and its files.

What changes do we see now?

As the table is updated, another JSON and CRC file gets created. If we open the 00000000000000000001.json file and look at its columns, we notice something: in 00000000000000000000.json the add column was present because we added records, and now an extra column named remove appears, showing which data files were replaced by the update. In its content we can find all the information about the transactions that have taken place on the dataset.

So, for every operation, Delta internally maintains files that capture the related information. This extra metadata is what allows Delta to decide how to handle different operations and to enable, on plain data files, the features that were not possible before with other file formats.

Takeaway —

The Delta Lake transaction log is the single source of truth for our data. ACID transactions also enable Delta to support upserts and merges.
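As an illustration of an upsert, here is a minimal merge sketch using the DeltaTable API; the updates DataFrame, join key, and path are hypothetical:

from delta.tables import DeltaTable

target = DeltaTable.forPath(spark, "/tmp/delta/events")
updates_df = spark.createDataFrame([(100, "clicked")], ["id", "value"])

(target.alias("t")
 .merge(updates_df.alias("s"), "t.id = s.id")
 .whenMatchedUpdateAll()        # update rows that already exist in the target
 .whenNotMatchedInsertAll()     # insert rows that do not
 .execute())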

Schema on Read

It is the concept where a plan or schema is applied to the data when it is pulled out of the storage location.

Before Delta, in traditional Spark or Hadoop architectures, whenever additional data was added to a table we had to run the MSCK REPAIR TABLE command to refresh the metastore of the backing file system before the newly added data could be read together with the old records. In Delta this problem is resolved: whenever users query the data, they are ensured to get the latest dataset.

Data Versioning aka Time Travel

Data versioning is similar to the concept we follow for code versioning (git or any other code management software). We can switch to a different version of the dataset based on a timestamp or a specific version number. This feature simplifies building pipelines for datasets that keep changing over time. Constantly changing data brings challenges: auditing, reproducing experiments, fixing data mistakes, and rolling back to a desired state if required.

How to use it?

It is easy to select data for a specific version or timestamp from a delta table.

.option("timestampAsOf", timestamp_expr) => to get specific timestamp e.g. option("timestampAsOf", "2018–10–18T22:15:12.013Z")
.option("versionAsOf", version) => to get specific version e.g. option("versionAsOf", 9)

It is also possible to roll back the table to a previous version.
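A minimal sketch of a rollback, assuming the same illustrative path; one straightforward approach is to read the desired old version and overwrite the current table with it:

# Time travel: read the table as it was at version 5
old_df = spark.read.format("delta").option("versionAsOf", 5).load("/tmp/delta/events")

# Roll back by overwriting the current state with the older snapshot
old_df.write.format("delta").mode("overwrite").save("/tmp/delta/events")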

Schema enforcement or Schema Validation

Schema validation checks the data on write and helps ensure data correctness. It is a mandatory check that rejects writes to a table that do not match the table's defined schema.

These rules determine whether a write to a table is allowed. The incoming data:

  • Cannot contain any additional columns that are not present in the target table's schema.
  • Cannot have column data types that differ from the column data types in the target table.
  • Cannot contain column names that differ only by case (e.g. "Foo" and "foo"), because Delta preserves the case of column names when storing the schema but does not treat them as distinct.

Note — This feature validates many things for which we would otherwise build custom data validators in real projects. Simply standardizing on the Delta format helps avoid a lot of data issues.
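A minimal sketch of schema enforcement rejecting a bad write; the table path and the extra column are made up for illustration:

# Suppose the target table has only the columns (id, value)
bad_df = spark.createDataFrame([(1, "a", "oops")], ["id", "value", "unexpected_col"])

try:
    # Without mergeSchema, the extra column violates the target schema
    bad_df.write.format("delta").mode("append").save("/tmp/delta/events")
except Exception as e:
    print("Write rejected by schema enforcement:", e)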

Schema Evolution

It lets users easily modify a table's current schema to accommodate data that changes over time. The general use of this feature is during append and overwrite operations, to automatically adapt the schema to include one or more new columns.

How to use it?

When writing the DataFrame to a Delta file, we need to set the mergeSchema option to true.

df.write.format("delta")\
.mode("overwrite")\
.option("path", "file_path")\
.option("mergeSchema", "true")\
.partitionBy("partition_col")\
.saveAsTable("table_name")

It is also possible to retrieve Delta table history

DESCRIBE HISTORY <table_name>                -- get the full history of the table
DESCRIBE HISTORY delta.`<delta_file_path>`   -- or reference the table by its file path
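The same history is also available from the Python API; a minimal sketch with an illustrative path:

from delta.tables import DeltaTable

# history() returns a DataFrame with one row per table version
DeltaTable.forPath(spark, "/tmp/delta/events").history().show(truncate=False)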

Optimization

So far we have seen that Delta files capture several statistics about the files they describe. These statistics were designed by the Delta Lake creators for a purpose, and one obvious purpose is to enable optimizations and enhance the performance of Delta Lake.

In this final section, we will learn some optimization techniques which can be used (if required) to improve the performance of the delta files/tables.

a. Optimize by Compaction

Optimize operation performs file compaction i.e. small files are compacted together into larger files (up to 1 GB).

Delta doesn’t support OPTIMIZE as an automatic process but expects users to perform this based on need.

How to use it?

OPTIMIZE <table_name>
-- or, for a path-based table:
OPTIMIZE delta.`<delta file location>`

b. Z-Order

This is a technique that collocates multidimensional data into one dimension while preserving the locality of the data points. In simple words, it is a way to keep related information in the same set of files. Internally, related data ends up in fewer files, so when a search or query gets fired, fewer files need to be scanned to return the result.

Once we apply ZORDER on certain columns (it cannot be applied to a partition column), Delta performs the following activities under the hood:

  • Takes existing parquet files within a partition.
  • Maps the rows within the parquet files according to the selected column for ZORDER.
  • In the case of only one column, the mapping above becomes a linear sort.
  • Rewrites the sorted data into new parquet files.

Question: How do we choose ZORDER columns?

Answer: The rule of thumb is to select columns that are frequently used in query predicates and have high cardinality (and never choose the partition columns). Multiple columns can also be selected, but each additional column reduces the effectiveness of the locality.

How to use it?

OPTIMIZE table_identifier [WHERE clause][ZORDER BY (col_name1, col_name2, ...)]  -- be mindful not to use partition column.
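For instance, a hypothetical events table partitioned by date and frequently filtered by eventType could be optimized like this from Python (OPTIMIZE with ZORDER runs on Databricks; the table and column names are made up):

# Compact recent files and co-locate rows by a frequently filtered, non-partition column
spark.sql("""
  OPTIMIZE events
  WHERE date >= '2021-07-01'
  ZORDER BY (eventType)
""")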

c. Partition Pruning

Partition pruning speeds up queries by minimizing the amount of data read. In simple terms: always put the partition column(s) in the WHERE clause to filter the data.

d. Data Skipping

Apply the WHERE clause with additional columns to filter records; this gives hints to the Delta files (which keep the statistics) so that unnecessary files can be skipped.
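A minimal sketch of a query that benefits from both partition pruning and data skipping, assuming a hypothetical events table partitioned by date:

# The date filter prunes whole partitions; the eventType filter lets Delta skip
# files whose min/max statistics show they cannot contain matching rows.
spark.read.format("delta").load("/tmp/delta/events") \
    .where("date = '2021-07-01' AND eventType = 'click'") \
    .show()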

e. Vacuum

A handy command that removes data files no longer referenced by the Delta table and older than a retention threshold, keeping only the relevant files and saving storage cost.

How to use it?

VACUUM <table_name> RETAIN <number_of_hours> HOURS
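The same operation is available from the Python API; a minimal sketch with an illustrative path and retention period:

from delta.tables import DeltaTable

# Remove files no longer referenced by table versions newer than the last 168 hours (7 days)
DeltaTable.forPath(spark, "/tmp/delta/events").vacuum(168)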

I have tried to capture my understanding of Delta here, and I found it important to understand its functionality so that it can be fully leveraged for different use-cases. If we are building or designing any analytical solution (data lake, data warehouse, machine learning, streaming data handling, etc.), Delta can safely be chosen as the standard storage format for data processing.

Thank you for reading 💕

References

These are excellent papers and resources from the Delta Lake creators if you want to dive deeper.

https://databricks.com/wp-content/uploads/2020/08/p975-armbrust.pdf

http://cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf

https://databricks.com/product/delta-lake-on-databricks

https://docs.databricks.com/delta/optimizations/file-mgmt.html
