Delta Lake Open Source Project

Eric Bellet
12 min read · May 13, 2019


The first unified data management system.

Delta Lake is an open source project that brings a lot of important features to Apache Spark and big data workloads.

  • The Evolution of Big Data
  • The current state of platforms
  • Delta Lake
  • Delta Lake Under the Hood
  • Delta Architecture

The Evolution of Big Data

It all started in the 1970s. Companies had data in their databases about how their products were doing, about revenue in different regions, and so on, but they could not put it all together; they were flying blind. So people came up with the concept of a data warehouse: take the most important data from the different databases and ETL (extract, transform, and load) it into a central data warehouse to get business intelligence. The data in the warehouse was pristine, reliable, and cleaned up to a particular schema, and because there was so much control over what went in, the system could be optimized and queries ran really fast. It also gave you transactional reliability, which meant that many business analysts could use BI tools concurrently against the same data warehouse and it just worked: no errors, no corruption. That was the great thing about data warehouses.

However, this was 30 to 40 years ago, and things have evolved quite a bit since then. Over time, people started to see the problems these data warehouses had. In particular, they were hard to scale: they worked well for small, pristine data, but taking in lots and lots of data was difficult, and they were not elastic, so doubling the size of a data warehouse was a complicated operation. They also required ETL to get data in, and that ETL process was so cumbersome that it was often run only once a week, so the data in the warehouse was stale; companies were operating on old data rather than in real time. On top of that, companies had another important topic in mind: AI.

Artificial Intelligence has been a trending topic for many years, but with all of their data inside a data warehouse, companies could not ask the warehouse to make predictions for them, because it only supported SQL. Finally, everything was based on closed formats, so once your data was in the system it was hard to get it out again: you were locked into the data warehouse.

Not future proof — Missing Predictions, Real-time, Scale

So, about 10 years ago, the Hadoop vendors showed up and said: what you need is a Hadoop data lake. A data lake lets you ETL all of your data into one scalable system: just take everything and dump it into this open data lake, and because it uses open formats you can run all kinds of use cases on top of it, including machine learning. All the problems you had with your data warehouse would go away, and you would get all these benefits with the data lake. You got massive scale, so you could store petabytes of data; it was inexpensive; it was based on open-source Hadoop technology; and it used open formats such as Parquet and ORC, so you could easily move your data around. It also carried the promise of AI and real-time streaming. People got really excited and started storing all their data in these data lakes, and over the last 10 years they have been storing more and more of it.

Become a cheap messy data store with poor performance

A data lake is a storage repository that inexpensively stores a vast amount of raw data, both current and historical, in native formats such as XML, JSON, CSV, and Parquet. It may also hold data extracted from operational relational databases, including live transactional data.

To extract meaningful information from a data lake, you must solve problems such as the following (a sketch of the manual maintenance involved appears after the list):

  • Schema enforcement when new tables are introduced.
  • Table repairs when any new data is inserted into the data lake.
  • Frequent refreshes of metadata.
  • Bottlenecks of small file sizes for distributed computations.
  • Difficulty sorting data by an index if data is spread across many files and partitioned.
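
To make the repair and metadata points above concrete, here is a minimal PySpark sketch of the manual upkeep a plain Parquet data lake typically needs; the table name events, its columns, and the path /data/lake/events are placeholder assumptions.

```python
# A sketch of manual maintenance on a plain (non-Delta) partitioned Parquet table.
# Table name, columns, and location are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-lake-maintenance").getOrCreate()

# An external, partitioned Parquet table registered over files already in the lake.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (userID STRING, action STRING, eventDate DATE)
    USING parquet
    PARTITIONED BY (eventDate)
    LOCATION '/data/lake/events'
""")

# Each time a job drops new partition directories into that location,
# the table has to be repaired before queries can see the new data.
spark.sql("MSCK REPAIR TABLE events")

# Cached file listings also go stale, so metadata must be refreshed frequently.
spark.sql("REFRESH TABLE events")
```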

AI applications run into a lot of trouble here. The problems with data lakes are, in a sense, the loss of exactly the advantages the data warehouse provided. Because the data was simply dumped in, you lose the pristine, controlled, schema-enforced data: it is now inconsistent and often dirty, and across petabytes of data you cannot simply trust that everything will work, so it is hard to build analytics on top of it. Performance is also poor, because the data was dumped in without being optimized for any particular use case. Many companies find that once they start building applications on top of the data lake, they do not get the performance they used to get from their data warehouses. In short, you end up with a messy data store and really poor performance.

The hardest part of Artificial Intelligence is not the Artificial Intelligence itself. In other words, Big Data was the missing link for AI, and most companies are struggling with Big Data.

The current state of platforms

So, what are enterprises doing today? Which of the two have they chosen: the data lake or the data warehouse?

Enterprises with Data Lake and Data Warehouse

Enterprises that use both still face problems:

  • They can keep only around two weeks of data in the data warehouse.
  • The data warehouse is very expensive to scale.
  • The architecture has poor agility in responding to new threats.
  • The architecture has scale limitations and holds no historical data.
  • The architecture needs more time and people to build than an architecture based on Delta Lake.

What are some of the pain points of existing data pipelines?

  • Introduction of new tables requires schema creation.
  • Whenever any new data is inserted into the data lake, table repairs are required.
  • Metadata must be frequently refreshed.
  • Small file sizes become a bottleneck for distributed computations (see the compaction sketch after this list).
  • If data is sorted by a particular index (e.g. eventTime), it is very difficult to re-sort the data by a different index (e.g. userID).
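
To illustrate the small-file point above, here is a minimal PySpark sketch of manual compaction on a plain Parquet data lake; the paths and the target file count are placeholder assumptions, and the final directory swap has to be done by hand, with no transaction protecting readers.

```python
# A sketch of manual small-file compaction without Delta Lake.
# Paths and partition counts are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("manual-compaction").getOrCreate()

# Coalesce many small files into fewer, larger ones.
compacted = spark.read.parquet("/data/lake/events").repartition(16)

# Write the compacted copy to a staging path. The swap back to the original
# location must then be done by hand, and readers running during the swap
# can observe partial or duplicated data because there are no transactions.
compacted.write.mode("overwrite").parquet("/data/lake/events_compacted")
```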

Delta Lake

Delta Lake is an open source project that brings ACID transactions to Apache Spark™ and big data workloads. Key Features (a short PySpark sketch of several of them follows this list):

  • ACID Transactions: Data lakes typically have multiple data pipelines reading and writing data concurrently, and data engineers have to go through a tedious process to ensure data integrity, due to the lack of transactions. Delta Lake brings ACID transactions to your data lakes. It provides serializability, the strongest isolation level.
  • Scalable Metadata Handling: In big data, even the metadata itself can be “big data”. Delta Lake treats metadata just like data, leveraging Spark’s distributed processing power to handle all its metadata. As a result, Delta Lake can handle petabyte-scale tables with billions of partitions and files with ease.
  • Time Travel (data versioning): Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments.
  • Open Format: All data in Delta Lake is stored in Apache Parquet format enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet. Delta Lake uses versioned Parquet files to store your data in your cloud storage. Apart from the versions, Delta Lake also stores a transaction log to keep track of all the commits made to the table or blob store directory to provide ACID transactions.
  • Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table, as well as a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
  • Schema Enforcement: Delta Lake provides the ability to specify your schema and enforce it. This helps ensure that the data types are correct and required columns are present, preventing bad data from causing data corruption.
  • Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
  • 100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
  • DELETES/UPDATES/UPSERTS — Writers can modify a data set without interfering with jobs reading the data set.
  • Automatic file management — Data access speeds up by organizing data into large files that can be read efficiently.
  • Statistics and data skipping — Reads are 10–100x faster when statistics are tracked about the data in each file, allowing Delta to avoid reading irrelevant information.
  • Avoids the data swamp (or data cesspool) that an unmanaged data lake tends to become.
  • Solves problems related to schema-on-read.
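
Here is a minimal PySpark sketch of several of these features (transactional writes, time travel, upserts, and streaming reads). It assumes the Delta Lake library is available on the Spark classpath, and the path, columns, and sample rows are placeholders rather than anything prescribed by the project.

```python
# A sketch of a few Delta Lake features; paths and data are illustrative only.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-features").getOrCreate()
path = "/tmp/delta/events"  # placeholder location on a blob store, DBFS, or local disk

# 1. Transactional write: data lands as versioned Parquet files plus a _delta_log.
events = spark.createDataFrame([("u1", "click"), ("u2", "view")], ["userID", "action"])
events.write.format("delta").mode("overwrite").save(path)

# 2. Time travel: read the table as of an earlier version for audits or rollbacks.
v0 = spark.read.format("delta").option("versionAsOf", 0).load(path)

# 3. Upsert (MERGE): writers modify the table without breaking concurrent readers.
updates = spark.createDataFrame([("u1", "purchase")], ["userID", "action"])
(DeltaTable.forPath(spark, path).alias("t")
    .merge(updates.alias("u"), "t.userID = u.userID")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# 4. Unified batch and streaming: the same table is also a streaming source.
(spark.readStream.format("delta").load(path)
    .writeStream.format("console")
    .option("checkpointLocation", "/tmp/delta/_checkpoints/events")
    .start())
```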

Delta Lake is the first unified data management system: it unifies data warehousing with data lakes. In particular, it gives you the scale of a data lake, the reliability and performance of a data warehouse, and the low latency of a real-time streaming engine. Delta Lake keeps the good parts of the data lake, namely the massive scale, because your data is stored on blob stores such as Amazon S3, Azure Data Lake Storage, or HDFS.

Enables predictions, real-time and ad-hoc analytics at Massive Scale

It is based on open formats such as Parquet, so your data stays readable elsewhere, and you can still do predictions, machine learning, and real-time streaming with it. In addition, Delta Lake controls the data it stores and makes sure it stays reliable, so you get the good parts of the data warehouse as well: the data is kept pristine, transactions are atomic so concurrent jobs do not fail or corrupt each other, and queries are really fast. Thanks to the optimizations applied as data moves into the system, we see anywhere between a 10 and 100x speed-up over running plain Apache Spark on a traditional data lake yourself. In short, it enables predictions, real-time, and ad-hoc analytics at massive scale. Delta Lake is especially important for a data lake that incorporates streaming data, because streaming pipelines must refresh metadata frequently, repair tables, and absorb small files that accumulate every second or every minute.

Delta Lake Under the Hood

The massive scale comes from decoupling compute and storage. Instead of a tightly coupled appliance like a traditional data warehouse, your data lives on a blob store such as Amazon S3, which is quite cheap today, and compute is scaled independently as needed: as more and more people use the system during the day, it automatically scales up and down. That elasticity is what gives you massive scale.

  • Reliability comes from ACID transactions and data validation: incoming data is validated against the table schema, and if it does not conform, it is rejected and you get a chance to fix it (see the sketch after this list).
  • Performance comes from traditional data warehouse techniques built into Delta Lake, such as indexing and caching of your data; that is how the 10 to 100x speed-up is achieved.
  • The whole system is built on Apache Spark Structured Streaming, so you get the real-time streaming ingestion that Spark provides.
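
As a small illustration of the validation point, the sketch below appends data whose schema does not match the table; Delta Lake rejects the write, and schema evolution has to be requested explicitly. The path and columns are placeholders.

```python
# A sketch of Delta Lake schema enforcement; path and columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.utils import AnalysisException

spark = SparkSession.builder.appName("delta-schema-enforcement").getOrCreate()
path = "/tmp/delta/events"

ok = spark.createDataFrame([("u1", "click")], ["userID", "action"])
ok.write.format("delta").mode("overwrite").save(path)

# This DataFrame has a column the table does not know about.
bad = spark.createDataFrame([("u2", 42)], ["userID", "score"])
try:
    bad.write.format("delta").mode("append").save(path)
except AnalysisException as err:
    print("Rejected by schema enforcement:", err)

# Schema evolution is opt-in: the new column is added only when asked for.
bad.write.format("delta").mode("append").option("mergeSchema", "true").save(path)
```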

Delta Architecture

The Delta Architecture is an evolution of the Kappa Architecture and solves the problems of the Lambda Architecture.

Example of Lambda Architecture

The Lambda Architecture has problems such as:

  • Historical queries: You have to build two separate pipelines, which can be very complex, and it is ultimately difficult to combine the processing of batch and real-time data. In other words, you duplicate your pipeline: one branch for streaming and another for batch.
  • Messy data: When different teams put events into a stream processing system such as Kafka, Kinesis, or Event Hubs, you need a validation process, and you have to check both the streaming version and the batch version of the data.
  • Mistakes and failures: Validation helps, but by the time it fires it is too late; the bad data has already landed in the data lake, and correcting it is difficult in a distributed system. It is not only human error, either: a spot-price spike can kill a cluster with half of the results written out, and that has to be cleaned up. A typical technique is to partition by some granularity, say by date or by hour, so that there are clean boundaries and reprocessing jobs can be built around them: if anything goes wrong, you fix the mistake and reprocess just the affected partitions (see the sketch after this list).
  • Query performance: Millions of files accumulate on S3 without compaction, so a query over this data can be very slow due to network latency and the sheer volume of file metadata.
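
To show what the partition-boundary reprocessing above looks like in practice, here is a minimal PySpark sketch for a Lambda-style Parquet lake; the landing path, lake path, and date are hypothetical placeholders.

```python
# A sketch of partition-boundary reprocessing in a Lambda-style Parquet lake.
# Paths and the date are placeholders; nothing protects readers during the rewrite.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lambda-reprocessing").getOrCreate()

# Recompute the affected day from its raw inputs (hypothetical landing path).
fixed_day = spark.read.json("/data/landing/events/2019-05-01")

# Overwrite only that partition directory. If this job dies halfway, the
# partition is left half-written and has to be cleaned up by hand.
(fixed_day.write
    .mode("overwrite")
    .parquet("/data/lake/events/eventDate=2019-05-01"))
```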

The Delta Architecture is a vast improvement on the traditional Lambda Architecture. With Delta Lake, queries achieve performance similar to a data warehouse because Delta Lake maintains logs and indexes that let it skip files and read only the data that is necessary. With the previous architecture, we had to unify streaming and batch manually, run manual validation, reason carefully about the boundaries at which reprocessing happened, and handle compaction ourselves. Delta Lake acts as a unified store: data from a variety of sources, whether Kafka, Kinesis, or anything else, can be written into it, and every consumer can read it with the strong guarantees of a data warehouse while still keeping the scale of a data lake. We get transactions, and with them reliability; we get indexes, and with them performance; and it integrates deeply with Spark, so we get low-latency streaming. Delta Lake delivers a powerful transactional storage layer by harnessing the power of Apache Spark and the Databricks File System (DBFS). It is a single data management tool that combines the scale of a data lake, the reliability and performance of a data warehouse, and the low latency of streaming in a single system. The core abstraction of Delta Lake is an optimized Spark table that stores data as Parquet files in DBFS and maintains a transaction log that tracks changes to the table. These tables are commonly organized into levels of refinement; a streaming sketch of the hops between them follows the list below.

  • Bronze level or raw tables: Text files, RDBMS data, and streaming data are collected into the bronze level. Raw tables capture streaming and historical data into a permanent record (streaming data tends to disappear after a short while), though they are generally hard to query.
  • Silver level or query tables: A raw table is then parsed into query tables, also known as silver. Query tables consist of normalized raw data that is easier to query.
  • Gold level or summary tables: Summary tables, also known as gold tables, are business-level aggregates. They’re often used for reporting, dashboards, and aggregations such as daily active website users. Summary tables contain aggregated key business metrics that are queried frequently and that would take too long to compute from the silver tables on every query.
  • Platinum level or visualization level: It is the layer for creating business reports, dashboards, and visualizations.
  • Diamond or dAImond level: This layer contains the Artificial Intelligence logic and can obtain data from the bronze, silver, or gold level. Solutions such as MLflow, Koalas, TensorFrames, and others run in this layer.
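
Below is a minimal PySpark sketch of the bronze-to-silver-to-gold hops described above, expressed as two Structured Streaming jobs over Delta tables. The paths, the bronze columns (userID, action, ts), and the daily-active-users aggregation are placeholder assumptions.

```python
# A sketch of a multi-hop (bronze -> silver -> gold) Delta pipeline.
# Paths, columns, and the aggregation are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("delta-medallion").getOrCreate()

bronze = "/tmp/delta/bronze"
silver = "/tmp/delta/silver"
gold = "/tmp/delta/gold"

# Silver: normalize raw bronze records into query-friendly columns.
(spark.readStream.format("delta").load(bronze)
    .select("userID", "action", F.to_date("ts").alias("eventDate"))
    .writeStream.format("delta")
    .option("checkpointLocation", "/tmp/delta/_checkpoints/silver")
    .start(silver))

# Gold: business-level aggregate, for example daily active users.
(spark.readStream.format("delta").load(silver)
    .groupBy("eventDate")
    .agg(F.approx_count_distinct("userID").alias("dailyActiveUsers"))
    .writeStream.format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/tmp/delta/_checkpoints/gold")
    .start(gold))
```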

This unified approach means less complexity, because storage systems and data management steps are removed, and, more importantly, output queries can be performed on streaming and historical data at the same time. That is the key difference from the Lambda Architecture, where streaming and historical data are treated as two separate branches feeding output queries.
