Delta Lake for Beginners: Data Lake + Data Warehouse And More

Sai Parvathaneni · Published in Towards Dev · Jul 30, 2023

Welcome to this beginner’s guide to Delta Lake! If you are interested in big data, this guide is for you. We’ll explain everything in a way that’s easy to understand, even if you’re new to this field. Let’s get started!

Traditional Data Lakes and Data Warehouses

Imagine you have a giant box where you throw in all your toys, books, clothes, and everything else. That’s like a data lake — a place where companies store all their data, no matter what type it is. But finding something specific in this box can be tough because everything is mixed up. Plus, there’s no way to ensure that the data is reliable and consistent.

Now, think of a well-organized closet with sections for shirts, pants, socks, and so on. That’s like a data warehouse — a place where data is organized and easy to find. But what if you get a new type of item, like a hat? You’d need to create a new section in your closet, which can be a hassle. Also, data warehouses can be expensive and aren’t great for handling new types of data, like streaming data.

The Drawbacks

With the toy box (data lake), it’s hard to find what you need because everything is mixed up, and there are no guarantees that the data is reliable or consistent. With the closet (data warehouse), adding a new type of item means reorganizing everything, and the organization comes at a cost: data warehouses are expensive and handle newer kinds of data, like streaming data, poorly.

The Data Lakehouse (Delta Lake)

What if you could have the best of both — a place where you can store any type of item and still find what you need easily? That’s what a data lakehouse like Delta Lake does. It lets you store all types of data (like a data lake) and keeps it well-organized (like a data warehouse). Plus, it adds features like ACID transactions (which ensure data reliability) and schema enforcement (which keeps the data consistent).

Delta Lake is an open-source storage layer built on top of a data lake that adds reliability and ACID (Atomicity, Consistency, Isolation, Durability) transactions. It enables a continuous, simplified data architecture for organizations. Delta Lake stores data in Parquet format and enables a lakehouse architecture, which helps organizations achieve a single, continuous data system that combines the best features of the data warehouse and the data lake while supporting both streaming and batch processing.

Delta Lake Architecture

Delta Lake is like a super-smart robot that helps you manage your data. It has a few main parts:

  • Transaction Log: This is like a diary that keeps track of everything you do with your data. It is a record of every change made to the data, which ensures data integrity and allows rollbacks in case of errors.
  • Data Files: This is where your actual data lives, like the toys in your toy box. Delta Lake stores data in Parquet format, a columnar storage file format optimized for speed and efficiency.
  • Checkpoint: This is a snapshot of the diary at a certain point. It summarizes the transaction log up to that point, which speeds up reading the table.
  • Readers and Writers: These are the parts of the robot that read data from and write data to the Delta Lake. They go through the transaction log to keep the data consistent and intact (a small on-disk sketch follows this list).
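
To make these parts concrete, here is a minimal sketch of what a Delta table looks like on storage. The path and table name are hypothetical (we create a real table later in this guide): the Parquet data files sit next to a _delta_log directory, which holds the numbered JSON commit records and periodic Parquet checkpoints.

# Illustrative only: list the contents of a hypothetical Delta table directory.
# The Parquet files are the data; _delta_log/ is the transaction log.
display(dbutils.fs.ls("/mnt/<mount-name>/my_table/"))

# Inside _delta_log/ you should see numbered JSON commits (00000000000000000000.json, ...)
# and, every few commits, a *.checkpoint.parquet summary.
display(dbutils.fs.ls("/mnt/<mount-name>/my_table/_delta_log/"))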
Architecture

Delta Lake is an improvement over the lambda architecture, in which streaming and batch processing run in parallel and their results are merged to answer queries. That approach adds complexity, because both the streaming and the batch pipelines have to be maintained and operated separately. Unlike the lambda architecture, Delta Lake is a continuous data architecture that combines streaming and batch workflows in a shared file store through a connected pipeline.

The stored data is organized into three layers, with the data getting more refined as it progresses downstream in the dataflow:

  • Bronze tables: These tables hold the raw data ingested from multiple sources, such as Internet of Things (IoT) systems, CRM, RDBMS, and JSON files.
  • Silver tables: This layer contains a more refined view of our data after undergoing transformation and feature engineering processes.
  • Gold tables: This final layer is often made available for end users in BI reporting and analysis or use in machine learning processes.

The three layers are referred to as a multi-hop architecture.
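
For illustration, here is a minimal sketch of a multi-hop flow in PySpark. The source path and table names (demo.bronze_events and so on) are hypothetical and not part of the dataset used later in this guide.

# Bronze: land the raw data as-is (hypothetical JSON source)
bronze_df = spark.read.format("json").load("/mnt/<mount-name>/raw/events/")
bronze_df.write.format("delta").mode("append").saveAsTable("demo.bronze_events")

# Silver: clean and deduplicate the bronze data
silver_df = spark.table("demo.bronze_events").filter("event_id IS NOT NULL").dropDuplicates(["event_id"])
silver_df.write.format("delta").mode("overwrite").saveAsTable("demo.silver_events")

# Gold: aggregate for reporting and BI
gold_df = spark.table("demo.silver_events").groupBy("event_type").count()
gold_df.write.format("delta").mode("overwrite").saveAsTable("demo.gold_event_counts")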

Setting Up Delta Lake on Azure

Setting up Delta Lake on Azure is like building a LEGO set — you follow the instructions step by step. Here’s how you do it:

1. Create an Azure account: If you don’t have one already, you’ll need to create an Azure account. You can do this on the Azure website.

2. Create a new Storage Account: Create a Storage account and Container that will host your Delta Lake. I created a storage account called “deltalakemediumdemo” and a container within called “bronze”. Follow the steps from this article to create a Storage account and a container.

Create a storage account and a container

3. Create a Databricks workspace: In the Azure portal, go to the Databricks section and create a new workspace. You can refer to my previous article to set up a Databricks workspace, Notebook, and Cluster.

4. Launch the Databricks workspace: Once the workspace is created, launch it by clicking on the “Launch Workspace” button.

Launch Workspace

5. Install Delta Lake: In your Databricks workspace, create a new notebook. In the notebook, you can install Delta Lake by running the following command:

%pip install delta-spark

6. Mount Azure Blob Storage: To connect your Delta Lake to Azure Blob Storage, you’ll need to mount the storage blob to Databricks. Here’s a script that shows how to do it:

dbutils.fs.mount(
  source = "wasbs://<container-name>@<storage-account-name>.blob.core.windows.net",
  mount_point = "/mnt/<mount-name>",
  extra_configs = {"fs.azure.account.key.<storage-account-name>.blob.core.windows.net": "<access-key>"}
)

Replace <container-name>, <storage-account-name>, <mount-name>, and <access-key> with your Azure Blob Storage details.

Here is how you can find these details:

<container-name> : bronze

<storage-account-name> : deltalakemediumdemo

<mount-name> : deltalakemediumdemo/bronze (so that the mount point becomes /mnt/deltalakemediumdemo/bronze)

<access-key> :

Navigate to your Storage Account Service. Under “Access Keys”, copy the “key” value from either of the 2 available keys and paste it into your code.

Mount Container to Databricks
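
Once the mount command has run, a quick optional check is to list the existing mounts and the (still empty) container:

# The new mount point should appear in this list
display(dbutils.fs.mounts())

# Listing it should succeed; it will be empty until we write data
display(dbutils.fs.ls("/mnt/<mount-name>/"))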

Working with Data in Delta Lake

Writing Data to Delta Lake

Writing data to Delta Lake is straightforward and similar to writing data to a traditional data lake. However, Delta Lake provides additional benefits such as transactional guarantees and schema enforcement.

Let’s say we have a DataFrame df that we want to write to a Delta Lake table. Here's how we can do it:

First, we have to create a database in our container location:

%sql
CREATE DATABASE IF NOT EXISTS database_name
LOCATION '/mnt/<mount-name>/'
Create a Database

Replace <mount-name> with your actual mount point.
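
If you want to confirm where the database landed, an optional check is to describe it; the location should point at your mounted container:

# Optional: verify the database location points at the mounted container
spark.sql("DESCRIBE DATABASE EXTENDED database_name").show(truncate=False)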

Next, let’s read a sample CSV file provided by Databricks and modify the column names to comply with Delta Lake’s column-name rules (no spaces or other special characters in column names).

df = spark.read.format('csv').options(header='true', inferSchema='true').load("/databricks-datasets/samples/population-vs-price/data_geo.csv")

# Iterate over the columns and remove spaces from the column names
for col in df.columns:
    df = df.withColumnRenamed(col, col.replace(' ', ''))
Read CSV file

This script reads a sample CSV file from the Databricks default datasets into the DataFrame df and removes the spaces from its column names.

Next, write this DataFrame to a Delta Lake table.

# Write the DataFrame to a Delta Lake table
df.write.format("delta").mode("overwrite").saveAsTable("database_name.table_name")

# Check if table is in our container
dbutils.fs.ls("/mnt/<mount-name>")
Write data to Delta Lake table

If the table doesn't exist, Delta Lake will create it. If it does exist, Delta Lake will overwrite the data in the table.

One of the benefits of writing data to Delta Lake is that it provides transactional guarantees. This means that each write operation is atomic — it either fully succeeds or fully fails. If a write operation fails, Delta Lake ensures that the table remains in a consistent state, without any partial updates.

Another benefit is schema enforcement. When writing data to a Delta Lake table, Delta Lake checks that the data matches the table’s schema. If the data doesn’t match the schema, Delta Lake will reject the write operation. This helps to ensure that the data in the table is always in the expected format.
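
As a quick illustration of schema enforcement, here is a hedged sketch: appending a DataFrame that carries an extra, hypothetical Region column should be rejected, unless you explicitly opt in to schema evolution with the mergeSchema option.

# Hypothetical DataFrame with an extra "Region" column the table does not have
bad_df = spark.createDataFrame(
    [("Texas", "TX", 27000000, 200000, "South")],
    ["State", "StateCode", "2014Populationestimate", "2015mediansalesprice", "Region"])

try:
    # Schema enforcement rejects the append because the schemas do not match
    bad_df.write.format("delta").mode("append").saveAsTable("database_name.table_name")
except Exception as e:
    print("Write rejected by schema enforcement:", e)

# Opting in to schema evolution adds the new column instead of failing:
# bad_df.write.format("delta").option("mergeSchema", "true").mode("append").saveAsTable("database_name.table_name")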

Reading Data from Delta Lake

Reading data from Delta Lake is also straightforward. Here’s how you can do it:

data = spark.read.format("delta").load("/mnt/<mount-name>/data_geo")
# Display the data
data.show()

Replace <mount-name> with your actual mount point. This script reads the data from the Delta Lake table and displays it.

Read Data from Delta Lake
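
Because saveAsTable also registered the table in the metastore, you can read it by name instead of by path. A small sketch, assuming the table was saved as database_name.data_geo:

# Equivalent read by table name (assumed name) instead of by path
data = spark.table("database_name.data_geo")
data.show()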

One of the benefits of reading data from Delta Lake is that it provides snapshot isolation. This means that each read operation sees a consistent snapshot of the data, even if other operations are modifying the data at the same time. This ensures that the data you read is always consistent, without any anomalies or inconsistencies caused by concurrent write operations.

Updating Data in Delta Lake

Suppose we have new information that the 2014 population estimate for Alabama was actually 5,000,000. In a traditional data lake, we would have to read the entire dataset, find the record for Alabama, update the population estimate, and then overwrite the entire dataset with the updated data.

With Delta Lake, we can directly update the record for Alabama in the Delta Lake table:

from delta.tables import *
from pyspark.sql.functions import expr

deltaTable = DeltaTable.forPath(spark, "/mnt/<mount-name>/data_geo")
deltaTable.update(
    condition = expr("`State` = 'Alabama'"),
    set = {"`2014Populationestimate`": expr("5000000")}
)
Update records in the Delta Lake table

This script directly updates the ‘2014Populationestimate’ column for Alabama in the Delta Lake table, without having to read and overwrite the entire dataset.

Deleting Data in Delta Lake

Now suppose we want to delete the record for Alabama. In a traditional data lake, we would have to read the entire dataset, filter out the record for Alabama, and then overwrite the entire dataset with the remaining data.

With Delta Lake, we can directly delete the record for Alabama from the Delta Lake table:

deltaTable.delete(condition = expr("`State` = 'Alabama'"))
Delete Data in the Delta Lake table

This script directly deletes the record for Alabama from the Delta Lake table, without having to read and overwrite the entire dataset.
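
One detail worth knowing: a delete in Delta Lake is logical at first. The affected Parquet files are rewritten and the old ones are only marked as removed in the transaction log, so they keep occupying storage (and remain reachable through time travel) until you vacuum the table. A hedged sketch:

# Physically remove files no longer referenced by the table and older than the
# retention period (7 days by default); this also limits how far back time travel can go
deltaTable.vacuum()

# Or keep 30 days (720 hours) of history instead:
# deltaTable.vacuum(720)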

Merging Data to Delta Lake

Finally, suppose we have new population data for Alabama that we want to merge into our existing data. In a traditional data lake, we would have to read both the existing dataset and the new data, perform the merge operation, and then overwrite the existing dataset with the merged data.

With Delta Lake, we can directly merge the new data into the Delta Lake table:

# newData is a DataFrame with the same schema as deltaTable
newData = spark.createDataFrame([("Alabama", "AL", 5000000, 123456)],
                                ["State", "StateCode", "2014Populationestimate", "2015mediansalesprice"])

deltaTable.alias("oldData") \
    .merge(newData.alias("newData"), "oldData.State = newData.State") \
    .whenMatchedUpdate(set = {"`2014Populationestimate`": expr("newData.`2014Populationestimate`"),
                              "`2015mediansalesprice`": expr("newData.`2015mediansalesprice`")}) \
    .whenNotMatchedInsert(values = {"State": expr("newData.State"),
                                    "StateCode": expr("newData.`StateCode`"),
                                    "`2014Populationestimate`": expr("newData.`2014Populationestimate`"),
                                    "`2015mediansalesprice`": expr("newData.`2015mediansalesprice`")}) \
    .execute()
Merge Data to the Delta Lake table

This script merges the new data into the Delta Lake table, updating the ‘2014Populationestimate’ and ‘2015mediansalesprice’ for Alabama if it exists, or inserting a new record for Alabama if it doesn’t. This is done directly in the Delta Lake table, without having to read and overwrite the entire dataset.

History in Delta Lake

Delta Lake maintains a detailed history of all the operations that have been performed on a table. This is very useful for understanding how your data has changed over time, debugging issues, and auditing data changes, and it is something a plain data lake cannot offer.

Let’s see how we can view the history of our Delta Lake table:

from delta.tables import *

deltaTable = DeltaTable.forPath(spark, "/mnt/<mount-name>/data_geo")
# Display the history of the table
deltaTable.history().show()
View the History of a Table in Delta Lake

Replace <mount-name> with your actual mount point. This script displays a table with information about each operation that has been performed on the Delta Lake table, including the version number, timestamp, operation type (like WRITE, UPDATE, DELETE), and operation parameters.
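
If you only want the most recent operations, history also accepts a limit (the SQL equivalent is DESCRIBE HISTORY). A small sketch:

# Show only the latest operation recorded for the table
deltaTable.history(1).show(truncate=False)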

Time Travel in Delta Lake

One of the most powerful features of Delta Lake is its ability to “travel back in time”. This means that you can query a previous version of your data, which can be very useful for debugging issues, recovering from mistakes, and auditing data changes.

Let’s see how we can use time travel to query a previous version of our Delta Lake table:

# Load version 0 of the data
data = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/<mount-name>/data_geo")
# Display the data
data.show()

Replace <mount-name> with your actual mount point, and replace the version number with the version you want to load. This script loads the specified version of the data from the Delta Lake table and displays it.

For example, if you made a mistake in version 1 of your data, you can go back to version 0 like this:

# Load version 0 of the data
data = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/<mount-name>/data_geo")
# Display the data
data.filter(data.State == 'Alabama').show()
Time Travel to version 0

This will load the data as it was in version 0, effectively undoing the changes made in version 1.
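
Besides versionAsOf, you can time travel by timestamp, and if you want to roll the table itself back rather than just read an old snapshot, recent Delta Lake releases provide a restore operation. A hedged sketch (the timestamp is hypothetical):

# Read the table as it was at a given point in time
old = spark.read.format("delta") \
    .option("timestampAsOf", "2023-07-30 00:00:00") \
    .load("/mnt/<mount-name>/data_geo")

# Roll the table itself back to version 0 (requires a recent Delta Lake release)
deltaTable.restoreToVersion(0)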

In summary, the history and time travel features of Delta Lake provide powerful capabilities for understanding, debugging, and auditing your data. They give you a detailed record of how your data has changed over time, and the ability to go back to a previous state of your data if needed.

Conclusion

Delta Lake is a powerful data management solution that addresses the limitations of traditional data lakes and data warehouses by offering scalability, reliability, and the ability to perform complex operations like updates, deletes, and merges. It also provides unique features such as history tracking and time travel, enhancing data auditing and debugging capabilities. With seamless integration with tools like Azure Databricks and Azure Blob Storage, and features like transactional guarantees, schema enforcement, and unified batch and real-time processing, Delta Lake simplifies data management and boosts the efficiency of data pipelines. As you explore Delta Lake further, you’ll uncover even more ways it can revolutionize your data operations, regardless of whether you’re a data engineer, data scientist, or data analyst.

Thanks for Reading!

If you like my work and want to support me…

  1. The BEST way to support me is by following me on Medium.
  2. I share content about #dataengineering. Let’s connect on LinkedIn.
  3. Feel free to give claps so I know how helpful this post was for you.

