Keeping your data pipelines balanced with Delta Lake

Phil Hetzel · Slalom Data & AI · Dec 2, 2019

At any point in time, you are about two or three clicks away from reading an article about data science, machine learning, or artificial intelligence. As an analytics architect, I hear these topics broached in nearly every meeting I attend with clients or prospective clients. In fact, a recent Gartner survey found that 59% of respondents had some type of artificial intelligence deployed.

I am excited about the prospects of this; artificial intelligence and machine learning are having a transformational effect on how companies can serve their customers. However, I am also cautious. The same Gartner survey found that 34% of respondents cited data quality and scope as the top challenges for their artificial intelligence initiatives. Dirty data fed into algorithms can do more harm than good.

There is a very specific kind of dirty data that arises from data engineering itself and that I don’t believe gets enough attention: balance. Balancing your data engineering pipelines means making sure that your data transformations output an expected number of rows. During the initial stages of a pipeline, the expected number of rows is typically dictated by the number of rows in the source system. (The data engineer wants to represent the source system with high fidelity before it is altered by downstream business logic.) It’s my firm belief that by keeping their pipelines balanced, data engineers will avoid most downstream data quality issues, leading to better analytical outcomes.

A simple example

A data engineer wants to take snapshots of the Transactions table in an operational database and load those snapshots into the raw zone of a data lake.
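As a rough sketch of what that first stage might look like (the JDBC connection details, table name, and raw-zone path below are illustrative assumptions, not details from the original pipeline), a PySpark job could read the full table and write a dated snapshot to the raw zone:

```python
from datetime import date

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transactions-snapshot").getOrCreate()

# Read the full Transactions table from the operational database.
# The JDBC URL, credentials, and table name are placeholders.
transactions = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://source-db:5432/sales")
    .option("dbtable", "transactions")
    .option("user", "pipeline_user")
    .option("password", "********")
    .load()
)

# Land a full copy of the table in a dated folder in the raw zone.
snapshot_path = f"s3://data-lake/raw/transactions/snapshot_date={date.today()}"
transactions.write.mode("overwrite").parquet(snapshot_path)
```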

The data engineer can do this with a high level of confidence that this stage of the pipeline will be balanced. Each data lake snapshot file will have the same row count as the table it was taken from. Business users can query the most recent snapshot and trust that the dataset has high fidelity. However, it is not always best to take snapshots of tables; what if the table is extremely large? Storing a full copy on every run will cause storage costs to climb quickly.

To account for this, the data engineer implements incremental logic when loading data from the Transactions table (i.e., load only the rows added since the last pipeline run). This makes the pipeline far more efficient, and its associated costs go down.

Snapshot pulls are reliable but inefficient

But there is a problem: several weeks after the pipeline was implemented, business users begin reporting that their analytics are not matching their expectations. The data engineer needs to think about where things could be going awry. To investigate, the engineer takes a row count of the Transactions table in the source and then a combined row count of all the incremental files in the data lake. The two counts are drastically different!

Measuring What Matters

The data engineer knows that something about the data pipeline is off but has no idea when the problem occurred. When our team builds data pipelines, we measure the row counts of the source and of the source’s matching data lake entity, then store that information in a “balance” table. The balance table should capture the row count of the database table at the time the pipeline was run and compare it to the number of active rows in the data lake entity. This process is explained in more depth below.
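A minimal sketch of such a balance table might capture one record per pipeline run. The connection details and the entity/table names (data_lake.transactions, audit.balance) are assumptions made for illustration:

```python
from datetime import datetime

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("balance-table").getOrCreate()

# Row count of the source table at the moment the pipeline runs.
source_rows = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://source-db:5432/sales")
    .option("dbtable", "transactions")
    .load()
    .count()
)

# Row count of the active rows in the matching data lake entity, counted
# through a SQL engine (Hive, Presto, or Spark SQL in this sketch).
target_rows = spark.table("data_lake.transactions").count()

# Append one balance record per run so an imbalance can be traced to a point in time.
record = Row(
    entity="transactions",
    run_time=datetime.utcnow(),
    source_rows=source_rows,
    target_rows=target_rows,
    balanced=(source_rows == target_rows),
)
spark.createDataFrame([record]).write.mode("append").saveAsTable("audit.balance")
```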

Back to the example

The incremental logic employed by the data engineer loads rows whose system modified date falls after the last time the pipeline was run and before the time the current run was triggered. The figure below visualizes two incremental jobs on the source table, Transactions. Additionally, the data engineer has learned the error of their ways and now stores the source row count and the data lake’s active row count for the Transactions table in a balance table after each pipeline run. The data lake entity’s rows are counted through a tool like Hive or Presto. In the example below, the incremental data pipeline has been run twice, resulting in two batches of data loaded to the data lake.

Everything balances, what could go wrong?
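A hedged sketch of that incremental window, assuming the last run time and trigger time are supplied by the orchestration tool and that the source exposes a system-maintained modified_date column (both assumptions for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transactions-incremental").getOrCreate()

# These timestamps would normally come from the pipeline's own run metadata;
# they are hard-coded here purely for illustration.
last_run_time = "2019-11-01 00:00:00"
trigger_time = "2019-11-02 00:00:00"

# Pull only rows whose system modified date falls after the previous run
# and at or before the moment this run was triggered.
incremental_query = f"""
    (SELECT *
       FROM transactions
      WHERE modified_date >  '{last_run_time}'
        AND modified_date <= '{trigger_time}') AS incremental_batch
"""

incremental_batch = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://source-db:5432/sales")
    .option("dbtable", incremental_query)
    .load()
)

# Append the batch to the data lake alongside the earlier batches.
incremental_batch.write.mode("append").parquet("s3://data-lake/raw/transactions/")
```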

Everything should be balanced in this situation because the pipeline only has to append data to the data lake entity. The data engineer can look at the balance table and feel comfortable that data integrity is being maintained in the data lake. However, what happens when there is an update to the data? The figure below shows an edit to Transaction Number 3 that changed the Amount field, with the updated row loaded to the data lake in a third incremental batch.

Updated data showing as an extra row in the data lake

Transaction 3 has an updated modified time and was included in the third pipeline run. In a typical data lake scenario that only appends files into data lake storage, there will now be one more record in the data lake entity than in the source when a business analyst queries this data. Imbalance!

Delta Lake can help correct both the balance measurements and how the business queries the data with minimal effort.

Delta Lake 101

Delta Lake is an open source library from Databricks (the same folks behind Apache Spark) that combines the benefits of a database (ACID transactions, DML statements to upsert data) with the benefits of a data lake (schema flexibility, scalability). Delta Lake allows you to create “Delta tables” that treat groups of similar data lake files as one entity (similar to other frameworks like Hive, Presto, and SparkSQL). However, unlike those frameworks, Delta Lake lets you run Update, Insert, and Delete statements against these Delta tables, and that is a game changer! Let’s explore how Delta Lake allows us to gain comfort that our pipelines are balanced.

How Delta Lake Helps

Using Delta Lake, the data engineer creates a Delta table around the Transactions files in the data lake, enabling ACID transactions and DML statements against them. The Delta table can also be queried with SQL, which will be familiar to business analysts.
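As an illustrative sketch (again with assumed paths and table names, and assuming a Spark session configured with the Delta Lake package), the existing Transactions files can be rewritten in Delta format and registered so analysts can query them with plain SQL:

```python
from pyspark.sql import SparkSession

# Assumes the session was started with the Delta Lake (delta-core) package available.
spark = SparkSession.builder.appName("delta-transactions").getOrCreate()

# Rewrite the raw Transactions files as a Delta table.
(
    spark.read.parquet("s3://data-lake/raw/transactions/")
    .write.format("delta")
    .mode("overwrite")
    .save("s3://data-lake/delta/transactions")
)

# Register the Delta table so it can be queried with SQL.
spark.sql(
    "CREATE TABLE IF NOT EXISTS transactions "
    "USING DELTA LOCATION 's3://data-lake/delta/transactions'"
)

# Unlike plain files, the table now supports DML, for example:
# spark.sql("UPDATE transactions SET amount = 45.00 WHERE transaction_number = 3")
spark.sql("SELECT * FROM transactions LIMIT 10").show()
```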

The data engineer can capture this change to Transaction 3, while still feeling comfortable about the balance, by treating the Delta table like a Type II Slowly Changing Dimension. Instead of simply loading batches of data into data lake file storage, the data engineer performs “upsert” statements on the Delta table: a current flag is added to each row, and rows are “turned off and on” as they receive changes, as sketched after the figure below. Here is a look at the Delta table at the end of the third pipeline run:

History is captured and the present is represented
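A hedged sketch of that upsert, modeled on the common Delta Lake Type II merge pattern. The column names (transaction_number, amount, modified_date, current) and the Delta path are illustrative assumptions, and incremental_batch stands for the DataFrame of changed and new source rows loaded by the incremental step above:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transactions-scd2").getOrCreate()

# incremental_batch: this run's changed/new source rows (see the incremental sketch above).
target = DeltaTable.forPath(spark, "s3://data-lake/delta/transactions")

# 1. Changes to transactions that already have a current row: these need a
#    fresh row inserted for the new version, so they get a NULL merge key.
changed_versions = (
    incremental_batch.alias("updates")
    .join(target.toDF().alias("existing"), "transaction_number")
    .where("existing.current = true AND updates.amount <> existing.amount")
    .selectExpr("NULL AS merge_key", "updates.*")
)

# 2. Every incoming row keyed on transaction_number: matched rows close out
#    the old version, unmatched rows cover brand-new transactions.
keyed_rows = incremental_batch.selectExpr("transaction_number AS merge_key", "*")

staged_updates = changed_versions.unionByName(keyed_rows)

(
    target.alias("t")
    .merge(staged_updates.alias("s"), "t.transaction_number = s.merge_key")
    .whenMatchedUpdate(
        condition="t.current = true AND t.amount <> s.amount",
        set={"current": "false"},  # turn the old version "off"
    )
    .whenNotMatchedInsert(
        values={
            "transaction_number": "s.transaction_number",
            "amount": "s.amount",
            "modified_date": "s.modified_date",
            "current": "true",  # the new version becomes the active row
        }
    )
    .execute()
)
```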

When the data engineer takes the balance of this table to compare with the source, they will only include rows where current equals TRUE. This will return six rows to match the six rows from the source, allowing the data engineer to rest easy knowing that their data pipeline is balanced.
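That check might look like the following, reusing the assumed path and the source count captured earlier in the balance table sketch:

```python
# Only versions flagged as current count toward the balance.
active_rows = (
    spark.read.format("delta")
    .load("s3://data-lake/delta/transactions")
    .where("current = true")
    .count()
)

# source_rows is the count captured from the source at run time; with the
# Type II upsert in place, the two should now match.
print(active_rows == source_rows)
```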

Other Cases

The above example is prevalent in any data lake scenario; however, there are many different reasons why your source table might not match the “active” rows in the data lake entity. Situations that our team has come across include:

· Source row deletions, causing the source row count to be lower than the target row count

· Improper incremental pipeline logic, causing source row counts to be lower or higher than the target row count

· Not using a system generated modified date in the incremental logic, causing source row counts to be lower than the target row count

· Tables are joined during the incremental query. Joins can cause duplicate rows when at least one table’s join column is not unique, causing higher target row counts. Joins can also inadvertently filter records when using inner joins, causing lower target row counts

Using a balance table helps to surface these problems early and will speed up the development process overall.

Looking forward

Our team is looking for ways to extend balancing using open source tools. One tool of interest is Great Expectations, an open source Python library that allows data engineers to apply business and data quality rules to batches of data and then automatically generate documentation from those rules. Great Expectations could be used to apply a rule that source row counts must match target row counts. If the rule is met, Great Expectations can send a Slack notification noting the success of the pipeline. I have some ideas about what the Slack message would say…

Thanos, data quality expert
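As a rough illustration of that row-count rule, using the classic Great Expectations dataset API (the library’s API varies by version, and the DataFrame and counts below are placeholders):

```python
import great_expectations as ge
import pandas as pd

# Placeholder stand-ins for the active rows of the data lake entity and the
# source row count captured at run time.
active_rows = pd.DataFrame({"transaction_number": [1, 2, 3, 4, 5, 6]})
source_row_count = 6

# Wrap the data in a Great Expectations dataset and assert the balance rule.
batch = ge.from_pandas(active_rows)
result = batch.expect_table_row_count_to_equal(value=source_row_count)

if result.success:
    # A notification hook (e.g. Slack) could fire here to announce that the
    # pipeline is, well, perfectly balanced.
    print("Pipeline is balanced.")
else:
    print("Imbalance detected:", result.result)
```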
