Streaming data changes to a Data Lake with Debezium and Delta Lake pipeline

Yinon Data · Israeli Tech Radar · Feb 20, 2020

To demonstrate why Debezium and Delta Lake make an interesting combo, especially for use cases involving microservices applications and Big Data, I’ll share a story I encountered recently.

Capturing data changes in multiple application instances

A Client’s Use Case

One of our clients approached me with the following story:

  • The company builds a microservices application
  • Each of their customers runs its own instance of the application, along with its own databases
  • They want to capture data changes in all of the application instances, across all their customers, to update a central Data Lake that holds a replica of the latest database state
  • They then want to use that Data Lake for aggregations and analytics

What’s the problem?

A naive solution they attempted was to maintain a “last_updated” timestamp column and periodically pull all changes since the last pulled timestamp. This approach has several shortcomings:

  • When developers don’t update the “last_updated” column, changes never arrive
  • The pulling process is not fault tolerant; after a network error, for example, it doesn’t know to pick up where it left off
  • Applicative overhead: they must maintain additional code, on top of the application’s core, to manage this task

What is Debezium?

Debezium is an open source distributed platform for change data capture. Start it up, point it at your databases, and your apps can start responding to all of the inserts, updates, and deletes that other apps commit to your databases. Debezium is durable and fast, so your apps can respond quickly and never miss an event, even when things go wrong.

How can Debezium help us?

Since Debezium is reading database logs:

  • We are no longer dependent on developers updating a certain column; Debezium takes care of capturing every changed row
  • No extra code needed
  • Fault tolerance when consuming the messages

Debezium records the history of data changes in Kafka logs, from where your application consumes them. This makes it possible for your application to easily consume all of the events correctly and completely. Even if your application stops (or crashes), upon restart it will start consuming the events where it left off so it misses nothing.
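For illustration, registering a Debezium connector amounts to POSTing a JSON configuration to the Kafka Connect REST API. A minimal sketch for a MySQL source might look like this (hostnames, credentials and table names are hypothetical, and exact property names vary between Debezium versions):

```json
{
  "name": "customer1-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "database.server.name": "customer1",
    "table.whitelist": "inventory.customers",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.inventory"
  }
}
```

Once registered, Debezium writes each captured table’s change events to its own Kafka topic (here, customer1.inventory.customers).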

What does Debezium output look like?

Each change arrives as a JSON message describing the changed data. Some interesting fields:

  • ”op”: the operation code (c for create, u for update, d for delete, r for the initial snapshot read)
  • ”before”: the row before the change
  • ”after”: the row after the change
  • ”source”: useful information about the server, database and table the change came from
Debezium output messages example
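To make this concrete, the payload of an update event might look roughly like this (the values are made up, and real events also carry schema metadata):

```json
{
  "op": "u",
  "ts_ms": 1582185600000,
  "before": { "id": 1004, "email": "old@example.com" },
  "after":  { "id": 1004, "email": "new@example.com" },
  "source": {
    "connector": "mysql",
    "name": "customer1",
    "db": "inventory",
    "table": "customers"
  }
}
```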

Key Question: how to transform Debezium output into a meaningful table in the Data Lake?

If only we could execute those changes against our Data Lake table, we would get the outcome we want: a replicated state of all the source databases.

This is where Delta Lake comes into play. While in the past it was hard to reliably perform UPSERTS and DELETES on data lake tables, Delta Lake, along with its other benefits, makes them possible. You can read more about it here
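To sketch the idea in PySpark: assuming the Debezium events have already been parsed and flattened into a DataFrame, deduplicated to the latest event per key (the table path and column names here are hypothetical), applying them with Delta Lake’s MERGE could look like this:

```python
from delta.tables import DeltaTable

# Assumes an active SparkSession `spark` with Delta Lake configured.
# `silver` is the target table holding the replicated state (path is hypothetical).
silver = DeltaTable.forPath(spark, "s3://data-lake/silver/customers")

# `changes` is a DataFrame with one row per captured change: the Debezium
# operation code in `op`, plus the flattened row image (the "after" fields,
# or the "before" fields for deletes).
(silver.alias("t")
    .merge(changes.alias("c"), "t.id = c.id")
    .whenMatchedDelete(condition="c.op = 'd'")                # apply deletes
    .whenMatchedUpdateAll(condition="c.op = 'u'")             # apply updates
    .whenNotMatchedInsertAll(condition="c.op IN ('c', 'r')")  # creates and snapshot reads
    .execute())
```

Running such a MERGE for every batch of change events keeps the Silver table converging to the latest state of the source databases.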

High Level Strategy Overview

  • Debezium reads the database logs, produces JSON messages that describe the changes, and streams them to Kafka
  • The messages are streamed from Kafka and stored in an S3 folder. We call this the Bronze table, as it holds the raw messages
  • Using Spark with Delta Lake, we transform the messages into INSERT, UPDATE and DELETE operations and run them against the target Data Lake table, which holds the latest state of all the source databases. We call this the Silver table (a sketch follows below)
  • Next we can perform further aggregations on the Silver table for analytics. We call this the Gold table
Databricks’ Delta Architecture
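Gluing the stages together with Spark Structured Streaming could look roughly like the sketch below. The topic, paths and the parse_debezium/apply_changes helpers are hypothetical placeholders; apply_changes stands for a MERGE like the one sketched earlier:

```python
from pyspark.sql import SparkSession

# Assumes the Kafka and Delta Lake packages are available to Spark.
spark = SparkSession.builder.appName("debezium-delta-pipeline").getOrCreate()

# Bronze: land the raw Debezium messages from Kafka as-is.
raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "customer1.inventory.customers")
    .option("startingOffsets", "earliest")
    .load())

bronze_query = (raw.selectExpr("CAST(value AS STRING) AS json")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "s3://data-lake/checkpoints/bronze")
    .start("s3://data-lake/bronze/customers"))

# Silver: for each micro-batch of raw messages, parse them and apply
# the changes to the Silver table with a MERGE (hypothetical helpers).
def upsert_to_silver(batch_df, batch_id):
    apply_changes(parse_debezium(batch_df))

silver_query = (spark.readStream
    .format("delta")
    .load("s3://data-lake/bronze/customers")
    .writeStream
    .foreachBatch(upsert_to_silver)
    .option("checkpointLocation", "s3://data-lake/checkpoints/silver")
    .start())
```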

End-To-End pipeline example project

For Tikal’s hackathon we assembled a team to build, in one day, an example end-to-end project that demonstrates the above pipeline.

End-To-End example pipeline architecture

You can find the code, along with further details, in our GitHub repo.

Limitations

Consider the following limitations and open questions before using Debezium and Delta Lake:

  • View / Table Joins:
    Debezium has no such notion. It knows how to capture changes in individual tables, but you can’t tell it “if this table changes — capture this whole joined view” without additional effort on your part
  • Continuous streaming:
    Delta Lake’s support for streaming is limited to micro-batch streams. Because it works with files, it doesn’t support real-time continuous streams (see the snippet below)
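In Structured Streaming terms, this means the Silver job runs on micro-batch triggers rather than Spark’s continuous processing mode. A minimal sketch (the interval and stream name are illustrative):

```python
# `parsed_stream` is a hypothetical streaming DataFrame of change events.
(parsed_stream.writeStream
    .foreachBatch(upsert_to_silver)
    .trigger(processingTime="1 minute")  # micro-batches; continuous mode is not supported for Delta sinks
    .option("checkpointLocation", "s3://data-lake/checkpoints/silver")
    .start())
```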

Summary

  • We saw a client’s use case involving change data capture in a microservices and Big Data environment
  • We saw some of the challenges of assembling such a pipeline
  • We saw how Debezium and Delta Lake answer those challenges
  • We saw a high level strategy for assembling a pipeline with the Debezium-Delta Lake combo
  • We saw an example end-to-end pipeline

Thanks for reading!

Contact Info:
yinondrornahamu@gmail.com
