Streaming data changes to a Data Lake with Debezium and Delta Lake pipeline

Yinon Data · Israeli Tech Radar · Feb 20, 2020

To demonstrate why Debezium and Delta Lake make an interesting combo, especially for use cases involving microservices applications and Big Data, I’ll share a story I encountered recently.

Capturing data changes in multiple application instances

A Client’s Use Case

One of our clients approached me with the following story:

  • The company builds a microservices application
  • Each of their customers runs its own instance of the application, along with its own databases
  • They want to capture data changes in all of the application instances, across all their customers, to update a central Data Lake that holds a replica of the latest database state
  • They then want to use that Data Lake for aggregations and analytics

What’s the problem?

A naive solution they attempted was to maintain a “last_updated” timestamp column and periodically pull all changes since the last pulled timestamp. This approach has several shortcomings:

  • When developers don’t update the “last_updated” column, changes never arrive
  • The pulling process is not fault tolerant; after a network error, for example, it doesn’t know to pick up where it left off
  • Applicative overhead: they must maintain additional code, on top of the application’s core, to manage this task

What is Debezium?

Debezium is an open source distributed platform for change data capture. Start it up, point it at your databases, and your apps can start responding to all of the inserts, updates, and deletes that other apps commit to your databases. Debezium is durable and fast, so your apps can respond quickly and never miss an event, even when things go wrong.

How can Debezium help us?

Since Debezium is reading database logs:

  • We are no longer dependent on developers updating a certain column; Debezium takes care of capturing every changed row
  • No extra code needed
  • Fault tolerance when consuming the messages

Debezium records the history of data changes in Kafka logs, from where your application consumes them. This makes it possible for your application to easily consume all of the events correctly and completely. Even if your application stops (or crashes), upon restart it will start consuming the events where it left off so it misses nothing.
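For illustration, registering a Debezium connector amounts to POSTing a JSON configuration to the Kafka Connect REST API. A minimal sketch for a MySQL source might look like this (hostnames, credentials and table names are hypothetical, and exact property names vary between Debezium versions):

```json
{
  "name": "customer1-connector",
  "config": {
    "connector.class": "io.debezium.connector.mysql.MySqlConnector",
    "database.hostname": "mysql",
    "database.port": "3306",
    "database.user": "debezium",
    "database.password": "dbz",
    "database.server.id": "184054",
    "database.server.name": "customer1",
    "table.whitelist": "inventory.customers",
    "database.history.kafka.bootstrap.servers": "kafka:9092",
    "database.history.kafka.topic": "schema-changes.inventory"
  }
}
```

Once registered, Debezium writes each captured table’s change events to its own Kafka topic (here, customer1.inventory.customers).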

What does Debezium output look like?

Each change arrives as a JSON message describing the changed data. Some interesting fields:

  • ”op”: the operation code (c for create, u for update, d for delete, r for the initial snapshot read)
  • ”before”: the row before the change
  • ”after”: the row after the change
  • ”source”: useful information about the server, database and table the change came from
Debezium output messages example
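To make this concrete, the payload of an update event might look roughly like this (the values are made up, and real events also carry schema metadata):

```json
{
  "op": "u",
  "ts_ms": 1582185600000,
  "before": { "id": 1004, "email": "old@example.com" },
  "after":  { "id": 1004, "email": "new@example.com" },
  "source": {
    "connector": "mysql",
    "name": "customer1",
    "db": "inventory",
    "table": "customers"
  }
}
```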

Key Question: how to transform Debezium output into a meaningful table in the Data Lake?

If only we could execute those changes against our Data Lake table, we would get the outcome we want: a replicated state of all the source databases.

This is where Delta Lake comes into play. While in the past it was hard to reliably perform UPSERTS and DELETES on data lake tables, Delta Lake, along with its other benefits, makes them possible. You can read more about it here
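To sketch the idea in PySpark: assuming the Debezium events have already been parsed and flattened into a DataFrame, deduplicated to the latest event per key (the table path and column names here are hypothetical), applying them with Delta Lake’s MERGE could look like this:

```python
from delta.tables import DeltaTable

# Assumes an active SparkSession `spark` with Delta Lake configured.
# `silver` is the target table holding the replicated state (path is hypothetical).
silver = DeltaTable.forPath(spark, "s3://data-lake/silver/customers")

# `changes` is a DataFrame with one row per captured change: the Debezium
# operation code in `op`, plus the flattened row image (the "after" fields,
# or the "before" fields for deletes).
(silver.alias("t")
    .merge(changes.alias("c"), "t.id = c.id")
    .whenMatchedDelete(condition="c.op = 'd'")                # apply deletes
    .whenMatchedUpdateAll(condition="c.op = 'u'")             # apply updates
    .whenNotMatchedInsertAll(condition="c.op IN ('c', 'r')")  # creates and snapshot reads
    .execute())
```

Running such a MERGE for every batch of change events keeps the Silver table converging to the latest state of the source databases.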

High Level Strategy Overview

  • Debezium reads the database logs, produces JSON messages that describe the changes, and streams them to Kafka
  • The messages are streamed from Kafka and stored in an S3 folder. We call this the Bronze table, as it holds the raw messages
  • Using Spark with Delta Lake, we transform the messages into INSERT, UPDATE and DELETE operations and run them against the target Data Lake table, which holds the latest state of all the source databases. We call this the Silver table (a sketch follows below)
  • Next we can perform further aggregations on the Silver table for analytics. We call this the Gold table
Databricks’ Delta Architecture
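Gluing the stages together with Spark Structured Streaming could look roughly like the sketch below. The topic, paths and the parse_debezium/apply_changes helpers are hypothetical placeholders; apply_changes stands for a MERGE like the one sketched earlier:

```python
from pyspark.sql import SparkSession

# Assumes the Kafka and Delta Lake packages are available to Spark.
spark = SparkSession.builder.appName("debezium-delta-pipeline").getOrCreate()

# Bronze: land the raw Debezium messages from Kafka as-is.
raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "customer1.inventory.customers")
    .option("startingOffsets", "earliest")
    .load())

bronze_query = (raw.selectExpr("CAST(value AS STRING) AS json")
    .writeStream
    .format("delta")
    .option("checkpointLocation", "s3://data-lake/checkpoints/bronze")
    .start("s3://data-lake/bronze/customers"))

# Silver: for each micro-batch of raw messages, parse them and apply
# the changes to the Silver table with a MERGE (hypothetical helpers).
def upsert_to_silver(batch_df, batch_id):
    apply_changes(parse_debezium(batch_df))

silver_query = (spark.readStream
    .format("delta")
    .load("s3://data-lake/bronze/customers")
    .writeStream
    .foreachBatch(upsert_to_silver)
    .option("checkpointLocation", "s3://data-lake/checkpoints/silver")
    .start())
```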

End-To-End pipeline example project

For Tikal’s hackathon we assembled a team to build, in one day, an example end-to-end project that demonstrates the above pipeline.

End-To-End example pipeline architecture

You can find the code, along with further details, in our GitHub repo.

Limitations

Consider the following limitations and open questions before using Debezium and Delta Lake:

  • View / Table Joins:
    Debezium has no such notion. It knows how to capture changes in individual tables, but you can’t tell it “if this table changes — capture this whole joined view” without additional effort on your part
  • Continuous streaming:
    Delta Lake’s support for streaming is limited to micro-batch streams. Because it works with files, it doesn’t support real-time continuous streams (see the snippet below)
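In Structured Streaming terms, this means the Silver job runs on micro-batch triggers rather than Spark’s continuous processing mode. A minimal sketch (the interval and stream name are illustrative):

```python
# `parsed_stream` is a hypothetical streaming DataFrame of change events.
(parsed_stream.writeStream
    .foreachBatch(upsert_to_silver)
    .trigger(processingTime="1 minute")  # micro-batches; continuous mode is not supported for Delta sinks
    .option("checkpointLocation", "s3://data-lake/checkpoints/silver")
    .start())
```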

Summary

  • We saw a client’s use case involving change data capture in a microservices and Big Data environment
  • We saw some of the challenges of assembling such a pipeline
  • We saw how Debezium and Delta Lake answer those challenges
  • We saw a high level strategy for assembling a pipeline with the Debezium-Delta Lake combo
  • We saw an example end-to-end pipeline

Thanks for reading!

Contact Info:
yinondrornahamu@gmail.com
