Yotpo Engineering
Published in

Yotpo Engineering

Building zero-latency data lake using Change Data Capture

Introduction

Tracking database changes with CDC

CDC with Kafka and Metorikku

Figure 1: Change Data Capture with Metorikku

Debezium (Kafka Connect)

Why Avro is Awesome

Schema Registry

Apache Hudi Storage Format

Metorikku

steps:
- dataFrameName: cdc_filtered
sql:
SELECT ts_ms,
op,
before,
after
FROM cdc_events
WHERE op IN ('d', 'u', 'c')
- dataFrameName: cdc_by_event
sql:
SELECT ts_ms,
CASE WHEN op = 'd' THEN before ELSE after END AS cdc_fields,
CASE WHEN op = 'd' THEN true ELSE false END AS _hoodie_delete
FROM cdc_filtered
- dataFrameName: cdc_with_fields
sql:
SELECT ts_ms,
cdc_fields.*,
_hoodie_delete
FROM cdc_by_event
- dataFrameName: cdc_table
sql:
SELECT *,
id AS hoodie_key,
from_unixtime(created_at, 'yyyy/MM/dd') AS hoodie_partition
FROM cdc_with_fields
output:
- dataFrameName: cdc_table
outputType: Hudi
outputOptions:
path: table_view.parquet
keyColumn: hoodie_key
timeColumn: ts_ms
partitionBy: hoodie_partition
hivePartitions: year,month,day
tableName: hoodie_test
saveMode: Append

Monitoring

Figure 2: Monitoring Kafka Connect

Looking Forward

--

--

We're the Engineering Department of Yotpo, we share our insights and thoughts on developing and running large scale web and data applications.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store