CDC-based Upserts with Debezium, Apache Kafka, and Apache Pinot

How to build a streaming data pipeline to capture MySQL database changes and stream them to Apache Pinot via Debezium and Kafka

Dunith Danushka
Tributary Data

Upserting means inserting a record into a database if it does not already exist, or updating it if it does. An analytics database at the end of a streaming data pipeline can use upserts to stay consistent with the source database.
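As a minimal sketch of these semantics, assume a table modeled as an in-memory Python dictionary keyed by primary key. The `upsert` function and the `orders` data are hypothetical and for illustration only; this is not how Pinot implements upserts.

```python
from typing import Any, Dict

Row = Dict[str, Any]


def upsert(table: Dict[int, Row], key: int, row: Row) -> str:
    """Insert the row if the key is absent, otherwise overwrite it.

    Returns "inserted" or "updated" so the caller can see which path ran.
    """
    action = "updated" if key in table else "inserted"
    table[key] = row
    return action


if __name__ == "__main__":
    # Hypothetical table: order id -> latest known state of that order.
    orders: Dict[int, Row] = {}
    print(upsert(orders, 1, {"status": "OPEN"}))     # first sighting of key 1
    print(upsert(orders, 1, {"status": "SHIPPED"}))  # same key, newer state
    print(orders)
```

The table ends up with one row per key holding the latest state, which is the property an analytics database needs in order to mirror its source.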

This article explores a minimum viable setup for a streaming data pipeline that captures changes from MySQL and streams them to Apache Pinot via Debezium and Apache Kafka. Several videos cover the same topic, but this article gives you a solid blueprint to start building your CDC pipeline at scale.

Why do we need upserts?

A real-time analytics system consists of several sub-systems working together to derive insights from events flowing through them.

Change data capture (CDC) tools such as Debezium capture changes in transactional databases, transform them into events, and stream them…

Dunith Danushka
Editor of Tributary Data. Technologist, Writer, Senior Developer Advocate at Redpanda. Opinions are my own.