How to Build Your First Real-Time Streaming (CDC) System (Introduction - Part 1)

Rohan Mudaliar · Published in Analytics Vidhya · 6 min read · Mar 10, 2020

Introduction

With the exponential growth of data and more and more businesses moving online, it has become imperative to design systems that can act in real time, or near real time, to support business decisions. After working on backend projects for many years, I finally got to build a real-time streaming platform, and while working on the project I experimented with different tech stacks for the job. I am sharing my learnings in a series of articles; here is the first of them.

Also, in this series the main focus will be on the how-to rather than the how-it-works. We’ll spend most of the time learning how to implement our use case, though I will cover some theory as and when required.

Target Audience:

This post is aimed at engineers who are already familiar with microservices and the Java programming language, and who are looking to build their first real-time streaming pipeline. For readability, this POC is divided into 4 articles. They are as below:-

  • The first article (the current one) contains the basic concepts, the problem statement, the tech stack and everything we need to know about what we are building.
  • Article 2 (System Setup) will contain the steps to set up the MySQL database for binlogs and the Docker infrastructure. We also cover how to create Kafka topics using Kafka Connect.
  • Article 3 (Kafka Streams and Aggregations) will contain the Java backend code where we listen to the Kafka topics to create/update aggregated indices in Elasticsearch.
  • Article 4 (Verification and Running Your Code) will cover running the code locally and using Kibana to verify the real-time streaming system and download the final aggregated report.

What to Expect:

A proof of concept on how to build a real-time streaming (CDC) pipeline. Further optimization work would be needed to make it production-ready.

Problem Statement

Imagine that you work for an e-commerce company selling fashion products. The business team wants to make decisions based on real-time updates happening in the backend systems, and wants dashboards and views built on that data. Let's assume the backend is built on a microservice architecture, with multiple systems interacting with each other during any user operation, each of them talking to its own database.

Let's consider three of the systems for our POC.

For this example, the business team wants to create an outbound dashboard. It contains the number of orders, the types of items sold, the cost of each item and the cost of shipping the items. This data is constantly being updated across multiple systems by user and real-world actions, as seen below.

Once the customer has placed an order and the payment is made, the order goes to the warehouse, and from there the item is sent to the customer. Consider that after each action, one system is updated in real time.

Let's have a look at the kind of data required by the business team:-

  • Units shipped per day, per item type, per warehouse.
  • Orders shipped per day with each courier company.
  • The total cost of items shipped per day, per warehouse.

We know that this data is present in the order service, the warehouse service and the logistics service. Assume all of them use MySQL databases and all of them are updated in real time. Now that we have looked at the use case, let's think about what our solution to this problem could be.

Possible Solution: We need a way to capture all the updates/inserts that happen in the different databases of the different services and put them in a single place, from where we can build reports and run analytics. This is where Kafka Connect and Debezium come in.
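To make this a little more concrete, here is a minimal, hypothetical sketch (Java 11+) of how a Debezium MySQL connector could be registered against the Kafka Connect REST API for the order service database. Every host name, credential, database and topic name below is a placeholder, and the exact connector properties depend on your Debezium version; the real setup is walked through step by step in Article 2.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterOrderConnector {

    public static void main(String[] args) throws Exception {
        // Hypothetical connector config for the order service database.
        // Property names follow older Debezium 1.x releases; newer versions
        // rename some of them (e.g. database.whitelist -> database.include.list).
        String connectorConfig = "{"
                + "\"name\": \"order-service-connector\","
                + "\"config\": {"
                + "  \"connector.class\": \"io.debezium.connector.mysql.MySqlConnector\","
                + "  \"database.hostname\": \"localhost\","
                + "  \"database.port\": \"3306\","
                + "  \"database.user\": \"debezium\","
                + "  \"database.password\": \"dbz-password\","
                + "  \"database.server.id\": \"1001\","
                + "  \"database.server.name\": \"orderdb\","
                + "  \"database.whitelist\": \"orders\","
                + "  \"database.history.kafka.bootstrap.servers\": \"localhost:9092\","
                + "  \"database.history.kafka.topic\": \"dbhistory.orders\""
                + "}}";

        // POST the config to the Kafka Connect REST API (default port 8083).
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(connectorConfig))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Once a connector like this is registered for each service's database, every insert/update/delete flows into Kafka as a change event without touching the services themselves.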

Understanding CDC, MySQL Binlogs, Debezium, and some basic concepts:

So before we jump into the implementation of this system we will need to understand a few concepts. They are as below:-

Change data capture:

Change Data Capture (CDC) tracks data changes, usually close to real time. CDC can be implemented for various tasks such as auditing, copying data to another system or processing (and reacting to) events. In MySQL, the easiest and probably most efficient way to track data changes is to use binary logs.

MySQL Binlogs

The binary log is a set of log files that contain information about data modifications made to a MySQL server instance. The log is enabled by starting the server with the --log-bin option.

The binary log was introduced in MySQL 3.23.14. It contains all statements that update data. It also contains statements that potentially could have updated it (for example, a DELETE which matched no rows), unless row-based logging is used. Statements are stored in the form of "events" that describe the modifications.
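Before wiring up any CDC tooling, it is worth confirming that binary logging is actually enabled on your MySQL instance. Below is a quick sketch of how to check this from Java via JDBC; the connection URL, user and password are placeholders for whatever your local setup uses.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class BinlogCheck {

    public static void main(String[] args) throws Exception {
        // Placeholder connection details; adjust for your local setup.
        String url = "jdbc:mysql://localhost:3306/orders";
        try (Connection conn = DriverManager.getConnection(url, "root", "password");
             Statement stmt = conn.createStatement();
             // log_bin reports ON when the server was started with --log-bin.
             ResultSet rs = stmt.executeQuery("SHOW VARIABLES LIKE 'log_bin'")) {
            while (rs.next()) {
                System.out.println(rs.getString("Variable_name") + " = " + rs.getString("Value"));
            }
        }
    }
}
```

If log_bin comes back as OFF, the server needs to be started with the --log-bin option as described above (Article 2 covers this setup).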

OpLogs:

Similar to MySQL binlogs, MongoDB has something called the oplog, which serves the same purpose and can also be used for CDC.

Debezium:

Debezium is a set of distributed services to capture changes in your databases so that your applications can see those changes and respond to them. Debezium records all row-level changes within each database table in a change event stream, and applications simply read these streams to see the change events in the same order in which they occurred.
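To give a feel for what this looks like from the application side, here is a minimal sketch of a plain Kafka consumer reading the change-event stream that Debezium would publish for an orders table. The bootstrap server and topic name are assumptions (Debezium derives topic names from the server and table names configured on the connector), and in this POC we will actually use Kafka Streams for this in Article 3 rather than a plain consumer.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderChangeEventConsumer {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-cdc-poc");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assumed topic name of the form <server.name>.<database>.<table>.
            consumer.subscribe(Collections.singletonList("orderdb.orders.order"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    // Each value is a JSON change event containing "before", "after"
                    // and "op" ("c" = create, "u" = update, "d" = delete) fields.
                    System.out.println(record.value());
                }
            }
        }
    }
}
```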

Stream Processing, Real-time processing and Complex event processing

Before we dive into the implementation, one thing I want to put forth is that in this POC I am building a Complex Event Processing (CEP) system. There are very subtle differences between stream processing, real-time processing, and complex event processing. They are as below:-

  • Stream Processing: data is processed continuously as it arrives rather than in batches. Stream processing is useful for tasks like fraud detection and cybersecurity: if transaction data is stream-processed, fraudulent transactions can be identified and stopped before they are even complete.
  • Real-time Processing: if event time is very relevant and latencies in the seconds range are completely unacceptable, it is called real-time (or near real-time) processing, e.g. a flight control system for a space program.
  • Complex Event Processing (CEP): CEP utilizes event-by-event processing and aggregation (for example, on potentially out-of-order events from a variety of sources, often with large numbers of rules or business logic).

Which of these applies depends more on the business use case; the same tech stack can be used for stream processing as well as real-time processing.

Tech Stack:

Now that we have an overall picture of what we want to achieve and of the concepts involved, the next step is to understand the overall technical tasks and the tech stack that we will use to build our system.

The tech stack we will be using for this POC is as below:-

  • MySQL 8.0.18
  • Apache Kafka Connect 1.1
  • Apache Kafka
  • Apache Kafka Streams 2.3.1
  • Elasticsearch 7.5.1
  • Kibana 7.5.1
  • Docker 19.03.5
  • Swagger UI
  • Postman 7.18.1

Overall Technical Tasks:

So in terms of overall tasks, we will be splitting them up as below:-

  1. Setting up local infrastructure using docker.
  2. Data ingestion into Kafka from the MySQL databases using Kafka Connect.
  3. Reading data using Kafka Streams in the Java backend (a rough sketch follows this list).
  4. Creating indices on Elasticsearch for the aggregated views.
  5. Listening to the events in real-time and updating the same.
  6. Setting up your local environment and running the Java code.
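As a preview of task 3, below is a minimal Kafka Streams sketch of the kind of aggregation we will build in Article 3: counting shipped units per day per warehouse from a change-event topic. The topic names, the composite key format and the payload parsing are simplified assumptions (extractShipmentKey is a hypothetical helper); the full version writes the aggregates into Elasticsearch indices.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class OutboundAggregation {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "outbound-dashboard-poc");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Assumed topic carrying warehouse shipment change events from Debezium.
        KStream<String, String> shipments = builder.stream("warehousedb.warehouse.shipment");

        // Re-key each event by "<date>|<warehouseId>" and count events per key.
        KTable<String, Long> shippedPerDayPerWarehouse = shipments
                .groupBy((key, value) -> extractShipmentKey(value),
                         Grouped.with(Serdes.String(), Serdes.String()))
                .count();

        // Publish the running counts; a sink (covered later) indexes them into Elasticsearch.
        shippedPerDayPerWarehouse.toStream()
                .mapValues(String::valueOf)
                .to("outbound-shipments-per-day-per-warehouse",
                    Produced.with(Serdes.String(), Serdes.String()));

        new KafkaStreams(builder.build(), props).start();
    }

    private static String extractShipmentKey(String changeEventJson) {
        // Placeholder: a real implementation would parse the Debezium JSON payload
        // ("after" section) and return something like "2020-03-10|WAREHOUSE_BLR".
        return "2020-03-10|WAREHOUSE_BLR";
    }
}
```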

Summary and Next Steps:

So in this article, we looked at the below:-

  • We understood the problem statement and looked at a possible solution.
  • We understood the different concepts required for our solution.
  • We looked at the tech stack for the solution.

Next steps: In the next article, we will look at how to set up the MySQL database so that binlogs are written, and at the Docker setup that creates the required infrastructure on our local system.

If you like this article, please read the subsequent articles and share your feedback. Find me on LinkedIn at rohan_linkedIn.
