Rivery Change Data Capture: Lessons learned from building a CDC solution from scratch

By Alon Reznik, Co-Founder & Chief Architect at Rivery

Once upon a time, in a data pipeline not so far away

At Rivery, we’re all about data, specifically managing the data pipeline lifecycle.

We move data from point A to point B, but we also focus on moving updates and changes from point A to point B in the fastest, most efficient way possible.

To optimize this process, we chose Change Data Capture (CDC) as the best solution to help our customers take care of these updates and changes.

Initially, we selected a well-regarded, open-source CDC solution, but it ended up not working out. There were a number of hurdles that we had to overcome, which ultimately made this option a non-starter.

If you’re looking to deploy a CDC platform, then it’s worthwhile to read through some of the challenges we faced and learn from our experience, so you can be more confident in the solution you choose to support your specific needs.

First, let’s take a quick look at why CDC is so important. Then, we’ll dive into why the platform we chose wasn’t optimal for our needs, and what we did to make CDC work for us and our customers.

Why CDC in the first place?

The need for replication speed

As data engineers and analysts, we need to move data from our relational databases (point A), such as SQL Server or MySQL, to data warehouses, data lakes, or other target databases (point B).

Whenever there’s a change or update in the database, we also need to sync between the two in as close to real time as possible, and with as little friction and complexity as possible.

Traditionally, this has been achieved through batch data replication, executed once or several times a day. But batch data pulling requires additional compute, provides insufficient information on the history of deleted rows, and entails higher latency.

When it comes to data replication, data engineers and analysts need a different approach.

In comes a different approach

To reduce overhead, eliminate latency, and enable real-time analytics, organizations have moved away from batch or bulk load updating to incremental updating with change data capture.

Basically, CDC speeds up data processing by eliminating the need to replicate the full database in the ETL/ELT pipeline every time the analytics database, a separate copy of the production database, needs to be refreshed.

This process entails identifying and capturing every data change (e.g., inserts, updates, and deletes) from the database logs at point A in real time, using the database engine’s native API, and then delivering those changes to point B.

Because it only reads changes from the logs, it eliminates the need for ongoing database replication through the database engine, thereby minimizing the resources required for ETL/ELT processes.

And since it deals with new database events as they occur, it enables real-time or near-real-time data movement.

As such, CDC is ideal for near-real-time business intelligence as well as for cloud migrations.
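To make this concrete, here’s a minimal sketch in Go (the language we ultimately chose, more on that below) of what tailing a database log looks like, using the open-source go-mysql library to stream MySQL binlog events. The connection details are placeholders, and this is an illustration of the general technique rather than our actual implementation:

```go
package main

import (
	"context"
	"log"

	"github.com/go-mysql-org/go-mysql/mysql"
	"github.com/go-mysql-org/go-mysql/replication"
)

func main() {
	// Connect to MySQL as a replication client and stream binlog events.
	cfg := replication.BinlogSyncerConfig{
		ServerID: 100, // must be unique among replicas
		Flavor:   "mysql",
		Host:     "127.0.0.1",
		Port:     3306,
		User:     "repl",
		Password: "secret",
	}
	syncer := replication.NewBinlogSyncer(cfg)

	// Start from a known binlog position (in practice, persist and resume this).
	streamer, err := syncer.StartSync(mysql.Position{Name: "mysql-bin.000001", Pos: 4})
	if err != nil {
		log.Fatal(err)
	}

	for {
		ev, err := streamer.GetEvent(context.Background())
		if err != nil {
			log.Fatal(err)
		}
		// Row-level events carry the inserted/updated/deleted values,
		// which a CDC pipeline would then deliver to point B.
		switch e := ev.Event.(type) {
		case *replication.RowsEvent:
			log.Printf("change on %s.%s: %v", e.Table.Schema, e.Table.Table, e.Rows)
		}
	}
}
```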

The Debezium path to CDC

At first, we decided to go with Debezium. Among the different CDC tools out there, Debezium is one of the most popular, with 7K GitHub stars and 1.8K forks.

This is an open-source distributed platform that’s built on top of Apache Kafka, where you can:

“Start it up, point it at your databases, and your apps can start responding to all of the inserts, updates, and deletes that other apps commit to your databases.” (Debezium)

The advantages of Debezium are said to include:

  • Durability and speed
  • Scalability in handling large volumes of data
  • Incubating connectors for MySQL, SQL Server, Postgres, and Oracle
  • Community sink connectors for ElasticSearch
  • Sink connectors for MySQL and SQL Server
  • MongoDB connector maturity
  • Fault tolerance

And, of course, the fact that it’s open source was an additional advantage for us. So, we went with Debezium.

Our short affair with Debezium

We were feeling good. Optimistic. Debezium is known for being great, and we wanted nothing less than the best for our customers.

We implemented Debezium, planning to manage it on the backend so our customers could enjoy its benefits whenever CDC was required by processes executed through our platform on the front end.

But, sadly, Debezium just didn’t work for our needs. Some of the main issues we faced include:

Chasing open-source version updates

Embedding an open-source core solution into your own product is challenging. You need to review every new version, track the changes, and make sure your product stays stable in light of them, all while balancing other priorities, keeping up your fix velocity, and applying security patches. So, while Debezium is a well-regarded solution for some cases, for us the overhead of version upgrades and its impact on our customers made the effort greater than expected.

Error messages

With Debezium, as with any solution built on Kafka Connect, error messages expose the full Java stack trace without surfacing the root cause of the underlying problem. This lack of visibility made remediation very difficult, and explaining to our customers what the real error behind the message was proved quite challenging.

Scale

With Debezium on Kafka Connect, all pipelines run on the same connectors. The result is round-robin scheduling among the pipelines, which makes scaling complex and often out of reach as the number of sources grows.

Code transparency

Debezium uses Kafka Connect to move data into Kafka. The issue with Kafka Connect is that it doesn’t come with the required level of code transparency, which also complicates maintaining CDC processes across the hundreds of different configurations our customers run in their own databases.

Logging

Debezium’s logs weren’t aligned with our formatting, nor did they provide the essential information we needed. To illustrate, every row of the error stack is emitted as a separate line in the logs, which makes tracking very challenging. Some logs are even hidden in certain use cases, while in other cases there’s over-logging.

Topologies support

While Debezium does offer support for certain topologies, it doesn’t support the hundreds of different types of topologies that are used by our customers.

SSH tunnels

Whenever a customer organization was using SSH tunnels for connectivity, Debezium couldn’t support it.

We needed (and ultimately created) a solution that would connect to our client’s databases using different topologies, such as SSH tunnel, SSL, and VPN tunnels. SSH tunnels are commonly used to connect remote databases in internal VPCs and unfortunately, this isn’t embedded out of the box in Debezium.

Use cases

The biggest issue we experienced was that Debezium’s use cases couldn’t be tailored to our customers’ needs. For example, we received regular feedback from customers that the open platform simply didn’t work with specific data types and tables.

What should you evaluate when considering a CDC platform?

  • Frequent version updates: will you be able to handle regular merges and heavy maintenance?
  • Error messages: does the platform deliver clarity for quick and efficient remediation?
  • Scale: will the platform scale smoothly as the number of sources and pipelines grows?
  • Code transparency: can you easily get into the code to make changes when needed?
  • Logging: do the log files provide the information you need?
  • Topologies support: does it support your topology?
  • SSH tunnels: does it support SSH connectivity and other connectivity best practices?
  • Use cases: can it handle all your use cases, tables, etc.?

Unfortunately, the answer to all of the above was ‘no’ for us. As a result, we decided to take things into our own hands and develop our own CDC solution.

A new approach to CDC

Our goal was to create our own platform, one we would intuitively know how to maintain and debug. When the code is yours, you have access to it, you have visibility, and you know how to handle any issue that might arise.

But building your own CDC platform is no simple task. It requires a lot of knowledge about so many different possible specifications that need to be applied to just about every database.

It’s not just about knowing your own data, database, tables, and topology. It’s about knowing everything about them, and how to get data from point A to point B seamlessly, quickly, and efficiently.

But for us, and for our customers, the advantages of creating our own CDC platform far outweighed the challenges.

So, what did we do?

We mapped out all the issues we were experiencing, aggregated insights, assembled our best and brightest for the task, and built our own CDC platform as follows.

For the programming language, we evaluated and considered Python and Go.

In weighing the two, we quickly realized that Python doesn’t support true multithreading (the GIL prevents threads from running Python code in parallel), nor does it offer a Kafka client robust enough for our needs. Since multithreading is so important for processing many events at the same time (a pillar of CDC), and since our solution is built on Kafka, we chose not to go with Python as our programming language.

This led us to choose Go, which offers many advantages:

  • Multithreading out of the box (using goroutines)
  • Task queues using Go channels for concurrency out of the box (sketched briefly below)
  • Great support for I/O
  • Stability
  • Flexibility
  • A very light footprint
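As a tiny illustration of why these features matter for CDC, here’s what a task queue built from a buffered channel and a pool of goroutines looks like in Go. The ChangeEvent type and worker count are made up for the example; this is a sketch of the pattern, not our production code:

```go
package main

import (
	"fmt"
	"sync"
)

// ChangeEvent is a hypothetical, simplified representation of one captured change.
type ChangeEvent struct {
	Table string
	Op    string // "insert", "update", "delete"
}

func main() {
	events := make(chan ChangeEvent, 1000) // buffered channel acts as the task queue
	var wg sync.WaitGroup

	// A pool of goroutines consumes events concurrently: multithreading out of the box.
	for w := 0; w < 4; w++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for ev := range events {
				fmt.Printf("worker %d handling %s on %s\n", id, ev.Op, ev.Table)
			}
		}(w)
	}

	// A producer (e.g. the log reader) pushes changes onto the queue.
	for i := 0; i < 10; i++ {
		events <- ChangeEvent{Table: "orders", Op: "insert"}
	}
	close(events)
	wg.Wait()
}
```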

The architecture

We built an architecture that drives high rates of CDC efficiency. For example, the consumer pushes data log entries to queues and everything is orchestrated by the manager, which also performs validation against MySQL and produces a status report if something doesn’t fit or fails to connect.

This also drives great flexibility, enabling a new connector to be deployed in less than two weeks, as opposed to the three weeks that are typically required.

SSH connectivity and other topologies

To support our customers’ connectivity needs, we created our own SSH tunnel as an external service that we embedded as a sidecar.

This new solution gives us the ability to align with our clients’ connection topologies and offers a better-performing, more stable solution for their specific needs.
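For the curious, this is roughly what forwarding a database connection through an SSH bastion looks like in Go using the golang.org/x/crypto/ssh package. Host names and credentials are placeholders, and skipping host-key verification is for brevity only; the actual sidecar is more involved:

```go
package main

import (
	"log"

	"golang.org/x/crypto/ssh"
)

func main() {
	// Authenticate against the SSH bastion host that fronts the customer's VPC.
	config := &ssh.ClientConfig{
		User:            "tunnel-user",
		Auth:            []ssh.AuthMethod{ssh.Password("secret")},
		HostKeyCallback: ssh.InsecureIgnoreHostKey(), // verify host keys in real deployments
	}

	bastion, err := ssh.Dial("tcp", "bastion.example.com:22", config)
	if err != nil {
		log.Fatal(err)
	}
	defer bastion.Close()

	// Open a connection to the private database *through* the SSH tunnel.
	dbConn, err := bastion.Dial("tcp", "10.0.0.12:3306")
	if err != nil {
		log.Fatal(err)
	}
	defer dbConn.Close()

	// dbConn is a regular net.Conn; hand it to the database driver as its transport.
	log.Println("tunneled connection established to", dbConn.RemoteAddr())
}
```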

Centralized management

The consumer in this architecture can push data to Kafka. It’s a system that can ingest messages at very high capacity, bringing many different data types quickly from many different sources, all with the ease of use and clarity of centralized management.

A transactional process against Kafka

When it comes to pushing data to Kafka, the execution needs to be transactional through the Kafka API. So, we needed to create a transactional process that collects all the data we need to push, while also letting the user know where the collection started and exactly where it will be pushed.

This was the hard part, but it’s exactly what we did!

Having such visibility into the process helps you avoid delays. When events interrupt consumption, it’s hard to know where you started, where you stopped, and where you need to pick up again, and that can cause a lot of problematic delays.

This bookkeeping of going back and forth enables us to manage any change in our customers’ databases and make sure we’re catching all of the messages quickly and easily.
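As an illustration of the idea (not our actual code), here’s how a transactional push to Kafka can look in Go with the confluent-kafka-go client: either the whole batch of changes lands in Kafka, or none of it does, so you always know exactly where to resume. The topic name and payloads are made up for the example:

```go
package main

import (
	"context"
	"log"

	"github.com/confluentinc/confluent-kafka-go/v2/kafka"
)

func main() {
	// A producer with a transactional.id can write a batch of change events atomically.
	p, err := kafka.NewProducer(&kafka.ConfigMap{
		"bootstrap.servers": "localhost:9092",
		"transactional.id":  "cdc-producer-1",
	})
	if err != nil {
		log.Fatal(err)
	}
	defer p.Close()

	ctx := context.Background()
	if err := p.InitTransactions(ctx); err != nil {
		log.Fatal(err)
	}

	topic := "cdc.orders"
	changes := [][]byte{
		[]byte(`{"op":"insert","id":1}`),
		[]byte(`{"op":"update","id":1}`),
	}

	// All changes in the batch are committed together, or rolled back together.
	if err := p.BeginTransaction(); err != nil {
		log.Fatal(err)
	}
	for _, c := range changes {
		if err := p.Produce(&kafka.Message{
			TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny},
			Value:          c,
		}, nil); err != nil {
			_ = p.AbortTransaction(ctx) // roll back so the batch can be retried from its start position
			log.Fatal(err)
		}
	}
	if err := p.CommitTransaction(ctx); err != nil {
		log.Fatal(err)
	}
}
```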

The flip/flop functionality

Knowing where you started and where you left off is also key to the flip/flop functionality that we designed and implemented.

We created two queues instead of one, so that we could make sure that messages are accepted by Kafka, while at the same time, we could fetch more data from the source without causing any of the processes to be blocked.

Whenever one queue fills up, incoming data goes into the second, empty queue instead of creating a backlog. Only once the full queue has emptied out do we switch back to using it for new incoming data.

This flip/flop functionality is only possible when you know where you started, where you are, and where you left off, so that you only fetch the data you need from the database without any unnecessary replications.
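Here’s a toy Go sketch of the flip/flop idea: two buffers, where filling one and flushing the other happen concurrently. The real implementation also tracks source positions and handles errors, but the switching logic looks roughly like this:

```go
package main

import (
	"fmt"
	"sync"
)

func main() {
	const queueSize = 3
	// Two buffers: while a full one is being flushed to Kafka, new events from
	// the source keep flowing into the other one, so neither side is blocked.
	buffers := [2][]string{}
	active := 0
	var flushing sync.WaitGroup

	flush := func(batch []string) {
		defer flushing.Done()
		fmt.Println("flushing", batch) // stand-in for the transactional push to Kafka
	}

	for i := 1; i <= 7; i++ {
		buffers[active] = append(buffers[active], fmt.Sprintf("event-%d", i))
		if len(buffers[active]) == queueSize {
			full := active
			flushing.Wait()     // the other buffer must have finished its previous flush
			active = 1 - active // flip: incoming events now go to the empty buffer
			buffers[active] = buffers[active][:0]

			batch := append([]string(nil), buffers[full]...) // copy so the flush owns its data
			flushing.Add(1)
			go flush(batch)
		}
	}
	flushing.Wait()
	flushing.Add(1)
	flush(buffers[active]) // flush whatever remains
}
```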

The benefit of this approach isn’t only about avoiding delays, although that alone was a fantastic one: we reduced delays to roughly 10 seconds, versus the 2–3 minutes we saw with Debezium once we opened it up for scale.

Additional benefits include:

  • Eliminating the need for redundancy
  • Greater stability
  • Scalability out of the box

And the biggest benefit of all was that we could finally focus on the business logic and address our customers’ needs, instead of expending tons of energy on database replication.

No overhead. Just value.

Closing thoughts on Change Data Capture and choosing the best CDC platform for your data needs

If you need CDC to help streamline, accelerate, and drive efficiency in synchronizing your data, you could go with a popular platform, but it may not be able to meet all of your needs out of the box.

You could design your own, but keep overhead front of mind. Would you really want to take all that on? Especially when we’ve already done the heavy lifting for you? 🏋️‍♂️

With Rivery’s CDC solution, you get some of the toughest challenges pre-solved for you:

  • A lightweight language that supports multithreading out of the box, with channel-based task queues, strong I/O support, and stability
  • A flexible architecture that drives higher rates of efficiency
  • Built-in SSH connectivity
  • A better transactional process that offers more visibility and minimizes delays
  • Flip/flop functionality with two queues, which eliminates the need for redundancy, provides greater stability, and delivers scalability out of the box
  • Valuable time and resources freed up to focus on business logic instead
