How to integrate legacy systems with event-driven architecture?

Mortaza Ghahremani · Published in CodeX · Aug 21, 2021

When you have designed a system with a new architecture, but other systems in your organization still run on an older architecture and structure, it is your responsibility to design a way to integrate them so that they can continue to collaborate well.

Suppose you have systems built on a microservice architecture that use event-driven design to decouple communication between services. Alongside them are legacy systems whose data is consumed by other systems, including services in the new microservice-based ecosystem.

What approaches do you have at hand when deciding on an integration strategy with the new event-driven system?

Of course, refactoring to a clean CQRS pattern would be a great solution, but it comes at a cost (time and money), so what can we do instead?

In this short essay, I will briefly introduce some approaches that can help you make the right decision in this situation.

Query-based approach

In this approach, you extract data from the underlying data store by querying it at a periodic interval. A client application polls the data store (for example, a relational database) at a predefined interval, converts the results into the specified events, and sends them to the event broker for processing.

One way to query is bulk loading, which loads all of the data in the table at each interval. Of course, this becomes expensive for large data sets, so it is better to look for more optimized solutions.

Instead of loading all the data, we can load it incrementally, fetching only part of it in each interval. To do that, we could follow one of these approaches:

The first option is to use a timestamp field, such as an updated-at column in your table. You store the highest timestamp of the latest result and, in every interval, load all the data modified since that timestamp.

The second option is to use an auto-increment id in your table, fetching records starting from the last id you retrieved in the previous run. This is more reliable than the timestamp, but it requires the table to have an auto-increment id field.
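To make this concrete, here is a minimal sketch of the incremental, id-based polling variant, assuming a PostgreSQL source, a Kafka broker, and the confluent-kafka client; the table, columns, topic name, and connection details are hypothetical placeholders.

```python
import json
import time

import psycopg2                       # assumes a PostgreSQL legacy store; any relational DB works
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})
conn = psycopg2.connect("dbname=legacy user=reader")    # hypothetical legacy data store

last_id = 0                           # in production, persist this offset between runs

while True:
    with conn.cursor() as cur:
        # Fetch only the rows added since the previous poll, using the auto-increment id.
        cur.execute(
            "SELECT id, customer_name, total FROM orders WHERE id > %s ORDER BY id",
            (last_id,),
        )
        for row_id, customer_name, total in cur.fetchall():
            event = {"id": row_id, "customerName": customer_name, "total": float(total)}
            # Convert the row into an event and hand it to the broker.
            producer.produce("legacy.orders", key=str(row_id), value=json.dumps(event))
            last_id = row_id
    producer.flush()
    time.sleep(30)                    # predefined polling interval
```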

Change-data capture log approach

Change Data Capture is a software process that identifies and tracks changes to data in a source database. CDC provides real-time or near-real-time movement of data by moving and processing data continuously as new database events occur.

In high-velocity data environments where time-sensitive decisions are made, Change Data Capture is an excellent fit to achieve low-latency, reliable, and scalable data replication.

Log-based change data capture is the best method for CDC. Databases contain transaction logs (also called redo logs) that store all database events, allowing the database to be recovered in the event of a crash. With log-based change data capture, new database transactions (including inserts, updates, and deletes) are read from the source database's native transaction log. This append-only log contains all of the information about everything that has happened to the tracked data sets over time. However, not every data store implements an immutable change log, so this approach is somewhat more limited in applicability than the query-based one.

There are a number of options available for sourcing data from changelogs. Debezium is one of the most popular choices for relational databases, as it supports the most common ones. Debezium can produce records to both Apache Kafka and Apache Pulsar with its existing implementations. Support for additional brokers is certainly possible, though it may require some in-house development work. Maxwell is another example of a binary log reader, though it is currently limited to MySQL databases and can produce data only to Apache Kafka.

A Kafka Connect service running a Debezium connector consumes the raw binary log. Debezium parses the data and converts it into discrete events. Next, an event router emits each event to a specific event stream in Kafka, depending on the source table of that event. Downstream consumers are then able to access the database content by consuming the relevant event streams from Kafka.
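As a rough illustration, a Debezium MySQL source connector might be registered through the Kafka Connect REST API as shown below. The hostnames, credentials, and table names are placeholders, and the exact property names depend on the Debezium version you run, so treat this as a sketch rather than a drop-in configuration.

```python
import json

import requests  # used to call the Kafka Connect REST API (default port 8083)

# Hypothetical Debezium MySQL source connector configuration; some property
# names differ between Debezium versions, so check the docs for the one you use.
connector = {
    "name": "legacy-orders-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "legacy-db",            # placeholder host
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "secret",
        "database.server.id": "184054",
        "database.server.name": "legacy",            # prefix for the emitted event streams
        "table.include.list": "sales.orders",        # capture only the tables you need
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "schema-changes.legacy",
    },
}

resp = requests.post(
    "http://connect:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
```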

Kafka Connect is focused on streaming data to and from Kafka, making it simpler for you to write high-quality, reliable, and high-performance connector plugins.

Kafka Connect includes two types of connectors:

- Source connector: ingests entire databases and streams table updates to Kafka topics.
- Sink connector: delivers data from Kafka topics into another data store.
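On the sink side, a similar sketch could register a JDBC sink connector that writes a change stream into a reporting database. The connection details are placeholders, the property names follow the Confluent JDBC sink connector, and the `unwrap` transform assumes the topic carries Debezium-style change events.

```python
import json

import requests

# Hypothetical JDBC sink connector configuration: writes the "legacy.sales.orders"
# change stream into a reporting database. Adjust names for the connector version you run.
sink = {
    "name": "orders-reporting-sink",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "topics": "legacy.sales.orders",
        "connection.url": "jdbc:postgresql://reporting-db:5432/reporting",
        "connection.user": "reporter",
        "connection.password": "secret",
        # Debezium events wrap the row in an envelope; this transform extracts the new state.
        "transforms": "unwrap",
        "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
        "insert.mode": "upsert",
        "pk.mode": "record_key",
        "pk.fields": "id",               # assumes the source table's primary key is "id"
        "auto.create": "true",           # let the connector create the target table
    },
}

resp = requests.post(
    "http://connect:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(sink),
)
resp.raise_for_status()
```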

Outbox table approach

In this approach, alongside every command against the database, you insert the corresponding event data into an outbox table within the same transaction. As in the first approach, a periodic job is needed to read events from the outbox table and send them to the event broker.
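A minimal sketch of the pattern, assuming PostgreSQL, the confluent-kafka client, and a hypothetical outbox table with (id, topic, payload, created_at DEFAULT now()) columns, could look like this:

```python
import json
import uuid

import psycopg2
from confluent_kafka import Producer

conn = psycopg2.connect("dbname=app user=app")         # hypothetical application database
producer = Producer({"bootstrap.servers": "localhost:9092"})

def place_order(customer_id: int, total: float) -> None:
    """Write the business record and its outbox event in one transaction."""
    with conn:                                          # commits both inserts atomically
        with conn.cursor() as cur:
            cur.execute(
                "INSERT INTO orders (customer_id, total) VALUES (%s, %s) RETURNING id",
                (customer_id, total),
            )
            order_id = cur.fetchone()[0]
            payload = {"orderId": order_id, "customerId": customer_id, "total": total}
            cur.execute(
                "INSERT INTO outbox (id, topic, payload) VALUES (%s, %s, %s)",
                (str(uuid.uuid4()), "orders", json.dumps(payload)),
            )

def relay_outbox() -> None:
    """Periodic job: read outbox rows, publish them to the broker, then remove them."""
    with conn:
        with conn.cursor() as cur:
            cur.execute("SELECT id, topic, payload FROM outbox ORDER BY created_at LIMIT 100")
            rows = cur.fetchall()
            for event_id, topic, payload in rows:
                producer.produce(topic, key=event_id, value=payload)
            producer.flush()                            # wait for broker acknowledgement
            for event_id, _, _ in rows:                 # only delete what was delivered
                cur.execute("DELETE FROM outbox WHERE id = %s", (event_id,))
```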

Conclusion

The customizability of the query-based approach gives it great flexibility to query and fetch complex data combinations. It also isolates the internal data model, so domain model information that should not be exposed outside the data store can stay hidden. But it has disadvantages: maintaining the queries as the schema changes is a significant burden, and the queries consume the underlying system's resources to execute, which can cause unacceptable delays on a production system.

On the other side, with the CDC approach, for data stores that use write-ahead or binary logs, change-data capture can be performed without any impact on the data store's performance, and updates can be propagated as soon as the event is written to the log. However, CDC also has disadvantages: denormalization must occur outside of the data store, which may lead to highly normalized event streams and force downstream microservices to handle foreign-key joins and denormalization themselves. In addition, this approach is not supported by every database technology.

The outbox table approach can be used with any client or framework that exposes transactional capabilities. You can also keep internal fields isolated, which is not possible with the CDC approach, and the data can be denormalized as needed before being written to the outbox table.

Of course, it has drawbacks: the application code must be changed to enable this pattern; the performance impact on the business workflow may be nontrivial, particularly when validating schemas via serialization; and the performance impact on the data store may also be nontrivial, especially when a significant quantity of records is being written to, read from, and deleted from the outbox.

Resources

- Building Event-Driven Microservices: Leveraging Organizational Data at Scale, Adam Bellemare

- https://docs.confluent.io/platform/current/connect/index.html

- https://www.striim.com/change-data-capture-cdc-what-it-is-and-how-it-works/

- https://www.striim.com/log-based-change-data-capture/

