Combine Data at Rest and Data in Motion with CDC for Apache Cassandra

Authors: Chris Latimer, Swetha Polamreddy

In this post, we’ll introduce you to the DataStax Cassandra Source Connector for Pulsar, a new way to bridge your Apache Cassandra® database with Pulsar to combine data at rest and data in motion into a single, streamlined data platform.

When building a data platform, you have to consider the broad needs of two types of consumers: those who need access to data at rest and those who need access to data in motion.

When we think about data at rest, we typically think about making data queryable in a convenient way for a wide range of consumers. This often involves balancing the need for guaranteed consistency with the need for high performance through a combination of normalization and denormalization.

On the other hand, enterprises are increasingly facing growing demands around data in motion. This usually means that real-time data pipelines, subscribable event streams, and event stream processing have become core to an enterprise’s data platform strategy.

One area where these two aspects of data converge is change data capture (CDC). The idea behind CDC is that you continue to write changes to your database as you normally would, but in addition to persisting each change, an event describing it is published to a messaging platform such as Apache Pulsar. These events can be used in a few ways:

  • As a triggering mechanism to invoke arbitrary business logic in response to a change
  • As a replayable audit log that captures the history of a piece of data in your datastore
  • As a way to replicate and move data throughout your organization as part of a real-time ETL use case
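
The three uses above can be sketched with a small in-memory model. The `ChangeEvent` fields, the `ChangeStream` class, and the handler names are all hypothetical and chosen only for illustration; in a real deployment the events would arrive from a Pulsar topic rather than from a Python list of handlers.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ChangeEvent:
    """Hypothetical change event shape; not the connector's actual schema."""
    table: str
    key: str
    op: str                      # "upsert" or "delete"
    row: Dict[str, str] = field(default_factory=dict)


class ChangeStream:
    """In-memory stand-in for a topic: delivers each event to every handler."""

    def __init__(self) -> None:
        self.handlers: List[Callable[[ChangeEvent], None]] = []

    def subscribe(self, handler: Callable[[ChangeEvent], None]) -> None:
        self.handlers.append(handler)

    def publish(self, event: ChangeEvent) -> None:
        for handler in self.handlers:
            handler(event)


notifications: List[str] = []       # output of the trigger use case
audit_log: List[ChangeEvent] = []   # replayable history of every change
replica: Dict[str, Dict[str, str]] = {}  # copy of the table kept elsewhere


def trigger(event: ChangeEvent) -> None:
    # Business logic that fires only on a specific kind of change.
    if event.op == "upsert" and event.row.get("status") == "shipped":
        notifications.append(f"order {event.key} shipped")


def audit(event: ChangeEvent) -> None:
    audit_log.append(event)


def replicate(event: ChangeEvent) -> None:
    if event.op == "delete":
        replica.pop(event.key, None)
    else:
        replica[event.key] = dict(event.row)


stream = ChangeStream()
for handler in (trigger, audit, replicate):
    stream.subscribe(handler)

stream.publish(ChangeEvent("orders", "o-1", "upsert", {"status": "placed"}))
stream.publish(ChangeEvent("orders", "o-1", "upsert", {"status": "shipped"}))
stream.publish(ChangeEvent("orders", "o-1", "delete"))
```

Each handler sees the same events but uses them differently, which is why a single change stream can serve several consumers at once.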

Solution architecture

At a logical level, we want to create something that looks like the following diagram:

Figure 1: Architecture of a cohesive data platform.

To accomplish this, we need four things:

  1. Apache Cassandra deployment
  2. Apache Pulsar deployment
  3. DataStax Cassandra Source Connector for Apache Pulsar
  4. DataStax Change Agent for Cassandra

Working together, these components will provide a complete solution that allows you to have the following:

  • A highly scalable NoSQL database for data at rest with Cassandra.
  • A unified messaging/event streaming platform in Pulsar.
  • A source connector that links the two together.

The overall implementation architecture is shown in the figure below:

Figure 2: Architecture of CDC for Cassandra.

This architecture will allow us to capture events that correspond to data changes in our Cassandra database. It will also use the full capabilities of Apache Pulsar to process those data change event streams. Furthermore, the solution provides sensible default configurations that handle deduplication and produce a change stream that converges to the same eventually consistent state as the source Cassandra cluster.
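
To illustrate why deduplication matters, here is a minimal sketch of the general idea, assuming that several replicas can each report the same write. A consumer fingerprints each mutation and drops any it has already seen. The `mutation_digest` function, the `Deduplicator` class, and the event shape are hypothetical simplifications, not the connector's implementation.

```python
import hashlib
from collections import OrderedDict
from typing import List, Tuple


def mutation_digest(key: str, payload: str) -> str:
    # MD5 is used here only as a content fingerprint, not for security.
    return hashlib.md5(f"{key}|{payload}".encode("utf-8")).hexdigest()


class Deduplicator:
    """Bounded cache of (key, digest) pairs that have already been forwarded."""

    def __init__(self, max_entries: int = 1000) -> None:
        self.seen: "OrderedDict[Tuple[str, str], bool]" = OrderedDict()
        self.max_entries = max_entries

    def is_new(self, key: str, digest: str) -> bool:
        entry = (key, digest)
        if entry in self.seen:
            self.seen.move_to_end(entry)   # keep recently seen entries cached
            return False
        self.seen[entry] = True
        if len(self.seen) > self.max_entries:
            self.seen.popitem(last=False)  # evict the oldest entry
        return True


# Three replicas report the same write, then a later write hits the same key.
events: List[Tuple[str, str]] = [("user-1", "name=Ada")] * 3 + [("user-1", "name=Ada L.")]

dedup = Deduplicator()
forwarded = [
    (key, payload)
    for key, payload in events
    if dedup.is_new(key, mutation_digest(key, payload))
]
```

Only two of the four events are forwarded: one for each distinct write. The cache is bounded, so a sketch like this trades perfect deduplication for a fixed memory footprint.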

With this solution configured, let’s look at the common use cases and patterns that this architecture enables, and at how using CDC with Cassandra can help organizations improve their ability to act on change events in real time.

Per-table event topics pattern

When creating an enterprise data platform, we don’t always know in advance which change data events subscribers will need. As a result, one approach that enterprises often follow is to build one or more event topics for every table in the database. This ensures that any downstream application that needs to subscribe to change events is already supported by your data platform, without introducing a dependent workstream for each new project that comes along.

Figure 3: Diagram showing an event topics pattern per table.

When following this pattern, any time a new table is created, a corresponding topic is created to capture the change data event stream for that table. This can, of course, result in a large number of topics. One of the advantages of using Apache Pulsar for this use case is that topics are lightweight enough that creating one per table adds very little overhead.
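
As a small illustration of the one-topic-per-table idea, the sketch below derives a topic name for each table. The `persistent://tenant/namespace/topic` form is Pulsar's standard topic naming scheme; the `data-` prefix, the `public/default` tenant and namespace, and the table names are assumptions made for this example, so check your connector configuration for the names it actually uses.

```python
from typing import Dict, List, Tuple


def topic_for_table(
    keyspace: str,
    table: str,
    tenant: str = "public",       # assumed tenant
    namespace: str = "default",   # assumed namespace
    prefix: str = "data-",        # assumed topic prefix
) -> str:
    """Build a fully qualified Pulsar topic name for one Cassandra table."""
    return f"persistent://{tenant}/{namespace}/{prefix}{keyspace}.{table}"


# Hypothetical tables in a hypothetical "shop" keyspace.
tables: List[Tuple[str, str]] = [
    ("shop", "orders"),
    ("shop", "customers"),
    ("shop", "payments"),
]

topics: Dict[str, str] = {
    f"{keyspace}.{table}": topic_for_table(keyspace, table)
    for keyspace, table in tables
}
```

Deriving the name from the keyspace and table keeps the mapping predictable, so a new consumer can find the topic for a table without consulting a registry.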

For enterprises pursuing an event-driven architecture strategy, this is a major advantage because you can have a subscribable set of events for nearly every aspect of your business at the ready. Until the first consumer subscribes to an event stream, you can configure Pulsar to discard its messages rather than store them.

However, as soon as a subscriber needs the change events, Pulsar will start retaining messages until the subscriber has received and acknowledged each one. This leads to a highly flexible, cost-effective approach to implementing an event-driven architecture for your change data capture events in Cassandra.
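
The retention behavior described above can be pictured with a toy model. `ToyTopic` is a hypothetical class written for this post, not part of any Pulsar client library, and it ignores details such as retention policies and backlog quotas.

```python
from collections import deque
from typing import Deque, Dict, Optional


class ToyTopic:
    """Toy model: messages are kept only while a subscription still needs them."""

    def __init__(self) -> None:
        self.backlogs: Dict[str, Deque[str]] = {}  # subscription -> unacked messages
        self.discarded = 0

    def subscribe(self, name: str) -> None:
        self.backlogs.setdefault(name, deque())

    def publish(self, message: str) -> None:
        if not self.backlogs:
            # No subscription exists yet, so nothing needs to be stored.
            self.discarded += 1
            return
        for backlog in self.backlogs.values():
            backlog.append(message)

    def receive(self, name: str) -> Optional[str]:
        backlog = self.backlogs[name]
        return backlog[0] if backlog else None

    def acknowledge(self, name: str) -> None:
        # Acknowledging releases the oldest unacknowledged message.
        self.backlogs[name].popleft()


topic = ToyTopic()
topic.publish("change-1")            # discarded: nobody is subscribed
topic.subscribe("order-history")
topic.publish("change-2")            # retained for the new subscription
topic.publish("change-3")

first = topic.receive("order-history")
topic.acknowledge("order-history")
remaining = list(topic.backlogs["order-history"])
```

In this model, storage is consumed only for streams that someone has actually subscribed to, which is the property that makes a topic per table affordable.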

Database-per-service pattern

One area where NoSQL databases excel over relational databases is write-heavy workloads. This is especially true of Cassandra. It’s generally better to write data into your NoSQL database in the same format in which you’d like to retrieve it later.

However, this isn’t always possible because you may have multiple consumers who need access to the data in different formats for different purposes. These different formats can stem from governance rules, where data must be scrubbed for non-privileged consumers, or where data must be enriched to create a view that contains related data.

For organizations using a microservices architecture, a common pattern to address the needs of different consumers is called database-per-service. This pattern creates a read-optimized view of the data that is customized to meet the specific needs of each microservice.

As a simple example, consider the case of an order on an e-commerce system. When a user submits an order, it often contains payment information, such as credit card or bank account details. This data is highly controlled, and downstream services shouldn’t have access to it. One way to address this use case is to follow the database-per-service pattern and create a view of the order data for each microservice.

In the example below, a service responsible for retrieving a user’s order history needs to know details about what was ordered, but not the payment details used for that order. We might implement this pattern with a solution similar to the one shown here:

Figure 4: Example of database-per-service used in an e-commerce setting.

This solution ensures that the historical orders service won’t have access to sensitive data and provides a view of the data tailored to that service’s needs. One of the advantages of using Apache Pulsar for this use case is that we can leverage Pulsar Functions to handle any scrubbing, transformation, and enrichment needed to create the read-optimized view for the service. Because Pulsar also provides a sink connector for Apache Cassandra, the entire pipeline is straightforward to implement.
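
Here is a hedged sketch of what the scrubbing step might look like, written as a plain Python function that takes a message and returns the transformed message, in the style of a language-native Pulsar Function. The field names (`card_number`, `card_cvv`, `bank_account`, `items`, `quantity`) and the JSON encoding are assumptions for this example; a real order schema will differ.

```python
import json

# Hypothetical names of the payment fields that must not leave the orders service.
SENSITIVE_FIELDS = {"card_number", "card_cvv", "bank_account"}


def process(input):
    """Scrub payment details from an order and add a small derived field.

    `input` is assumed to be a JSON string; the parameter name follows the
    convention used for simple Python functions and shadows the builtin.
    """
    order = json.loads(input)

    # Scrubbing: drop every sensitive field.
    scrubbed = {k: v for k, v in order.items() if k not in SENSITIVE_FIELDS}

    # Enrichment: precompute a value the order history view needs.
    scrubbed["item_count"] = sum(
        item["quantity"] for item in order.get("items", [])
    )
    return json.dumps(scrubbed, sort_keys=True)


sample = json.dumps({
    "order_id": "o-42",
    "card_number": "4111-0000-0000-0000",
    "card_cvv": "123",
    "items": [
        {"sku": "book", "quantity": 2},
        {"sku": "pen", "quantity": 1},
    ],
})
result = json.loads(process(sample))
```

The output of a function like this could then be written to a topic that feeds the order history service’s own table, so that service never sees the payment fields at all.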

Real-time data pipelines

Enterprises have long relied on ETL (Extract, Transform, Load) processes to replicate data across various operational data stores and data warehouses. Until relatively recently, ETL processes typically ran as scheduled batch jobs, but this slower approach can no longer keep pace with the needs of modern enterprises.

So, the final pattern we’ll consider in this post is real-time data pipelines. Real-time data pipelines can be thought of as a generalization of the database-per-service pattern described above. Instead of optimizing a data view for a particular service, we’re creating continuously updated data views in data stores throughout the enterprise. These could be real-time analytics dashboards, cloud data warehouses, or traditional relational databases.

Figure 5: Diagram showing a real-time data pipeline.

With this pattern, enterprises can modernize their ETL with real-time pipelines while at the same time unifying their data platform’s approach to data in motion and data at rest. Since the transformed messages are available as an event stream, applications that need to trigger on changes to these downstream datastores can simply subscribe to the same event stream topics used to update the downstream stores. This removes the burden on these downstream data stores to publish another set of change events, since consumers can just listen to the existing topics instead.
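
The "transform once, consume many times" idea can be sketched in a few lines. Everything here is hypothetical: the event fields, the `transform` function, and the three consumers stand in for a Pulsar topic of transformed messages and the sinks and applications that subscribe to it.

```python
from collections import Counter
from typing import Dict, List


def transform(change: Dict) -> Dict:
    """Hypothetical transformation: keep analytic fields and normalize the country."""
    return {
        "order_id": change["order_id"],
        "country": change["country"].upper(),
        "total": round(change["total"], 2),
    }


# Raw change events, including a later update to order o-1.
changes: List[Dict] = [
    {"order_id": "o-1", "country": "us", "total": 20.0, "card_number": "x"},
    {"order_id": "o-2", "country": "fr", "total": 35.5, "card_number": "y"},
    {"order_id": "o-1", "country": "us", "total": 25.0, "card_number": "x"},
]

# Stand-in for the topic that carries transformed messages.
transformed_topic = [transform(change) for change in changes]

# Consumer 1: a warehouse table, upserted by key so the latest change wins.
warehouse: Dict[str, Dict] = {}
for event in transformed_topic:
    warehouse[event["order_id"]] = event

# Consumer 2: a dashboard aggregate computed from the current warehouse state.
dashboard = Counter(row["country"] for row in warehouse.values())

# Consumer 3: an application that triggers on the same stream, with no extra
# change feed required from the warehouse.
alerts = [event["order_id"] for event in transformed_topic if event["total"] > 30]
```

All three consumers read the same transformed stream, so adding a new downstream store or trigger does not require any change to the source database or to the other consumers.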

Ready to build your own CDC solution?

The demands on event-driven architectures and real-time data processing have resulted in an ever-increasing need to combine data at rest and data in motion into a cohesive data platform. As we’ve shown in this post, leveraging Apache Cassandra with Apache Pulsar provides a complete platform approach to this problem.

If you’re ready to get started and want to build your own CDC solution, check out our DataStax CDC for Cassandra webpage and our guide to installing Apache Pulsar using DataStax Luna Streaming. To bridge the two, try our new Cassandra Source Connector for Apache Pulsar!

If you’re interested in CDC but need a hand getting things up and running, request a meeting with one of our DataStax experts today for personalized advice that’ll put you on the right track.

Follow DataStax on Medium for exclusive posts on all things Pulsar, Cassandra, streaming, and more.

References

  1. Download: DataStax Cassandra Source Connector for Apache Pulsar
  2. Download: DataStax Change Agent for Cassandra
  3. Quickstart guide to installing Apache Cassandra
  4. Quickstart guide to installing Apache Pulsar using DataStax Luna Streaming
  5. Apache Pulsar Functions Overview
