Combine Data at Rest and Data in Motion with CDC for Apache Cassandra
In this post, we’ll introduce you to the DataStax Cassandra Source Connector for Pulsar, a new way to bridge your Apache Cassandra® database with Pulsar to combine data at rest and data in motion into a single, streamlined data platform.
When building a data platform, you have to consider the broad needs of two types of consumers: those who need access to data at rest, and those who need access to data in motion.
When we think about data at rest, we typically think about making data queryable in a convenient way for a wide range of consumers. This often involves balancing the need for guaranteed consistency with the need for high performance through a combination of normalization and denormalization.
On the other hand, enterprises are increasingly facing growing demands around data in motion. This usually means that real-time data pipelines, subscribable event streams, and event stream processing have become core to an enterprise’s data platform strategy.
One area where these two aspects of data are converging is change data capture (CDC). The idea behind CDC is that you can continue to write changes to your database as you normally would, but aside from persisting these changes, an additional event is published onto a messaging platform, such as Apache Pulsar. These events can be used in a few ways:
- As a triggering mechanism to invoke arbitrary business logic in response to a change
- As a replayable audit log that can capture the history of a piece of data in your datastore
- To replicate and move data throughout your organization as part of a real-time ETL use case
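To make the idea concrete, here is a minimal in-memory sketch of CDC in Python. It is not how Cassandra or Pulsar implement it: `CapturedStore`, `ChangeEvent`, and `replay` are hypothetical names, the list `log` stands in for a topic, and the callbacks stand in for subscribers. It only illustrates that each write is persisted and also published as an event, and that the events can drive triggers or be replayed.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

RowKey = Tuple[str, str]  # (table, primary key); illustrative only


@dataclass(frozen=True)
class ChangeEvent:
    """One change to one row. Field names are assumptions for this sketch."""
    table: str
    key: str
    value: Optional[dict]  # None marks a delete


@dataclass
class CapturedStore:
    """In-memory stand-in for a database with change data capture enabled."""
    rows: Dict[RowKey, dict] = field(default_factory=dict)
    log: List[ChangeEvent] = field(default_factory=list)  # stand-in for a topic
    triggers: List[Callable[[ChangeEvent], None]] = field(default_factory=list)

    def write(self, table: str, key: str, value: Optional[dict]) -> None:
        # 1. Persist the change as you normally would.
        if value is None:
            self.rows.pop((table, key), None)
        else:
            self.rows[(table, key)] = value
        # 2. Publish an additional event describing the change.
        event = ChangeEvent(table, key, value)
        self.log.append(event)
        # 3. Any registered business logic reacts to the change.
        for trigger in self.triggers:
            trigger(event)


def replay(log: List[ChangeEvent]) -> Dict[RowKey, dict]:
    """Rebuild table state from the event log (audit and replication uses)."""
    state: Dict[RowKey, dict] = {}
    for event in log:
        if event.value is None:
            state.pop((event.table, event.key), None)
        else:
            state[(event.table, event.key)] = event.value
    return state


store = CapturedStore()
notified: List[str] = []
store.triggers.append(lambda event: notified.append(event.key))

store.write("orders", "o-1", {"status": "new"})
store.write("orders", "o-1", {"status": "shipped"})
store.write("orders", "o-2", {"status": "new"})
store.write("orders", "o-2", None)  # delete

replica = replay(store.log)  # a second store rebuilt purely from events
```

Replaying the log yields the same rows as the original store, which is what makes the event stream usable for replication as well as for auditing.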
At a logical level, we want to create something that looks like the following diagram:
To accomplish this, we need four things:
- Apache Cassandra deployment
- Apache Pulsar deployment
- DataStax Cassandra Source Connector for Apache Pulsar
- DataStax Change Agent for Cassandra
Working together, these components will provide a complete solution that allows you to have the following:
- A highly-scalable NoSQL database for data at rest with Cassandra.
- A unified messaging/event streaming platform in Pulsar.
- A source connector that links the two together.
The overall implementation architecture is shown in the figure below:
This architecture lets us capture events that correspond to data changes in our Cassandra database and process those change event streams with the full capabilities of Apache Pulsar. Furthermore, this solution provides sensible default configurations that address deduplication and produce a change stream that converges to the same eventually consistent state as the source Cassandra cluster.
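Deduplication matters because a replicated database can report the same change more than once. The Python sketch below illustrates the two properties in general terms and is not the connector's actual algorithm: it assumes each mutation is a dictionary with hypothetical `key` and `ts` (write timestamp) fields, drops repeated copies by fingerprint, and then applies last-write-wins so that the result does not depend on arrival order.

```python
import hashlib
import json
from typing import Dict, Iterable, List


def fingerprint(mutation: dict) -> str:
    """Stable hash of a mutation; identical copies produce the same value."""
    payload = json.dumps(mutation, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def deduplicate(mutations: Iterable[dict]) -> List[dict]:
    """Keep the first copy of each mutation and drop later duplicates."""
    seen = set()
    unique = []
    for mutation in mutations:
        mark = fingerprint(mutation)
        if mark not in seen:
            seen.add(mark)
            unique.append(mutation)
    return unique


def converge(mutations: Iterable[dict]) -> Dict[str, dict]:
    """Last-write-wins per key: the highest write timestamp survives."""
    state: Dict[str, dict] = {}
    for mutation in mutations:
        current = state.get(mutation["key"])
        if current is None or mutation["ts"] > current["ts"]:
            state[mutation["key"]] = mutation
    return state


# Two writes to the same row, each reported three times and out of order.
m1 = {"key": "o-1", "ts": 1, "status": "new"}
m2 = {"key": "o-1", "ts": 2, "status": "shipped"}
stream = [m2, m1, m1, m2, m2, m1]

unique = deduplicate(stream)
final = converge(unique)
```

Because the winner is chosen by timestamp rather than by arrival order, processing the same mutations in a different order ends in the same state.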
With this solution configured, let’s look at the common use cases and patterns that this architecture enables, and how using CDC with Cassandra can help organizations improve their ability to act upon change events in real time.
Per-table event topics pattern
When creating an enterprise data platform, we don’t always know in advance which change data events subscribers will need access to. As a result, one approach that enterprises often follow is to build one or more event topics for every table in the database. This ensures that any downstream applications that need to subscribe to change events will be supported by your data platform without introducing a dependent workstream for each new project that comes along.
When following this pattern, anytime a new table is created, a corresponding topic is created to capture the change data event stream for that table. This can, of course, result in a potentially large number of topics. One of the advantages of using Apache Pulsar for this use case is that topics are so cheap that they’re effectively free.
For enterprises pursuing an event-driven architecture strategy, this is a major advantage because you can effectively have a subscribable set of events for nearly every aspect of your business at the ready. Until the first consumer subscribes to the event stream, you can configure Pulsar to effectively discard the stream.
However, as soon as a subscriber needs the change events, Pulsar will start retaining the messages until the subscriber has received and acknowledged each message. This leads to a highly flexible, cost-effective approach to implementing an event-driven architecture for your data capture events in Cassandra.
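The retention behavior described above can be pictured with a toy model in Python. This is not Pulsar's API or storage engine; `Topic` and its methods are hypothetical, and the model only mirrors the rule that messages are discarded while nobody is subscribed and kept until every subscription has acknowledged them.

```python
from typing import Dict, List, Optional


class Topic:
    """Toy model of discard-until-subscribed, retain-until-acknowledged."""

    def __init__(self) -> None:
        self.messages: List[str] = []      # retained messages, oldest first
        self.base = 0                      # stream position of messages[0]
        self.cursors: Dict[str, int] = {}  # subscription -> next unacked position

    def subscribe(self, name: str) -> None:
        # A new subscription starts at the current end of the stream.
        self.cursors.setdefault(name, self.base + len(self.messages))

    def publish(self, message: str) -> None:
        if not self.cursors:
            self.base += 1                 # no subscriber: discard the message
            return
        self.messages.append(message)

    def receive(self, name: str) -> Optional[str]:
        index = self.cursors[name] - self.base
        return self.messages[index] if index < len(self.messages) else None

    def acknowledge(self, name: str) -> None:
        if self.receive(name) is None:
            return                         # nothing outstanding to acknowledge
        self.cursors[name] += 1
        # Drop everything that every subscription has acknowledged.
        oldest_needed = min(self.cursors.values())
        del self.messages[: oldest_needed - self.base]
        self.base = oldest_needed


topic = Topic()
topic.publish("change-1")                  # nobody is listening yet: discarded
topic.subscribe("order-history")
topic.publish("change-2")                  # retained for the subscriber

first = topic.receive("order-history")
retained_before_ack = len(topic.messages)
topic.acknowledge("order-history")
retained_after_ack = len(topic.messages)
```

In this model, an unused topic holds no messages, so the only cost of a per-table topic arises once a consumer actually subscribes.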
Database per service pattern
One of the areas where NoSQL excels over relational databases is write-heavy workloads. This is especially true when working with Cassandra. It’s generally better to write data into your NoSQL database in the same format you’d like to retrieve it in later.
However, this isn’t always possible because you may have multiple consumers who need access to the data in different formats for different purposes. These different formats can stem from governance rules, where data must be scrubbed for non-privileged consumers, or where data must be enriched to create a view that contains related data.
For organizations using a microservices architecture, a common pattern to address the needs of different consumers is called database-per-service. This creates a read-optimized view of data that is customized to meet the specific needs of each microservice.
As a simple example, consider the case of an order on an e-commerce system. When a user submits their order, it often contains payment information, such as credit card or bank account information. This data is obviously highly controlled and downstream services shouldn’t have access to it. One way to address this use case is to follow the database-per-service pattern and create a view of the order data for each microservice.
In the example below, a service responsible for retrieving a user’s order history needs to know details about what was ordered, but not the payment details used for that order. We might implement this pattern with a solution similar to the one shown here:
This solution ensures that the historical orders service won’t have access to sensitive data and provides a view of the data tailored to that service’s needs. One of the advantages of using Apache Pulsar for this use case is that we can leverage Pulsar Functions to handle any scrubbing, transformation, and enrichment needed to create the read-optimized view for the service. Because Pulsar also has a powerful sink connector to Apache Cassandra, the entire pipeline is straightforward to implement.
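As a rough sketch of the scrubbing step, the Python below is written in the plain `process(input)` style that Pulsar Functions accept for Python. It rests on several assumptions: the change event arrives as a JSON string (real change events may use a different schema), and the field names `card_number`, `bank_account`, and `payment_method` are hypothetical stand-ins for whatever payment data an order carries.

```python
import json

# Hypothetical names for the payment fields that must not reach the
# order history service.
SENSITIVE_FIELDS = {"card_number", "bank_account", "payment_method"}


def scrub_order(order: dict) -> dict:
    """Return a copy of the order without any sensitive payment fields."""
    return {name: value for name, value in order.items()
            if name not in SENSITIVE_FIELDS}


def process(input):
    """Entry point in the style of a Python Pulsar Function.

    Takes one message from the input topic and returns the message to
    publish on the output topic, which a sink connector could then write
    to the order history table.
    """
    order = json.loads(input)
    return json.dumps(scrub_order(order), sort_keys=True)


raw = json.dumps({
    "order_id": "o-1",
    "items": ["book", "lamp"],
    "card_number": "XXXX-XXXX-XXXX-0000",
    "payment_method": "credit_card",
})
scrubbed = json.loads(process(raw))
```

Keeping the scrubbing logic in a separate pure function such as `scrub_order` makes it easy to unit test without a running Pulsar cluster.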
Real-time data pipelines
Enterprises have long relied on ETL (Extract, Transform, Load) processes to replicate data across various operational data stores and data warehouses. Until relatively recently, ETL processes were scheduled and followed batch processing, but this slower approach can no longer keep pace with the needs of modern enterprises.
So, the final pattern we’ll consider in this post is real-time data pipelines. Real-time data pipelines can be thought of as a generalization of the database-per-service pattern described above. Instead of optimizing a data view for a particular service, we’re creating continuously updating data views in data stores throughout the enterprise. These could be driving real-time analytics dashboards, cloud data warehouses, or traditional RDBMS systems.
With this pattern, enterprises can modernize their ETL with real-time pipelines while at the same time unifying their data platform’s approach to data in motion and data at rest. Since the transformed messages are available as an event stream, applications that need to react to changes in these downstream datastores can simply subscribe to the same event stream topics used to update the downstream stores. This relieves these downstream data stores of the burden of publishing another set of change events, since consumers can just listen to the existing topics instead.
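The fan-out at the heart of this pattern can be sketched in a few lines of Python. Everything here is illustrative: the event fields, the `to_record` transformation, and the three consumers are assumptions, and in-memory structures stand in for a data warehouse, a dashboard, and a triggering application that all read the same transformed stream.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List

warehouse: Dict[str, dict] = {}                      # stand-in for a warehouse table
revenue_by_region: Dict[str, float] = defaultdict(float)  # stand-in for a dashboard
alerts: List[str] = []                               # stand-in for a triggered app


def to_record(event: dict) -> dict:
    """Transform a raw change event into the shape downstream stores expect."""
    return {
        "order_id": event["key"],
        "region": event["region"].upper(),
        "total": event["total"],
    }


def load_warehouse(record: dict) -> None:
    warehouse[record["order_id"]] = record           # upsert by order id


def update_dashboard(record: dict) -> None:
    revenue_by_region[record["region"]] += record["total"]


def large_order_trigger(record: dict) -> None:
    if record["total"] >= 1000:
        alerts.append(record["order_id"])


def run_pipeline(events: Iterable[dict],
                 transform: Callable[[dict], dict],
                 consumers: List[Callable[[dict], None]]) -> None:
    """Transform each event once, then hand it to every consumer."""
    for event in events:
        record = transform(event)
        for consume in consumers:
            consume(record)


events = [
    {"key": "o-1", "region": "emea", "total": 250.0},
    {"key": "o-2", "region": "emea", "total": 1200.0},
    {"key": "o-3", "region": "apac", "total": 50.0},
]
run_pipeline(events, to_record,
             [load_warehouse, update_dashboard, large_order_trigger])
```

The triggering application consumes the same transformed records as the stores, so neither the warehouse nor the dashboard has to publish change events of its own.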
Ready to build your own CDC solution?
The demands of event-driven architectures and real-time data processing have created an ever-increasing need to combine data at rest and data in motion into a cohesive data platform. As we’ve shown in this post, pairing Apache Cassandra with Apache Pulsar provides a complete platform approach to this problem.
If you’re ready to get started and want to build your own CDC solution, check out our DataStax CDC for Cassandra webpage and our guide to installing Apache Pulsar using DataStax Luna Streaming. To bridge the two, try our new Cassandra Source Connector for Apache Pulsar!
If you’re interested in CDC but need a hand getting things up and running, request a meeting with one of our DataStax experts today for personalized advice that’ll put you on the right track.