Stream Your Data Everywhere with Astra DB, CDC, and Astra Streaming
Author: Sebastian Estevez
Sending the data in your Apache Cassandra® NoSQL database to downstream systems has never been easier. With DataStax's Change Data Capture (CDC) and Streaming support in the Astra Terraform provider, you can now configure CDC into target systems by simply running `terraform apply`. Read on to see it in action as we stream data from a Cassandra database to Snowflake.
Back in March of this year, DataStax released Change Data Capture (CDC) for DataStax Astra DB, the multi-cloud database-as-a-service built on Apache Cassandra®. In a nutshell, CDC gives you the ability to subscribe to changes in your Astra DB tables and forward them to several target systems in real time.
Now, we're announcing CDC and Streaming support for the Astra Terraform provider. By simply running `terraform apply`, you can configure CDC into other systems.
In this post, we’ll give you a peek under the hood and an example of using it to send data downstream from Cassandra to Snowflake. You can also see the full example of how to enable DataStax CDC for Snowflake with Terraform on GitHub.
Architectural implications
Many Cassandra users have long been forced to publish data to a queue before sending it on to multiple systems. To make this work, they had to configure, manage, and maintain their own queueing system or message bus, along with their own sinks.
Now, Cassandra users leveraging Astra DB can pick their target system(s) of choice and forward their live data downstream with a simple `terraform apply`. This allows users to write directly from their app to their system of record (Cassandra/Astra DB) rather than to a durable event stream, which introduces complex asynchronous guarantees and eventual consistency.
In this new CDC-enabled architecture, you can leverage DataStax Astra Streaming to forward the data to your downstream systems. Astra Streaming is a managed service for Apache Pulsar™ with no operational burden, providing built-in sinks for many of the most common target systems. Terraform will take care of the end-to-end cloud infrastructure and configuration, including Astra DB, CDC, Astra Streaming, and the target systems themselves.
The integration with Terraform
We added support in the Astra Terraform provider for Astra Streaming and CDC to enable scenarios that can be easily deployed to the cloud, like the example below.
Terraform is an infrastructure as code (IaC) system that allows users to spin up and configure cloud resources based on a declarative description of their infrastructure. For example, you can define an Astra database in your `resources.tf` file.
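Here's a minimal sketch of what that definition can look like. The database name, keyspace, cloud provider, and region values are placeholders, and exact attribute names can differ between provider versions:

```hcl
terraform {
  required_providers {
    astra = {
      source = "datastax/astra"
    }
  }
}

# A minimal Astra DB database definition; all values are illustrative.
resource "astra_database" "dev" {
  name           = "mydb"
  keyspace       = "ks"
  cloud_provider = "gcp"
  regions        = ["us-east1"]
}
```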
A Terraform resource represents a piece of cloud infrastructure. As you can see in the resource above, the `astra_database` resource has an ID (`dev`) along with a name, a keyspace, a cloud provider, and a region. Once you've installed Terraform, you can run `terraform apply` from the same directory to create and configure the database.
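If this is a fresh working directory, a typical session with the standard Terraform CLI looks like this:

```bash
terraform init   # downloads the providers declared in resources.tf
terraform plan   # previews the changes Terraform will make
terraform apply  # creates and configures the database
```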
Streaming data to Snowflake from Cassandra
Why would Cassandra users want to send their data to other systems in the first place? It shouldn't be surprising: the Online Transaction Processing (OLTP) data in Cassandra that drives online applications is often useful for other applications as well.
In this case, let’s imagine a user wants to perform OLAP/analytical queries on their Cassandra data using Snowflake. In addition to simply creating a database, you can also use CDC to send the data to Snowflake with Terraform.
But, if you were to set this up manually you’d have to:
- Create a Snowflake database.
- Create a Snowflake warehouse.
- Create a Snowflake schema.
- Create an Astra Streaming tenant.
- Create an Astra table.
- Enable CDC using your Astra Streaming tenant.
- Create an offset topic for the Snowflake Kafka Connect sink.
- Create the streaming sink itself configured to send the data to Snowflake.
Instead, you can just run Terraform. For this, we first need to get our credentials in order. It's always a good approach to assign environment variables in a `.env` file and run `source .env` before your Terraform commands (just make sure you don't commit the file to a public repository).
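As a sketch, such a `.env` file might look like the following. The Astra provider reads `ASTRA_API_TOKEN` and the Snowflake provider reads the `SNOWFLAKE_*` variables; all values below are placeholders:

```bash
# .env - keep this file out of version control
export ASTRA_API_TOKEN="AstraCS:..."
export SNOWFLAKE_ACCOUNT="xy12345"
export SNOWFLAKE_USER="terraform_user"
export SNOWFLAKE_PASSWORD="..."
export SNOWFLAKE_REGION="us-east-1"
```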
Having done that, we can create our `resources.tf` file and run `terraform apply`.
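Below is a condensed sketch of that file, covering the eight steps listed above. Resource and attribute names follow the `datastax/astra` and `Snowflake-Labs/snowflake` providers but may differ across provider versions, the connector configuration is abbreviated, and all names and values are illustrative; see the full example on GitHub for a complete, working configuration:

```hcl
terraform {
  required_providers {
    astra     = { source = "datastax/astra" }
    snowflake = { source = "Snowflake-Labs/snowflake" }
  }
}

# Snowflake database, warehouse, and schema to receive the data
resource "snowflake_database" "db" {
  name = "ASTRA_CDC_DEMO"
}

resource "snowflake_warehouse" "wh" {
  name           = "ASTRA_CDC_WH"
  warehouse_size = "xsmall"
}

resource "snowflake_schema" "schema" {
  database = snowflake_database.db.name
  name     = "ASTRA_CDC_SCHEMA"
}

# Astra Streaming tenant that carries the change events
resource "astra_streaming_tenant" "tenant" {
  tenant_name    = "cdc-snowflake-demo"
  topic          = "demo-topic"
  region         = "useast-4"
  cloud_provider = "gcp"
  user_email     = "you@example.com"
}

# Astra DB table to capture changes from (a and b are keys, c is a value column)
resource "astra_table" "tbl" {
  database_id        = astra_database.dev.id
  region             = "us-east1"
  keyspace           = "ks"
  table              = "tbl"
  partition_keys     = "a"
  clustering_columns = "b"
  column_definitions = [
    { Name = "a", TypeDefinition = "text", Static = "false" },
    { Name = "b", TypeDefinition = "text", Static = "false" },
    { Name = "c", TypeDefinition = "text", Static = "false" },
  ]
}

# Enable CDC on the table, publishing changes to the streaming tenant
resource "astra_cdc" "cdc" {
  database_id      = astra_database.dev.id
  database_name    = astra_database.dev.name
  keyspace         = "ks"
  table            = astra_table.tbl.table
  topic_partitions = 3
  tenant_name      = astra_streaming_tenant.tenant.tenant_name
}

# Offset topic for the Snowflake Kafka Connect sink
resource "astra_streaming_topic" "offsets" {
  tenant_name    = astra_streaming_tenant.tenant.tenant_name
  namespace      = "default"
  topic          = "snowflake-offsets"
  region         = "useast-4"
  cloud_provider = "gcp"
}

# The Snowflake sink itself, reading from the CDC data topic
resource "astra_streaming_sink" "snowflake" {
  tenant_name           = astra_streaming_tenant.tenant.tenant_name
  topic                 = astra_cdc.cdc.data_topic # attribute name may vary by version
  region                = "useast-4"
  cloud_provider        = "gcp"
  sink_name             = "snowflake"
  archive               = "builtin://snowflake"
  auto_ack              = true
  parallelism           = 3
  processing_guarantees = "ATLEAST_ONCE"
  retain_ordering       = false
  sink_configs = jsonencode({
    # Snowflake connector properties go here (URL, user, private key,
    # database, schema, offset topic, ...); see the GitHub example.
    offsetStorageTopic = astra_streaming_topic.offsets.topic
  })
}
```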
Note: The Snowflake user you provide for the Astra Streaming sink must have the proper permissions to create and write to Snowflake tables in the database and schema provided. Here’s an example of how to set up a role that provides these grants:
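The sketch below uses placeholder role, user, database, and schema names, and mirrors the grants the Snowflake Kafka connector documentation calls for:

```sql
-- Illustrative role setup; substitute your own names.
CREATE ROLE IF NOT EXISTS astra_cdc_role;
GRANT USAGE ON DATABASE astra_cdc_demo TO ROLE astra_cdc_role;
GRANT USAGE ON SCHEMA astra_cdc_demo.astra_cdc_schema TO ROLE astra_cdc_role;
GRANT CREATE TABLE ON SCHEMA astra_cdc_demo.astra_cdc_schema TO ROLE astra_cdc_role;
GRANT CREATE STAGE ON SCHEMA astra_cdc_demo.astra_cdc_schema TO ROLE astra_cdc_role;
GRANT CREATE PIPE ON SCHEMA astra_cdc_demo.astra_cdc_schema TO ROLE astra_cdc_role;
GRANT ROLE astra_cdc_role TO USER terraform_user;
ALTER USER terraform_user SET DEFAULT_ROLE = astra_cdc_role;
```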
Writing data and analyzing results
At this stage, you can start writing to your Astra DB table. You'll see the resulting data flow into Snowflake, where you can start performing analytical queries on it. Paste the following queries into the CQL console (CQLSH) in your Astra DB user interface:
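Assuming the `ks.tbl` table sketched above (partition key `a`, clustering column `b`, and value column `c`), hypothetical inserts might look like:

```sql
INSERT INTO ks.tbl (a, b, c) VALUES ('a1', 'b1', 'c1');
INSERT INTO ks.tbl (a, b, c) VALUES ('a2', 'b2', 'c2');
INSERT INTO ks.tbl (a, b, c) VALUES ('a3', 'b3', 'c3');
```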
The data will then be written to dynamically created tables in Snowflake. Note that the number of tables will vary depending on the parallelism of your sink.
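One way to discover the generated names is to list the tables in the target schema (the schema name below is a placeholder):

```sql
SHOW TABLES IN SCHEMA astra_cdc_demo.astra_cdc_schema;
```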
Next, you’re ready to query the data in Snowflake. Below is a sample query, but remember to replace your table names with the ones your sink generated:
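This sketch assumes a sink parallelism of 3, with `tbl_one`, `tbl_two`, and `tbl_three` standing in for the generated table names:

```sql
SELECT record_metadata:key:a AS a,
       record_metadata:key:b AS b,
       record_content:c      AS c
FROM tbl_one
  NATURAL FULL OUTER JOIN tbl_two
  NATURAL FULL OUTER JOIN tbl_three;
```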
The results will show the rows from all of the sink-generated tables glued together into a single result set.
Note that in this example, we used natural full outer joins to glue the three tables together, and we extracted primary key values from `record_metadata` using `record_metadata:key:a as a`. For non-primary-key columns, we pull from the record content instead, with something like `record_content:c as c`.
Conclusion
You're now equipped to CDC your Astra DB data to Snowflake by simply running `terraform apply`! In addition to Snowflake, Astra Streaming supports built-in sinks for multiple systems, including Cassandra, Elasticsearch, ClickHouse, MariaDB, PostgreSQL, SQLite, Kafka, and Kinesis. Stay tuned for more detailed posts on how to use other sinks!
Subscribe to the DataStax Tech Blog for more developer stories. Check out our YouTube channel for free tutorials and follow DataStax Developers on Twitter for the latest news about our developer community.