How DataStax Enterprise Analytics Simplifies Migrating to DataStax Astra DB
Author: Madhavan Sridharan
DataStax Enterprise (DSE), the hybrid cloud NoSQL database built on Apache Cassandra®, integrates real-time and batch operational analytics capabilities with an enhanced version of Apache Spark™. In this post, we introduce you to this Spark tool and show you how you can use it to seamlessly migrate your application data from DSE to Astra DB.
Shifting to a true cloud-native open data stack built on DataStax Astra DB and Astra Streaming can unlock game-changing capabilities and help you fully harness the power of data. With DSE Analytics, you can quickly generate ad-hoc reports, target customers with personalization, and process real-time streams of data. (You can learn about the benefits of DSE Analytics here.)
Enterprises already running DSE can achieve this migration easily with the help of DSE Analytics, which lets you make a direct connection from DSE Analytics to your Astra DB without provisioning a separate Spark cluster or worrying about additional setup.
To show you how it works, we'll use an example application that we created specifically to move data from DSE to Astra DB using DSE Analytics. With a few simple changes, you can also reverse the flow or compare data between the two data stores.
Prerequisites
Before you get started, there are a few small but important steps you need to take care of to prepare for your lift-off to Astra DB.
Dual writes
Setting up dual writes prior to your migration enables a zero-downtime migration from DSE to Astra DB. It isn't a necessity, though, and it's beyond the scope of this post. If you're interested in setting up dual writes, take a look at our blog post on how to move to Astra DB with zero downtime.
Download example Spark code
These instructions reference sample code that can be found here. The sample code is set to use DSE 6.8.18; if you're using a different version of DSE, update the DSE version in the build.sbt file located in the root directory of the project.
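As a sketch, the version bump typically amounts to changing a single dependency line in build.sbt. The organization and artifact coordinates below follow the standard DSE build setup but are assumptions here; check the file in the sample repo for the actual names:

```scala
// build.sbt (excerpt) -- hypothetical sketch; the sample repo's file
// may name the dependency differently.
val dseVersion = "6.8.18" // change this to match your DSE version

libraryDependencies +=
  "com.datastax.dse" % "dse-spark-dependencies" % dseVersion % "provided"
```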
Validate your DSE version
To connect to Astra DB with Spark via DSE Analytics, you'll need to install or upgrade to one of the minimum DSE versions listed below. These versions contain DSP-21510, which enables the connection to Astra DB:
Create your Astra DB database
Make sure that your Astra DB database has been created and is ready to accept data. If you need help, here’s how to create an Astra DB database. Once this is confirmed, you’ll need to create the appropriate table definitions in your Astra DB instances for your migration. You can do this via the CQL console in the Astra DB UI or by using the REST or GraphQL APIs. For the purpose of these instructions, we’ll use the following example schema, but you could leverage any schema for the migration using this procedure:
Keyspace: test_spark_migration
Table: data_table
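For reference, the table can be created from the Astra DB CQL console with a statement like the one below (the keyspace itself is created from the Astra DB UI). The column definitions here are hypothetical placeholders; use your own application's schema:

```sql
-- Run in the Astra DB CQL console after creating the
-- test_spark_migration keyspace from the Astra DB UI.
-- Column definitions are illustrative placeholders.
CREATE TABLE IF NOT EXISTS test_spark_migration.data_table (
    id    uuid,
    value text,
    PRIMARY KEY (id)
);
```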
Download the Secure Connect Bundle
To connect to your Astra DB instance via the Spark Cassandra Connector (SCC), which is included with your distribution of DSE, you'll need to download the Secure Connect Bundle. Follow the red numbers in the diagram below to generate the Secure Connect Bundle for download. Then upload it to your DSE Analytics cluster and note its absolute path for later.
Generate your application token
To generate the application token, follow the diagram below. We’ll leave the credentials and tokens unobscured to keep this guide simple, but we’ll delete the database and tokens when we’re done.
Once you’ve downloaded your Application Token CSV, go to the next step.
Configuring the Spark code
Once you have the necessary information, you can begin altering the Scala code in Migration.scala. The entirety of our changes will happen in the following location:
Each of the sparkConfAstra options will be replaced with information from our Application Token. Using the information from the previous examples, the completed configuration looks like this:
The keyspace and table won’t be changed in this guide, but make sure you change them if you plan to use a different keyspace/table combination.
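The shape of the Astra-side configuration can be sketched as follows. The option keys are the Spark Cassandra Connector's Astra connection properties; the values are placeholders, to be filled in from your Application Token CSV and the path where you uploaded the Secure Connect Bundle:

```scala
import org.apache.spark.SparkConf

// Sketch with placeholder values -- substitute the clientId/clientSecret
// from your Application Token CSV and the absolute path of the
// Secure Connect Bundle on your DSE Analytics cluster.
val sparkConfAstra = new SparkConf()
  .set("spark.cassandra.connection.config.cloud.path",
       "/path/to/secure-connect-bundle.zip")
  .set("spark.cassandra.auth.username", "<clientId>")
  .set("spark.cassandra.auth.password", "<clientSecret>")

// Keyspace/table combination used throughout this guide.
val keyspace = "test_spark_migration"
val table    = "data_table"
```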
Compiling the Spark jar
Once the changes to the Migration.scala and build.sbt files have been made, you're ready to compile your Spark jar.
To do so, run the following command from the root of the project:
sbt clean assembly
The compiled Migration.jar will reside in the project at:
dse-astra-spark-migration/target/scala-2.11/Migration.jar
Running the migration
Now that the jar is compiled, we can carry out the migration. To do so, execute the following command from your DSE cluster:
- [username] = Source DSE username
- [password] = Source DSE password
- [num cores] = Int value for number of cores
- [num executors] = Int value for number of executors
- [GB of memory] = Int value for GB of memory. Note that the "G" suffix needs to be supplied (for example, 4G).
- [SCB Path] = Absolute path to the secure connect bundle provided by Astra DB.
Note that the [num cores], [num executors], and [GB of memory] values will depend on the resources available in your DSE Analytics-enabled cluster.
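Putting the placeholders together, the submit invocation might take a shape roughly like the following. This is an illustration only: the main class name and the exact resource flags are guesses, so defer to the sample project's README for the real command:

```shell
# Hypothetical sketch -- substitute your own values for the bracketed
# parameters and verify the main class name against the sample project.
dse -u [username] -p [password] spark-submit \
  --class Migration \
  --executor-cores [num cores] \
  --num-executors [num executors] \
  --executor-memory [GB of memory] \
  --files [SCB Path] \
  dse-astra-spark-migration/target/scala-2.11/Migration.jar
```

Here `--files` is one common way to distribute the Secure Connect Bundle to the executors; the sample project may reference the bundle path differently.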
What’s next?
In this post, you’ve learned how to migrate your application data from DSE to Astra DB using the powerful, built-in DSE Analytics. So where do you go from here? If you haven’t already, register for your free Astra DB account and get up to 80GB each month to play around with. Then, check out the resources below to learn what you can do with Astra DB and discover other handy technologies as you build your open data stack.
Follow DataStax on Medium for exclusive posts on all things open source, including Spark, Cassandra, streaming, Kubernetes, and more. To join a buzzing community of developers from around the world and stay in the data loop, follow DataStaxDevs on Twitter and LinkedIn.
Resources
- Tour of Astra DB interface
- Explore our Sample App Gallery
- Build sample apps with Astra DB
- Astra DB integrations
- CDC for Apache Cassandra
- DataStax Astra Streaming
- Connecting to your database using Stargate APIs or CQL Drivers
- Spark Cassandra Connector-specific properties in DSE Analytics
- DSE Spark Connector API documentation
- DSE 6.8 Developer Guide
- DSE 5.1 Developer Guide