How DataStax Enterprise Analytics Simplifies Migrating to DataStax Astra DB

DataStax Enterprise (DSE), the hybrid cloud NoSQL database built on Apache Cassandra®, integrates real-time and batch operational analytics capabilities with an enhanced version of Apache Spark™. In this post, we introduce you to this Spark tool and show you how you can use it to seamlessly migrate your application data from DSE to Astra DB.

Shifting to a true cloud-native open data stack built on DataStax Astra DB and Astra Streaming can unlock game-changing capabilities and help you fully harness the power of data. With DSE Analytics, you can quickly generate ad-hoc reports, target customers with personalization, and process real-time streams of data. (You can learn about the benefits of DSE Analytics here.)

Enterprises already running DSE can easily achieve this migration with the help of DSE Analytics. This Spark tool, which we’ll walk through in this post, lets you connect directly to your Astra DB database from DSE Analytics without provisioning a separate Spark cluster or worrying about additional setup.

To show you how it works, we’ll use an example application that we created specifically to move data from DSE to Astra DB using DSE Analytics. With simple changes, though, you can alter it to reverse the flow or to compare data between the two data stores.

Prerequisites

Before you get started, there are a few small but important steps you need to take care of to prepare for your lift-off to Astra DB.

Dual writes

Setting up dual writes prior to your migration will enable you to perform a zero-downtime migration from DSE to Astra DB. This isn’t a necessity, though, and it’s beyond the scope of this post. If you’re interested in exploring how to set up dual writes, take a look at our blog post on how to move to Astra DB with zero downtime.

Download example Spark code

These instructions reference sample code that can be found here. The sample code is set up to use DSE 6.8.18; if you’re running a different version of DSE, update the DSE version in the build.sbt file located in the root directory of the project.
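The build.sbt itself isn’t reproduced in this post. As a rough sketch, in a typical DSE Spark project the version you’d change looks something like the following; the `dse-spark-dependencies` artifact coordinates are an assumption based on common DSE builds, so check the sample repo’s actual build.sbt for the exact form:

```scala
// Sketch only: artifact coordinates are an assumption; the sample project's
// build.sbt is the source of truth.
val dseVersion = "6.8.18" // change this to match your DSE version

// DSE publishes its Spark dependencies as a single "provided" artifact
libraryDependencies += "com.datastax.dse" % "dse-spark-dependencies" % dseVersion % "provided"
```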

Validate your DSE version

To connect to Astra DB with Spark via DSE Analytics, you’ll need to install, run, or upgrade to one of the minimum DSE versions listed below. These versions contain DSP-21510, which enables the connection to Astra DB:

Create your Astra DB database

Make sure that your Astra DB database has been created and is ready to accept data. If you need help, here’s how to create an Astra DB database. Once this is confirmed, you’ll need to create the appropriate table definitions in your Astra DB instance for your migration.

You can do this via the CQL console in the Astra DB UI or by using the REST or GraphQL APIs. For the purpose of these instructions, we’ll use the following example schema, but you could use any schema for the migration with this procedure:

Keyspace: test_spark_migration

Table: data_table
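The example schema itself didn’t survive in this copy of the post. A minimal sketch matching the keyspace and table names above would be the following, where the `id` and `value` columns are hypothetical placeholders:

```sql
-- Hypothetical columns; only the keyspace and table names come from the post.
CREATE KEYSPACE IF NOT EXISTS test_spark_migration
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS test_spark_migration.data_table (
  id uuid PRIMARY KEY,
  value text
);
```

Note that on Astra DB the keyspace is created through the UI rather than via CQL, so only the CREATE TABLE statement would run in the Astra CQL console; the full script applies on the DSE side.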

Download the Secure Connect Bundle

You will need to download the Secure Connect Bundle from Astra DB to connect to your Astra DB instance via the Spark Cassandra Connector (SCC), which is included with your distribution of DSE. By following the red numbers in the diagram below, you’ll generate the Secure Connect Bundle for download. Upload this to your DSE Analytics cluster and note the absolute path for later.

Figure 1. Procedure to download the Secure Connect Bundle (SCB) from Astra DB GUI.

Generate your application token

To generate the application token, follow the diagram below. We’ll leave the credentials and tokens unobscured to keep this guide simple, but we’ll delete the database and tokens when we’re done.

Figure 2. Navigate to the Application Token creation page on the Astra DB GUI.
Figure 3. Procedure to create and download the Application Token from Astra DB GUI.

Once you’ve downloaded your Application Token CSV, go to the next step.

Configuring the Spark code

Once you have the necessary information, you can begin altering the Scala code in Migration.scala. All of our changes will happen in the following location:

Each of the sparkConfAstra options will be replaced with information from your Application Token. Using the information from the previous examples, the completed configuration would look like this:
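The completed configuration block isn’t preserved in this copy of the post, so here is a hedged sketch of what the sparkConfAstra options could look like, using standard Spark Cassandra Connector settings. The `sparkConfAstra` name comes from the post; the values shown are placeholders you’d replace with the clientId and secret from your Application Token CSV and the Secure Connect Bundle path noted earlier:

```scala
import org.apache.spark.SparkConf

// Sketch only: replace the placeholder values with your own credentials.
val sparkConfAstra = new SparkConf()
  .setAppName("dse-astra-migration")
  // Absolute path to the Secure Connect Bundle uploaded to the DSE cluster
  .set("spark.cassandra.connection.config.cloud.path", "/path/to/secure-connect-bundle.zip")
  // clientId and secret come from the Application Token CSV
  .set("spark.cassandra.auth.username", "<clientId>")
  .set("spark.cassandra.auth.password", "<secret>")
```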

The keyspace and table won’t be changed in this guide, but make sure you change them if you plan to use a different keyspace/table combination.

Compiling the Spark jar

Once the changes to Migration.scala and build.sbt have been made, you’re ready to compile your Spark jar.

To do so, run the following command from the root of the project:

sbt clean assembly

The compiled Migration.jar will reside in the root of the project at:

dse-astra-spark-migration/target/scala-2.11/Migration.jar

Running the migration

Now that the jar is compiled, we can carry out the migration. To do so, execute the following command from your DSE cluster:
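The exact command isn’t preserved in this copy of the post; a plausible shape for it, assuming the jar path from the previous step and a main class named `Migration`, might be the following (flag names are an assumption, and the bracketed placeholders are explained below):

```shell
# Sketch only: flag names and argument order may differ from the original command.
dse -u [username] -p [password] spark-submit \
  --class Migration \
  --total-executor-cores [num cores] \
  --num-executors [num executors] \
  --executor-memory [GB of memory] \
  dse-astra-spark-migration/target/scala-2.11/Migration.jar [SCB Path]
```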

  • [username] = Source DSE username
  • [password] = Source DSE password
  • [num cores] = Int value for number of cores
  • [num executors] = Int value for number of executors
  • [GB of memory] = Int value for GB of memory. Note that the “G” suffix must be included (for example, 4G).
  • [SCB Path] = Absolute path to the secure connect bundle provided by Astra DB.

Note that the [num cores], [num executors], and the [GB of memory] values will be determined based on the resources available for your DSE Analytics enabled cluster.

What’s next?

In this post, you’ve learned how to migrate your application data from DSE to Astra DB using the powerful, built-in DSE Analytics. So where do you go from here? If you haven’t already, register for your free Astra DB account and get up to 80GB each month to play around with. Then, check out the resources below to learn what you can do with Astra DB and discover other handy technologies as you build your open data stack.

Follow DataStax on Medium for exclusive posts on all things open source, including Spark, Cassandra, streaming, Kubernetes, and more. To join a buzzing community of developers from around the world and stay in the data loop, follow DataStaxDevs on Twitter and LinkedIn.

Resources

  1. Tour of Astra DB interface
  2. Explore our Sample App Gallery
  3. Build sample apps with Astra DB
  4. Astra DB integrations
  5. CDC for Apache Cassandra
  6. DataStax Astra Streaming
  7. Connecting to your database using Stargate APIs or CQL Drivers
  8. Spark Cassandra Connector-specific properties in DSE Analytics
  9. DSE Spark Connector API documentation
  10. DSE 6.8 Developer Guide
  11. DSE 5.1 Developer Guide
