Migrate data from Cassandra to GCS Using Java and Dataproc Serverless

Shrutimalgatti
Google Cloud - Community
3 min read · Dec 6, 2023


In this post, we will explore how to extract data from Cassandra to GCS using Dataproc Serverless pre-built templates.

Dataproc Templates provide a flexible and easy-to-use mechanism for managing and executing common use cases on Dataproc Serverless without the need to develop them from scratch.

Cassandra is an open-source, decentralized NoSQL database designed to handle vast volumes of data across multiple commodity servers, providing high availability and horizontal scalability without a single point of failure.
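
To make the template properties used later concrete, here is a minimal, hypothetical source schema created through cqlsh (assuming cqlsh is available on a machine that can reach the cluster); the keyspace demo_ks and table orders are placeholders reused in the examples further down, not anything the template requires.

# Hypothetical source schema, shown only so the keyspace/table
# properties below have something concrete to point at
cqlsh <cassandra-host-ip> -e "
CREATE KEYSPACE IF NOT EXISTS demo_ks
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
CREATE TABLE IF NOT EXISTS demo_ks.orders (
  order_id uuid PRIMARY KEY,
  customer text,
  amount decimal
);"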

Key Benefits

  1. These templates are open source and can be used by anyone for their workload migration.
  2. These templates are customisable: the GitHub repository can be cloned very easily and adapted to your requirements with the corresponding code changes.
  3. Dataproc Serverless design frees up the developer from the headache of managing a Dataproc cluster.
  4. Supported file formats are JSON, Avro, Parquet and CSV.
  5. These templates are configuration driven and can be used for similar use cases very easily by just changing the connection parameters.

Pre-requisites

For running these templates, we will need:

  • The Google Cloud SDK installed and authenticated
  • A VPC subnet with Private Google Access enabled. The default subnet is suitable, as long as Private Google Access is enabled.
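
As a quick optional sanity check, you can confirm the SDK is authenticated and pointed at the right project with standard gcloud commands:

# Show the active account and the project the SDK will use
gcloud auth list
gcloud config get-value project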

Steps to Run the Template

  1. Create a GCS bucket to use as the staging location (a sample command follows the clone step below).
  2. Clone the git repo in Cloud Shell, which comes pre-installed with the required tools. Alternatively, use any machine with JDK 8+, Maven and Git installed.
git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
cd dataproc-templates/java
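
The staging bucket from Step 1 can be created with gsutil; the bucket name and region below are placeholders, so substitute your own:

# Create a regional bucket to use as the staging location
gsutil mb -l us-central1 gs://my-staging-bucket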

3. Ensure the subnet has Private Google Access enabled. Even if you are using the “default” VPC created by GCP, you still have to enable Private Google Access, as below.

gcloud compute networks subnets update default --region=us-central1 --enable-private-ip-google-access
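
To confirm the change took effect, you can describe the subnet (an optional check; it should print True):

gcloud compute networks subnets describe default \
  --region=us-central1 \
  --format="value(privateIpGoogleAccess)"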

4. Authenticate the gcloud CLI.

gcloud auth login

5. Configure the Dataproc Serverless job by exporting the variables needed for submission:

GCP_PROJECT : GCP project id to run Dataproc Serverless on

REGION : Region to run Dataproc Serverless in

GCS_STAGING_LOCATION : GCS staging bucket location, created in Step 1

SUBNET : The VPC subnet to run Dataproc Serverless on, if not using the default subnet (format: projects/<project_id>/regions/<region>/subnetworks/<subnetwork>)

export REGION=<gcp-region>
export GCP_PROJECT=<gcp-project-id>
export GCS_STAGING_LOCATION=<gcs-staging-location>
export JOB_TYPE=SERVERLESS
export SUBNET=<dataproc-serverless-subnet>
bin/start.sh \
-- --template CASSANDRATOGCS \
--templateProperty project.id=<gcp-project-id> \
--templateProperty cassandratogcs.input.keyspace=<keyspace-name> \
--templateProperty cassandratogcs.input.table=<input-table-name> \
--templateProperty cassandratogcs.input.host=<cassandra-host-ip> \
--templateProperty cassandratogcs.output.format=<avro|csv|parquet|json|orc> \
--templateProperty cassandratogcs.output.savemode=<Append|Overwrite|ErrorIfExists|Ignore> \
--templateProperty cassandratogcs.output.path=<gcs-output-path>
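
For concreteness, here is the same submission with hypothetical values filled in; the project, subnet, host IP, buckets, and the demo_ks.orders table from earlier are all placeholders:

export REGION=us-central1
export GCP_PROJECT=my-project
export GCS_STAGING_LOCATION=gs://my-staging-bucket
export JOB_TYPE=SERVERLESS
export SUBNET=projects/my-project/regions/us-central1/subnetworks/default
bin/start.sh \
-- --template CASSANDRATOGCS \
--templateProperty project.id=my-project \
--templateProperty cassandratogcs.input.keyspace=demo_ks \
--templateProperty cassandratogcs.input.table=orders \
--templateProperty cassandratogcs.input.host=10.128.0.5 \
--templateProperty cassandratogcs.output.format=avro \
--templateProperty cassandratogcs.output.savemode=Overwrite \
--templateProperty cassandratogcs.output.path=gs://my-output-bucket/cassandra-export/orders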

Arguments:

  • templateProperty cassandratogcs.input.keyspace : keyspace name defined in the Cassandra DB
  • templateProperty cassandratogcs.input.table : table name in Cassandra to query
  • templateProperty cassandratogcs.input.host : Cassandra host IP address
  • templateProperty cassandratogcs.output.format : GCS output file format (one of: avro, csv, parquet, json, orc)
  • templateProperty cassandratogcs.output.path : GCS location to write output files (format: gs://BUCKET/...)
  • templateProperty cassandratogcs.output.savemode : output write mode (one of: Append, Overwrite, ErrorIfExists, Ignore)

Additional optional parameters that can be passed are below:

--templateProperty cassandratogcs.input.catalog.name=<catalog-name>
--templateProperty cassandratogcs.input.query="select * from <catalog-name>.<keyspace-name>.<table-name>"

  • templateProperty cassandratogcs.input.catalog.name : Cassandra connection name (default value: casscon)
  • templateProperty cassandratogcs.input.query : custom query to extract data, with the table referenced as <catalog-name>.<keyspace-name>.<table-name>

6. Monitor the Spark batch job

After submitting the job, you will be able to view it in the Dataproc Batches UI, where you can view both metrics and logs for the run.
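
The same information is available from the gcloud CLI; the region, batch ID and output path below are placeholders:

# List recent batches, then drill into one for its status
gcloud dataproc batches list --region=us-central1
gcloud dataproc batches describe <batch-id> --region=us-central1

# Once the batch succeeds, the exported files appear under the output path
gsutil ls gs://my-output-bucket/cassandra-export/orders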

References

  • Dataproc Templates repository: https://github.com/GoogleCloudPlatform/dataproc-templates

For any queries or suggestions reach out to: dataproc-templates-support-external@googlegroups.com
