Migrating data from MongoDB to GCS using Java and the Dataproc Serverless template

Shrutimalgatti
Google Cloud - Community
Nov 20, 2023

Dataproc is a managed Spark and Hadoop service that lets you take advantage of open-source data tools for batch processing, querying, streaming, and machine learning.

The Dataproc Templates provide a flexible and easy-to-use mechanism for managing and executing common use cases on Dataproc Serverless without having to develop them yourself.

These templates implement common Spark workloads and let you customize and run them easily.

Prerequisites

For running these templates, we will need:

  • The Google Cloud SDK installed and authenticated
  • A VPC subnet with Private Google Access enabled. The default subnet is suitable, provided Private Google Access is enabled on it.

In this post, we will explore how to extract data from MongoDB to GCS using the pre-built Dataproc Serverless template.

MongoDB is an open-source document database built on a horizontal scale-out architecture that uses a flexible schema for storing data.

Key Benefits

  1. These templates are open source and can be used by anyone for their workload migration.
  2. These templates are customizable. The GitHub repository can easily be cloned and adapted to your requirements by making the corresponding code changes.
  3. The Dataproc Serverless design frees the developer from the overhead of managing a Dataproc cluster.
  4. Supported file formats are JSON, Avro, Parquet, and CSV.
  5. These templates are configuration driven and can be used for similar use cases very easily by just changing the connection parameters.

Required JAR files

The template requires the MongoDB Spark Connector and the MongoDB Java Driver on its classpath.
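These can be downloaded from Maven Central before being staged in GCS. A minimal sketch, assuming the versions referenced later in this post; verify the exact artifact paths and pick versions compatible with your Spark runtime:

# Download the dependency JARs from Maven Central.
# Versions match the example submission below; verify paths before relying on them.
wget https://repo1.maven.org/maven2/org/mongodb/mongo-java-driver/3.9.1/mongo-java-driver-3.9.1.jar
wget https://repo1.maven.org/maven2/org/mongodb/spark/mongo-spark-connector/10.0.5/mongo-spark-connector-10.0.5.jar
wget https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients/2.8.0/kafka-clients-2.8.0.jar
wget https://repo1.maven.org/maven2/org/apache/commons/commons-pool2/2.6.2/commons-pool2-2.6.2.jar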

Steps to Run the Template

  1. Create a GCS bucket to use as the staging location and upload the required JAR files to it (a minimal sketch follows step 2's commands).
  2. Clone the git repo in Cloud Shell, which comes pre-installed with various tools. Alternatively, use any machine with JDK 8+, Maven, and Git installed.
git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
cd dataproc-templates/java
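For step 1, a minimal sketch of creating the staging bucket and uploading the JARs; the bucket name and region below are placeholders, so substitute your own:

# Create a staging bucket; bucket names must be globally unique.
gsutil mb -l us-central1 gs://my-staging-bucket
# Stage the dependency JARs; these paths feed the JARS variable in step 5.
gsutil cp mongo-java-driver-3.9.1.jar mongo-spark-connector-10.0.5.jar kafka-clients-2.8.0.jar commons-pool2-2.6.2.jar gs://my-staging-bucket/jars/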

3. Ensure Private Google Access is enabled on your subnet. Even if you are using the “default” VPC created by GCP, you will still have to enable it, as shown below.

gcloud compute networks subnets update default --region=us-central1 --enable-private-ip-google-access
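To confirm the setting took effect, you can describe the subnet; the field below should read True:

# Check that Private Google Access is enabled on the subnet.
gcloud compute networks subnets describe default --region=us-central1 --format="get(privateIpGoogleAccess)"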

4. Authenticate the gcloud CLI credentials.

gcloud auth login
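If you work across multiple projects, you may also want to point gcloud at the right one explicitly (the project ID is a placeholder):

# Set the active project for subsequent gcloud commands.
gcloud config set project <gcp-project-id>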

5. Configure the Dataproc Serverless job by exporting the variables needed for submission:

GCP_PROJECT : GCP project id to run Dataproc Serverless on

REGION : Region to run Dataproc Serverless in

GCS_STAGING_LOCATION : GCS staging bucket location, created in Step 1

SUBNET : The VPC subnet to run Dataproc Serverless on, if not using the default subnet (format: projects/<project_id>/regions/<region>/subnetworks/<subnetwork>)

export GCP_PROJECT=<gcp-project-id>
export SUBNET=<dataproc-serverless-subnet>
export GCS_STAGING_LOCATION=<gcs-staging-location>
export REGION=<gcp-region>
export JARS="gs://{jar-bucket}/mongo_dependencies_mongo-java-driver-3.9.1.jar,gs://{jar-bucket}/kafka-clients-2.8.0.jar,gs://{jar-bucket}/commons-pool2-2.6.2.jar,gs://{jar-bucket}/mongo_dependencies_mongo-spark-connector-10.0.5.jar"
./bin/start.sh \
-- --template=MONGOTOGCS \
--templateProperty mongo.gcs.input.uri=<mongo-uri> \
--templateProperty mongo.gcs.input.database=<input-database> \
--templateProperty mongo.gcs.input.collection=<input-collection> \
--templateProperty mongo.gcs.output.format=<avro|parquet|csv|json> \
--templateProperty mongo.gcs.output.location=<gcs-output-location> \
--templateProperty mongo.gcs.output.mode=<append|overwrite|ignore|errorifexists>

Arguments:

  • templateProperty mongo.gcs.input.uri: MongoDB Connection String as an Input URI (format: mongodb://host_name:port_no)
  • templateProperty mongo.gcs.input.database: MongoDB Database Name (format: Database_name)
  • templateProperty mongo.gcs.input.collection: MongoDB Input Collection Name (format: Collection_name)
  • templateProperty mongo.gcs.output.format: GCS Output File Format (one of: avro,parquet,csv,json)
  • templateProperty mongo.gcs.output.location: GCS Location to put Output Files (format: gs://BUCKET/...)
  • templateProperty mongo.gcs.output.mode: Output write mode (one of: append,overwrite,ignore,errorifexists) (Defaults to append)
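For illustration, here is what a filled-in submission might look like. All values below (project, subnet, buckets, database, collection) are hypothetical placeholders, and the JAR paths must match whatever you staged in step 1:

export GCP_PROJECT=my-project
export REGION=us-central1
export GCS_STAGING_LOCATION=gs://my-staging-bucket
export SUBNET=projects/my-project/regions/us-central1/subnetworks/default
export JARS="gs://my-staging-bucket/jars/mongo-java-driver-3.9.1.jar,gs://my-staging-bucket/jars/kafka-clients-2.8.0.jar,gs://my-staging-bucket/jars/commons-pool2-2.6.2.jar,gs://my-staging-bucket/jars/mongo-spark-connector-10.0.5.jar"
./bin/start.sh \
-- --template=MONGOTOGCS \
--templateProperty mongo.gcs.input.uri=mongodb://10.128.0.7:27017 \
--templateProperty mongo.gcs.input.database=sales \
--templateProperty mongo.gcs.input.collection=orders \
--templateProperty mongo.gcs.output.format=avro \
--templateProperty mongo.gcs.output.location=gs://my-output-bucket/mongo-export \
--templateProperty mongo.gcs.output.mode=overwrite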

6. Monitor the Spark batch job

After submitting the job, you will be able to view it in the Dataproc Batches UI, where you can monitor both metrics and logs for the job.
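The same information is also available from the CLI. A quick sketch, assuming the region used earlier; the batch ID comes from the submission output or the list command:

# List recent Dataproc Serverless batches in the region.
gcloud dataproc batches list --region=us-central1

# Inspect the state, configuration, and output of a specific batch.
gcloud dataproc batches describe <batch-id> --region=us-central1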
