Importing data from GCS to MongoDB using Dataproc Serverless

Hitesh Hasija
Google Cloud - Community
3 min read · Jul 27, 2022

Managing servers while running Spark jobs is always a challenge. Fully managed, on-demand infrastructure for Spark jobs is what today's workloads need: it lets developers concentrate on core application logic instead of spending time managing the framework. Dataproc Serverless is one such product from Google Cloud Platform.

The world is moving towards cloud-based storage services, which has driven the adoption of Google Cloud Storage (GCS) buckets. It is very easy to store data in GCS buckets regardless of file format, and it is a very cost-effective way to store huge data files, especially when the data runs into terabytes.

NoSQL databases are in high demand these days, and MongoDB is one of the most popular among them. MongoDB is a document-oriented database that stores data as BSON (Binary JSON).

One of the main reasons for using MongoDB is to handle unstructured data at scale. Hence, importing data from cloud storage into MongoDB is a very common use case. But what if the input format is not JSON? In such scenarios, this article will help you import data from GCS buckets into a MongoDB collection via Dataproc Serverless, irrespective of the file format.

Key Benefits

  1. Use Dataproc Serverless to run Spark batch workloads without managing the Spark framework. The batch size is also configurable in this template.
  2. The GCSToMONGO template is open source, configuration driven, and ready to use. Only MongoDB and GCS credentials are required to execute the code.
  3. Supported file formats are JSON, Avro, Parquet and CSV.

Usage

1. Create a GCS bucket and staging location for jar files.
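
For example, a staging bucket can be created with gsutil (the bucket name and region below are placeholders; use your own):

gsutil mb -l us-central1 gs://my-staging-bucket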

2. Clone the git repo in Cloud Shell, which comes pre-installed with various tools. Alternatively, use any machine pre-installed with JDK 8+, Maven and Git.

git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
cd dataproc-templates/python

3. Obtain authentication credentials (to submit the job).

gcloud auth application-default login
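
Optionally, you can also point gcloud at the project you intend to use (my-gcp-project below is a placeholder):

gcloud config set project my-gcp-project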

4. Execute the GCSToMONGO template.
E.g.:

export GCP_PROJECT=my-gcp-project
export REGION=us-central1
export GCS_STAGING_LOCATION=gs://staging-bucket
export JARS="gs://jar_location/mongo-java-driver-3.9.1.jar,gs://jar_location/mongo-spark-connector_2.12-2.4.0.jar"
./bin/start.sh \
-- --template=GCSTOMONGO \
--gcs.mongo.input.format="avro" \
--gcs.mongo.input.location="gs://GCS_Bucket_Name/empavro" \
--gcs.mongo.output.uri="mongodb://1.2.3.45:27017" \
--gcs.mongo.output.database="demo" \
--gcs.mongo.output.collection="analysis" \
--gcs.mongo.output.mode="overwrite"
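
Once the batch job finishes, you can sanity-check the load from any machine that can reach the MongoDB instance, for example with mongosh (or the legacy mongo shell). The URI, database and collection below simply reuse the sample values from the command above:

mongosh "mongodb://1.2.3.45:27017" --eval 'db.getSiblingDB("demo").analysis.countDocuments()'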

NOTE: It will ask you to enable the Dataproc API, if it is not already enabled.

Schedule the batch job

GCP natively provides Cloud Scheduler and Cloud Functions, which can be used to submit Spark batch jobs on a schedule. Alternatively, self-managed tools such as Linux cron or Jenkins can be used as well.
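
As a minimal self-managed sketch, a cron entry could invoke a small wrapper script that exports the environment variables shown above and calls start.sh (the script path and schedule here are hypothetical):

# m h dom mon dow  command — run the import nightly at 2 AM
0 2 * * * /home/user/dataproc-templates/python/run_gcs_to_mongo.sh >> /tmp/gcs_to_mongo.log 2>&1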

Optimising performance with large-scale data

Dealing with large-scale data (in terabytes) is always a challenge for a data engineer, but this template provides a lever for that as well. There is an optional parameter, batch size, which can be tuned according to the data size. The batch size controls how many documents are written to MongoDB in a single bulk operation; the connector's default is 512.

--gcs.mongo.batch.size=512
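
For example, a larger batch size can be passed by appending the flag to the same invocation shown earlier (1024 here is purely illustrative; the other values repeat the earlier example):

./bin/start.sh \
-- --template=GCSTOMONGO \
--gcs.mongo.input.format="avro" \
--gcs.mongo.input.location="gs://GCS_Bucket_Name/empavro" \
--gcs.mongo.output.uri="mongodb://1.2.3.45:27017" \
--gcs.mongo.output.database="demo" \
--gcs.mongo.output.collection="analysis" \
--gcs.mongo.output.mode="overwrite" \
--gcs.mongo.batch.size=1024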

Setting additional Spark properties

In case you need to specify Spark properties supported by Dataproc Serverless, such as adjusting the number of drivers, cores, executors, etc., you can edit the OPT_PROPERTIES values in the start.sh file.
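
As a rough sketch, these follow the usual Spark key=value form; resource-related settings like the ones below could be supplied (the exact way start.sh assembles OPT_PROPERTIES may differ, so treat these as illustrative values only):

# illustrative Spark properties for Dataproc Serverless resource tuning
spark.executor.instances=10
spark.executor.cores=4
spark.driver.cores=4
spark.executor.memory=8g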

References
https://medium.com/google-cloud/importing-data-from-gcs-to-databases-via-jdbc-using-dataproc-serverless-7ed75eab93ba
https://github.com/GoogleCloudPlatform/dataproc-templates
