Load data from GCS to Bigtable — using GCP Dataproc Serverless
Dataproc Serverless, one of the newest features of the Google Cloud Dataproc platform, enables customers to run Spark workloads without having to create or maintain clusters. Once the Spark workload parameters have been specified and the job has been submitted to the service, Dataproc Serverless takes care of all the necessary infrastructure in the background. This lets developers focus on the core logic of the application rather than spending time managing the framework.
Thanks to Dataproc Templates, we can run typical use cases on Dataproc Serverless using Java and Python without having to write them from scratch. With these templates, we can simply customise and run common Spark workloads.
This blog article can be useful if you are looking for a Spark-Java template to move data from GCS to Bigtable using Dataproc Serverless.
Key Benefits
- GCSToBigTable Template is open source, configuration-driven, and ready to use.
- Supported file formats are CSV, Parquet and Avro.
- By simply altering the connection parameters, these templates can be reused for use cases with similar requirements relatively quickly.
Basic Usage
1) Google Cloud SDK installed and authenticated. You can use Cloud Shell in the Google Cloud Console, which comes with an environment that is already configured.
2) Ensure the subnet has Private Google Access enabled. Even if you are going to use the "default" VPC network generated by GCP, you will still need to enable private access as below.
gcloud compute networks subnets update default --region=us-central1 --enable-private-ip-google-access
3) Clone the git repo in Cloud Shell, which comes pre-installed with the necessary tools. Alternatively, you may use any machine pre-installed with JDK 8+, Maven and Git.
git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
cd dataproc-templates/java
4) Create a GCS bucket and a staging folder. This bucket will be used to store the dependencies/JAR files required to run the serverless job, for example:
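A minimal sketch with a hypothetical bucket name and region (a GCS "folder" is just an object prefix, e.g. gs://your-staging-bucket/staging):
gsutil mb -l us-central1 gs://your-staging-bucket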
5) Create a GCS bucket and an input folder. The input files (CSV/Avro/Parquet) need to be uploaded to this folder, and their location is provided as an argument at execution time, for example:
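For instance, assuming a hypothetical input bucket and the sample CSV file shown later in this post:
gsutil mb -l us-central1 gs://your-input-bucket
gsutil cp Us-cities.csv gs://your-input-bucket/test/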
6) Create a Bigtable table with the required column family within your Bigtable instance (create the instance first if it does not exist), for example:
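One way to do this is with the cbt CLI; the project, instance, cluster, zone, table and column-family values below are placeholders:
cbt -project your-project-id createinstance your-bt-instance-id "your-bt-instance" your-bt-cluster us-central1-a 1 SSD
cbt -project your-project-id -instance your-bt-instance-id createtable your-bt-table
cbt -project your-project-id -instance your-bt-instance-id createfamily your-bt-table cf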
7) Obtain authentication credentials (to submit the Dataproc job).
gcloud auth application-default login
8) Configure the Dataproc Serverless job:
To execute a Dataproc job, the following configuration needs to be set.
GCP_PROJECT: The GCP project to run Dataproc Serverless on.
REGION: The region to run Dataproc Serverless on.
GCS_STAGING_LOCATION: A GCS location where Dataproc will store staging assets. It should be within the bucket that was created earlier.
export GCP_PROJECT=<project_id>
export REGION=<region>
export GCS_STAGING_LOCATION=<gcs-staging-bucket-folder>
9) Execute the GCS to Bigtable template, specifying the template name and the following argument values for the execution:
project.id: GCP project id
gcs.bigtable.input.location: GCS location of the input file
gcs.bigtable.input.format: Format of the input file (csv/avro/parquet)
gcs.bigtable.output.instance.id: Instance id of the Bigtable instance
gcs.bigtable.output.project.id: Project id of the Bigtable instance
gcs.bigtable.table.name: Bigtable table name
gcs.bigtable.column.family: Column family of the Bigtable table
bin/start.sh \
-- --template GCSTOBIGTABLE \
--templateProperty project.id=<gcp-project-id> \
--templateProperty gcs.bigtable.input.location=<gcs file location> \
--templateProperty gcs.bigtable.input.format=<csv|parquet|avro> \
--templateProperty gcs.bigtable.output.instance.id=<bigtable instance Id> \
--templateProperty gcs.bigtable.output.project.id=<bigtable project Id> \
--templateProperty gcs.bigtable.table.name=<bigtable tableName> \
--templateProperty gcs.bigtable.column.family=<bigtable column family>
Here is an example submission:
export GCP_PROJECT=your-project-id
export REGION=your-region
export GCS_STAGING_LOCATION=gs://your-bucket/temp
bin/start.sh -- --template GCSTOBIGTABLE \
--templateProperty project.id=your-project-id \
--templateProperty gcs.bigtable.input.location=gs://your-bucket/test/file.csv \
--templateProperty gcs.bigtable.input.format=csv \
--templateProperty gcs.bigtable.output.instance.id=your-bt-instance-id \
--templateProperty gcs.bigtable.output.project.id=your-project-id \
--templateProperty gcs.bigtable.table.name=your-bt-table \
--templateProperty gcs.bigtable.column.family=cf
NOTE: It will ask you to enable the Dataproc API if it is not enabled already.
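You can also enable the API up front:
gcloud services enable dataproc.googleapis.com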
Here is a sample input file in GCS and the corresponding output rows in Bigtable.
Sample CSV file (Us-cities.csv)
name,post_abbr,zip,phonecode
Alabama,AL,94519,925
Alaska,AK,94520,408
Sample rows in Bigtable:
Alabama
cf:name @ 2022/12/20-06:22:34.057000
"Alabama"
cf:phonecode @ 2022/12/20-06:22:34.057000
"925"
cf:post_abbr @ 2022/12/20-06:22:34.057000
"AL"
cf:zip @ 2022/12/20-06:22:34.057000
"94519"
Alaska
cf:name @ 2022/12/20-06:22:34.602000
"Alaska"
cf:phonecode @ 2022/12/20-06:22:34.602000
"408"
cf:post_abbr @ 2022/12/20-06:22:34.602000
"AK"
cf:zip @ 2022/12/20-06:22:34.602000
"94520"
10) Monitor the Spark batch job
After submitting the job, we will be able to see it in the Dataproc Batches UI. From there, we can view both metrics and logs for the job.
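The batch can also be inspected from the command line; the batch id below is a placeholder taken from the Batches UI or from the list output:
gcloud dataproc batches list --region=us-central1
gcloud dataproc batches describe <batch-id> --region=us-central1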
References
https://cloud.google.com/bigtable/docs/overview
https://github.com/GoogleCloudPlatform/dataproc-templates
For any queries/suggestions reach out to: dataproc-templates-support-external@googlegroups.com