[PySpark] Load data from GCS to Bigtable — using GCP Dataproc Serverless

Swapnil Gupta
Google Cloud - Community
5 min read · Dec 13, 2022

One of the newest features of the Google Cloud Dataproc platform, Dataproc Serverless, enables customers to run Spark workloads without having to create or maintain clusters. Once the Spark workload parameters have been specified and the job has been submitted to the service, Dataproc Serverless takes care of all the necessary infrastructure in the background. It lets developers focus on the core logic of the application rather than spending time managing infrastructure.

Google Cloud Dataproc

Thanks to Dataproc Templates, we can run typical use cases on Dataproc Serverless in Java and Python without having to write them from scratch. With these templates, we can simply customise and run common Spark workloads.

This blog article can be useful if you are looking for a PySpark template to move data from GCS to Bigtable using Dataproc Serverless.

GCS to Cloud Bigtable

Key Benefits

  • The GCSToBigTable template is open source, configuration-driven, and ready to use.
  • By simply changing the connection parameters, these templates can be reused fairly quickly for use cases with similar requirements.
  • These templates are customisable: the GitHub repository can be cloned and, with the necessary code changes, adapted and reused as needed.
  • Supported file formats are JSON, CSV, Parquet and Avro.

Prerequisites

For running these templates, we will need:

  • Google Cloud SDK installed and authenticated. You can use Cloud Shell in the Google Cloud Console, which comes with an environment that is already configured.
  • Python 3.7+ installed and added to your PATH variable.
  • Log in to your GCP project and enable the Dataproc API, if it is disabled (example commands are shown after this list).
  • Ensure the subnet you use has Private Google Access enabled. Even if you use the “default” VPC network generated by GCP, you will still need to enable private access, as below:
gcloud compute networks subnets update default --region=us-central1 --enable-private-ip-google-access
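
For reference, the project login and Dataproc API steps above can also be done from the CLI. A minimal sketch; substitute your own project ID:

# Set the active project (replace the placeholder with your project ID)
gcloud config set project <project_id>

# Enable the Dataproc API if it is not already enabled
gcloud services enable dataproc.googleapis.com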

Required JAR files

When submitting the job, a few HBase and Bigtable dependencies must be supplied. These dependencies must be provided using the --jars flag or, in the case of Dataproc Templates, the JARS environment variable.

Some dependencies (jars) must be downloaded from the MVN Repository and placed in your GCS bucket (create one to store the dependencies).

Apache HBase Spark Connector dependencies (already available in the Dataproc Serverless runtime, so you can refer to them using file://):

file:///usr/lib/spark/external/hbase-spark-protocol-shaded.jar
file:///usr/lib/spark/external/hbase-spark.jar

Bigtable dependency

gs://<your_bucket_to_store_dependencies>/bigtable-hbase-2.x-shaded-2.3.0.jar

Download it using wget:

wget https://repo1.maven.org/maven2/com/google/cloud/bigtable/bigtable-hbase-2.x-shaded/2.3.0/bigtable-hbase-2.x-shaded-2.3.0.jar

HBase dependencies

gs://<your_bucket_to_store_dependencies>/hbase-client-2.4.12.jar

Download it using wget:

wget https://repo1.maven.org/maven2/org/apache/hbase/hbase-client/2.4.12/hbase-client-2.4.12.jar

gs://<your_bucket_to_store_dependencies>/hbase-shaded-mapreduce-2.4.12.jar

Download it using wget:

wget https://repo1.maven.org/maven2/org/apache/hbase/hbase-shaded-mapreduce/2.4.12/hbase-shaded-mapreduce-2.4.12.jar
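
After downloading the three JARs above, upload them to the dependency bucket referenced earlier. A minimal sketch, assuming the files were downloaded to the current directory:

# Copy the downloaded dependency JARs to your GCS bucket
gsutil cp bigtable-hbase-2.x-shaded-2.3.0.jar \
  hbase-client-2.4.12.jar \
  hbase-shaded-mapreduce-2.4.12.jar \
  gs://<your_bucket_to_store_dependencies>/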

GCSToBigTable Template Requirements

The template uses the Apache HBase Spark Connector to write to Bigtable.

Because the PySpark job connects to Bigtable through the HBase interface, a few additional configuration steps are required to run it on Dataproc Serverless:

  1. Configure the hbase-site.xml (reference) with your Bigtable instance reference. The hbase-site.xml needs to be available in some path of the container image used by Dataproc Serverless. For that, you need to build and host a custom container image in GCP Container Registry (a minimal hbase-site.xml sketch is shown right after this step).
  • Add the following layer to the Dockerfile, for it to copy your local hbase-site.xml to the container image (already done in the template’s Dockerfile):
COPY hbase-site.xml /etc/hbase/conf/
  • Build the Docker image and push it to GCP Container Registry with:
IMAGE=gcr.io/<your_project>/<your_custom_image>:<your_version>
docker build -t "${IMAGE}" .
docker push "${IMAGE}"
  • The SPARK_EXTRA_CLASSPATH environment variable should also be set to the same path when submitting the job:
--container-image="gcr.io/<your_project>/<your_custom_image>:<your_version>"  # image with hbase-site.xml in /etc/hbase/conf/
--properties='spark.dataproc.driverEnv.SPARK_EXTRA_CLASSPATH=/etc/hbase/conf/'
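
For reference, a minimal hbase-site.xml pointing the HBase client at a Bigtable instance could look like the sketch below. This is an illustrative example based on the bigtable-hbase client properties, not the exact file from the repository; replace the project and instance values with your own:

# Write a minimal hbase-site.xml (illustrative sketch)
cat > hbase-site.xml <<'EOF'
<configuration>
  <!-- Route HBase client calls to Cloud Bigtable -->
  <property>
    <name>hbase.client.connection.impl</name>
    <value>com.google.cloud.bigtable.hbase2_x.BigtableConnection</value>
  </property>
  <!-- Your GCP project and Bigtable instance -->
  <property>
    <name>google.bigtable.project.id</name>
    <value>your-project-id</value>
  </property>
  <property>
    <name>google.bigtable.instance.id</name>
    <value>your-bigtable-instance</value>
  </property>
</configuration>
EOF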

2. Configure the desired HBase catalog JSON to be passed as an argument (table reference and schema). It should be passed using the --gcs.bigtable.hbase.catalog.json argument:

--gcs.bigtable.hbase.catalog.json='''{
"table":{"namespace":"default","name":"<table_id>"},
"rowkey":"key",
"columns":{
"key":{"cf":"rowkey", "col":"key", "type":"string"},
"name":{"cf":"cf", "col":"name", "type":"string"}
}
}'''

3. Create and manage your Bigtable table (schema, column families, etc.) to match the provided HBase catalog, for example with the cbt CLI, as sketched below.
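
As an illustration, for a table named my_table with the column family cf (as in the sample catalog used in this post), the matching Bigtable table could be created with the cbt CLI. A sketch, assuming cbt is installed and pointed at your project and instance:

# Create the table and the column family referenced by the catalog
# (the "rowkey" entry in the catalog is the row key, not a real column family)
cbt -project <project_id> -instance <bigtable_instance> createtable my_table
cbt -project <project_id> -instance <bigtable_instance> createfamily my_table cf

# Verify the table and its column families
cbt -project <project_id> -instance <bigtable_instance> ls my_table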

Configuration Arguments

This template includes the following arguments to configure the execution:

  • gcs.bigtable.input.location: GCS location of the input files (format: gs://<bucket>/...)
  • gcs.bigtable.input.format: Input file format (one of: avro,parquet,csv,json)
  • gcs.bigtable.hbase.catalog.json: HBase catalog as inline JSON
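
As an example of how the catalog maps to the input data, a tiny CSV input for the sample catalog (columns key and name) could be created and uploaded as below. This is a hypothetical file, and the column names are assumed to match the catalog columns:

# Create a small sample CSV (hypothetical data)
cat > sample.csv <<'EOF'
key,name
r1,Alice
r2,Bob
EOF

# Upload it to the location passed via gcs.bigtable.input.location
gsutil cp sample.csv gs://<bucket>/<path>/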

Steps to execute GCSToBigTable Dataproc Template

1. Create a GCS bucket to use as the staging location for Dataproc. This bucket will be used to store the dependencies/JAR files required to run our serverless job.

export STAGING_BUCKET=<gcs-staging-bucket-folder>
gsutil mb gs://$STAGING_BUCKET

2. Clone the Dataproc Templates repository and navigate to the directory for the Python template.

git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git

cd dataproc-templates/python

3. Obtain authentication credentials (to submit the job).

gcloud auth application-default login

4. Configure the Dataproc Serverless job

To submit the job to Dataproc Serverless, we will use the provided bin/start.sh script. The script requires us to configure the Dataproc Serverless job using environment variables.

The mandatory configurations are:

  • GCP_PROJECT : The GCP project to run Dataproc Serverless on.
  • REGION : The region to run Dataproc Serverless on.
  • GCS_STAGING_LOCATION : A GCS location where Dataproc will store staging assets. It should be within the bucket we created earlier.
export GCP_PROJECT=<project_id>
export REGION=<region>
export GCS_STAGING_LOCATION=gs://$STAGING_BUCKET/staging

Export the JARS environment variable pointing to the dependency JARs. You can also choose to store the JAR files in a bucket you own.

export JARS="gs://<your_bucket_to_store_dependencies>/bigtable-hbase-2.x-shaded-2.3.0.jar, \
gs://<your_bucket_to_store_dependencies>/hbase-client-2.4.12.jar, \
gs://<your_bucket_to_store_dependencies>/hbase-shaded-mapreduce-2.4.12.jar, \
file:///usr/lib/spark/external/hbase-spark-protocol-shaded.jar, \
file:///usr/lib/spark/external/hbase-spark.jar"

5. Execute the GCS To Bigtable Dataproc template

After configuring the job, we are ready to trigger it. We will run the bin/start.sh script, specifying the template we want to run and the argument values for the execution. Note that the --container-image flag points to the custom image that contains hbase-site.xml in /etc/hbase/conf/.

./bin/start.sh \
--container-image="gcr.io/<your_project>/<your_custom_image>:<your_version>" \
--properties='spark.dataproc.driverEnv.SPARK_EXTRA_CLASSPATH=/etc/hbase/conf/' \
-- --template=GCSTOBIGTABLE \
--gcs.bigtable.input.format="<json|csv|parquet|avro>" \
--gcs.bigtable.input.location="<gs://bucket/path>" \
--gcs.bigtable.hbase.catalog.json='''{
"table":{"namespace":"default","name":"my_table"},
"rowkey":"key",
"columns":{
"key":{"cf":"rowkey", "col":"key", "type":"string"},
"name":{"cf":"cf", "col":"name", "type":"string"}
}
}'''

6. Monitor the Spark batch job

After submitting the job, we will be able to see it in the Dataproc Batches UI. From there, we can view both metrics and logs for the job.
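
The same information can also be pulled from the CLI, for example:

# List recent Dataproc Serverless batches in the region
gcloud dataproc batches list --region=<region>

# Show details for a specific batch (replace <batch_id> with the ID shown in the list output or the UI)
gcloud dataproc batches describe <batch_id> --region=<region>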

References

For any queries/suggestions reach out to: dataproc-templates-support-external@googlegroups.com
