Using Dataproc Serverless to migrate your HBase data to GCS

Vasu Mittal
Google Cloud - Community
4 min read · Dec 13, 2022

We can use Dataproc Serverless to run Spark batch workloads without provisioning and managing our own cluster. We can specify workload parameters, and then submit the workload to the Dataproc Serverless service.

Dataproc Serverless takes the whole job of infrastructure management off the user's plate: to execute Apache Spark workloads, users are not required to create a cluster first. They can simply pick the template that matches their use case and run the job with a few clicks and commands.
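For context, a generic Dataproc Serverless Spark batch submission looks roughly like the sketch below; the jar path, main class, and region are placeholders, not values from this guide (the template in this post wraps this submission for you via its start.sh script):

# Illustrative only: generic Dataproc Serverless batch submission with placeholder jar and class
gcloud dataproc batches submit spark \
--region=us-central1 \
--jars=gs://my-bucket/my-spark-job.jar \
--class=com.example.MySparkJob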

HBase to GCS migration using Dataproc Serverless

Objective

This blog post shares complete end-to-end details on how you can use the "HBase to GCS" Dataproc Serverless template for data migration. The template moves data from HBase tables to GCS buckets.

Set up your GCP Project and Infra

  1. Log in to your GCP project and enable the Dataproc API (if it is disabled).
  2. Make sure the subnet has Private Google Access enabled. Even if you are using the "default" VPC network generated by GCP, you still have to enable private access, as shown below:
gcloud compute networks subnets update default --region=us-central1 --enable-private-ip-google-access
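If you want to double-check that the flag took effect, you can describe the subnet and read back the privateIpGoogleAccess field:

gcloud compute networks subnets describe default --region=us-central1 --format="get(privateIpGoogleAccess)"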

3. Create a GCS bucket and staging location for jar files.

export GCS_STAGING_BUCKET="my-gcs-staging-bucket"
gsutil mb gs://$GCS_STAGING_BUCKET
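You can also export the staging location now so it is ready for later steps; pointing it at the bucket itself, as the sample execution later in this post does, is one simple option:

export GCS_STAGING_LOCATION=gs://$GCS_STAGING_BUCKET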

4. To configure the Dataproc Serverless job, you need to export the following variables:

GCP_PROJECT : GCP project id to run Dataproc Serverless on.

REGION : Region to run Dataproc Serverless in.

GCS_STAGING_LOCATION : GCS staging bucket location, where Dataproc will store staging assets (See Step 3).

Steps to execute the Dataproc Template

  1. Clone the Dataproc Templates repository and navigate to the Java template folder.
git clone https://github.com/GoogleCloudPlatform/dataproc-templates.git
cd dataproc-templates/java

2. Get authentication credentials (to submit the job).

gcloud auth application-default login

3. Configure the Dataproc Serverless job by exporting the variables needed for submission (as explained in Step 4 of "Set up your GCP Project and Infra").

export GCP_PROJECT=<project_id> # your Google Cloud project
export REGION=<region> # your region, for example: us-central1
export SUBNET=<subnet> # optional if you are using the default
# export GCS_STAGING_LOCATION=<gcs-staging-bucket-folder> # already done in Step 3 (under "Set up your GCP Project and Infra")

4. Setting up HBase dependencies: a few HBase dependencies need to be passed when submitting the job to Dataproc Serverless. The script sets these automatically when the CATALOG environment variable is set for the HBase table configuration. If it is not set, the dependencies have to be passed explicitly using the --jars flag or, in the case of Dataproc Templates, the JARS environment variable (see the sketch after the dependency list below).

Apache HBase Spark Connector dependencies (these are already mounted in Dataproc Serverless, so you can reference them using file://):

  • file:///usr/lib/spark/external/hbase-spark-protocol-shaded.jar
  • file:///usr/lib/spark/external/hbase-spark.jar
  • All other dependencies are automatically downloaded and set once the CATALOG environment variable is used for the HBase table configuration. Library links (for reference): hbase-client, hbase-shaded-mapreduce
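If you do end up passing the jars yourself, a minimal sketch of the JARS variable using the two mounted connector jars listed above would be:

# Only needed if the start-up script does not set these for you
export JARS="file:///usr/lib/spark/external/hbase-spark-protocol-shaded.jar,file:///usr/lib/spark/external/hbase-spark.jar"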

5. Passing the hbase-site.xml to the job: there are two ways to do this, either automatically or by manually creating a custom container image. Both are illustrated below:

I. Automatic creation of the custom container: this is handled by the start-up script when the HBASE_SITE_PATH environment variable is set.

II. Configure the hbase-site.xml manually and create a container image. The steps are:

  • The hbase-site.xml needs to be available at a known path inside the container image used by Dataproc Serverless.
  • A reference hbase-site.xml can be used by filling in the respective values for hbase.rootdir and hbase.zookeeper.quorum (see the sample after the build commands below).
  • A custom container image is required in GCP Container Registry. Refer to the Dockerfile for reference.
  • Add the following layer to the Dockerfile to copy your local hbase-site.xml into the container image:
COPY hbase-site.xml /etc/hbase/conf/

You can use and adapt the Dockerfile from the guide above, building and pushing it to GCP Container Registry with:

IMAGE=gcr.io/<your_project>/<your_custom_image>:<your_version>
docker build -t "${IMAGE}" .
docker push "${IMAGE}"
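For reference, the hbase-site.xml that the COPY step above picks up could be created as sketched below; the rootdir and ZooKeeper hosts are placeholders that you must replace with your own cluster's values:

# Illustrative only: write a minimal hbase-site.xml with placeholder values
cat > hbase-site.xml <<'EOF'
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://your-hbase-master:8020/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.quorum</name>
    <value>zk-host-1,zk-host-2,zk-host-3</value>
  </property>
</configuration>
EOF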

6. Execute the command below:

Note: it is important to set the CATALOG environment variable here to provide the HBase connection details and so that the script can download the required dependencies.

export GCP_PROJECT=<gcp-project-id>
export REGION=<region>
export SUBNET=<subnet>
export GCS_STAGING_LOCATION=<gcs-staging-bucket-folder>
export IMAGE_NAME_VERSION=<name:version of image>
export HBASE_SITE_PATH=<path to hbase-site.xml>
export CATALOG=<catalog of hbase table>
export IMAGE=gcr.io/${GCP_PROJECT}/${IMAGE_NAME_VERSION} # use the image that was created with the configured hbase-site.xml

bin/start.sh \
--container-image=$IMAGE \
--properties='spark.dataproc.driverEnv.SPARK_EXTRA_CLASSPATH=/etc/hbase/conf/' \
-- --template HBASETOGCS \
--templateProperty hbasetogcs.output.fileformat=<avro|csv|parquet|json|orc> \
--templateProperty hbasetogcs.output.savemode=<Append|Overwrite|ErrorIfExists|Ignore> \
--templateProperty hbasetogcs.output.path=<output-gcs-path> \
--templateProperty hbasetogcs.table.catalog=$CATALOG

Sample Execution

Please refer to the sample execution below:

export GCP_PROJECT=myproject
export REGION=us-central1
export GCS_STAGING_LOCATION=gs://staging_bucket
export JOB_TYPE=SERVERLESS
export SUBNET=projects/myproject/regions/us-central1/subnetworks/default
export IMAGE_NAME_VERSION=dataproc-hbase:1
export HBASE_SITE_PATH=src/main/resources/hbase-site.xml
export CATALOG='{"table":{"namespace":"default","name":"my_table"},"rowkey":"key","columns":{"key":{"cf":"rowkey","col":"key","type":"string"},"name":{"cf":"cf","col":"name","type":"string"}}}'
export IMAGE=gcr.io/${GCP_PROJECT}/${IMAGE_NAME_VERSION} # set this to pass the custom image during job submit

bin/start.sh \
--container-image=$IMAGE \
--properties='spark.dataproc.driverEnv.SPARK_EXTRA_CLASSPATH=/etc/hbase/conf/' \
-- --template HBASETOGCS \
--templateProperty hbasetogcs.output.fileformat=csv \
--templateProperty hbasetogcs.output.savemode=append \
--templateProperty hbasetogcs.output.path=gs://myproject/output \
--templateProperty hbasetogcs.table.catalog=$CATALOG

Sample Catalog:

{
  "table": {
    "namespace": "default",
    "name": "my_table"
  },
  "rowkey": "key",
  "columns": {
    "key": {
      "cf": "rowkey",
      "col": "key",
      "type": "string"
    },
    "name": {
      "cf": "cf",
      "col": "name",
      "type": "string"
    }
  }
}

Also, in case you need to specify Spark properties supported by Dataproc Serverless (for example, to adjust the number of drivers, cores, or executors), you can edit the OPT_PROPERTIES values in the start.sh file.
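As a sketch, the kind of entries you might add there are standard Spark resource settings such as the ones below; the values are placeholders, and the exact formatting should follow how start.sh already builds the variable:

# Illustrative only: example Spark resource properties appended to OPT_PROPERTIES in start.sh
OPT_PROPERTIES="${OPT_PROPERTIES},spark.driver.cores=4,spark.executor.cores=4,spark.executor.instances=10"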

7. Monitor the Spark batch job

After submitting the job, we will be able to see it in the Dataproc Batches UI. From there, we can view both metrics and logs for the job.
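If you prefer the command line, the same batch can also be inspected with gcloud (the batch ID below is a placeholder):

# List recent batches in the region, then inspect one of them
gcloud dataproc batches list --region=us-central1
gcloud dataproc batches describe <batch-id> --region=us-central1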

References

For any queries/suggestions please reach out to: dataproc-templates-support-external@googlegroups.com
