HBase to Bigtable Migration Strategy using Snapshots - Lab

Aparna Vittala
Google Cloud - Community
7 min read · Dec 27, 2022
HBase to Bigtable Migration

The HBase to Cloud Bigtable migration involves moving data from an on-premises HBase table to Cloud Bigtable on GCP. When migrating from on-premises HBase to the cloud, Bigtable is preferred because it is a fully managed, cloud-based NoSQL database that also provides an HBase-compatible client, which keeps application changes to a minimum.

While there are a number of references available for HBase to Bigtable migration, such as Migrating Data from HBase to Cloud Bigtable | Migrating Hadoop to GCP | Google Cloud, the objective of this blog is to provide a detailed implementation guide along with a sample dataset.

Prerequisites for Hands-on Lab:

  1. Create a Dataproc cluster with HBase and ZooKeeper and their Web UIs enabled (see the provisioning sketch after this list).
  2. Create a sample bucket on Google Cloud Storage for the demo.
  3. Create a Bigtable instance.
  4. Download the Beam import jar and the HBase to Bigtable schema translation jar.
  5. For this lab, a sample data file, emp_data, is used; it can be downloaded and used for testing.
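
For reference, the cluster and bucket (items 1 and 2) can be provisioned roughly as sketched below; the Bigtable instance is created later in the provisioning step. The cluster name, bucket name, region, and image version here are placeholders rather than values from the original lab, and the HBASE optional component is only available on certain Dataproc image versions, so check the current gcloud documentation before running.

# Hypothetical example: Dataproc cluster with the HBase and ZooKeeper optional
# components and the component gateway (Web UIs) enabled.
gcloud dataproc clusters create hbase-lab-cluster \
  --region=us-central1 \
  --image-version=2.0 \
  --optional-components=HBASE,ZOOKEEPER \
  --enable-component-gateway

# Hypothetical example: Cloud Storage bucket for snapshots and staging files.
gsutil mb -l us-central1 gs://hbase-migration-lab-bucket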

Overview and Implementation:

To replicate the scenario of HBase to Bigtable migration, a Dataproc cluster with Hbase pre-installed is set up. The configuration details for Hbase can be found in the WebUI for the Hbase on Dataproc.

Since reading data from HBase region servers directly impacts the performance of the live HBase cluster, the following approach can be used to capture the data in HBase and migrate to Cloud Bigtable:

  1. Take a snapshot of the table from HBase cluster
  2. Export the snapshot to a Cloud Storage bucket
  3. Start a DataFlow job to read the snapshot in the Cloud Storage bucket and import the data into the Cloud Bigtable table in the replicated cluster.

The following diagram presents detailed steps required for the migration.

Cloud Bigtable initial load strategy diagram

Below are the detailed steps referring to the figure above.

Step 1: Pre-migration state

Before the initial load into Bigtable, the prerequisites listed above need to be prepared and created in a GCP project. For a typical customer environment, these are the key items to have in place before the initial load.

Step 2: Create a sample table on the HBase server

For this lab, create a sample table from the emp_data sample data file.

create 'emp_data',{NAME => 'cf'}

Load the sample data from the CSV file into the HBase table:

hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.separator=',' \
-Dimporttsv.columns='HBASE_ROW_KEY,cf:ename,cf:designation,cf:manager,cf:hire_date,cf:sal,cf:deptno' emp_data /user/aparnavittala/emp_data.csv

List the contents of the table to verify if the data has been loaded.

scan 'emp_data'

Note: In HBase, namespaces are similar to schemas/databases. In this example, the default namespace is used for simplicity.
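
For illustration only, if a non-default namespace were used instead, the table creation and snapshot commands would reference it with the namespace:table syntax; the hr namespace below is hypothetical and not part of this lab.

# HBase shell: create a namespace and a table inside it, then snapshot it.
create_namespace 'hr'
create 'hr:emp_data', {NAME => 'cf'}
snapshot 'hr:emp_data', 'emp_data_snapshot'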

Define the environment variables as needed.

PROJECT_ID=<PROJECT_ID>
INSTANCE_ID=<INSTANCE_ID>
TABLE_NAME_REGEX=<TABLE_NAME>
ZOOKEEPER_QUORUM=<ZOOKEEPER_QUORUM>
ZOOKEEPER_PORT=<ZOOKEEPER_PORT>
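
As an illustration only, on a Dataproc HBase cluster these might be filled in as follows; the project and instance IDs are placeholders, and ZooKeeper typically runs on the cluster master node (here the same hive-hbase-test-m host used in the schema translator command later) on port 2181.

PROJECT_ID=my-gcp-project
INSTANCE_ID=my-bigtable-instance
TABLE_NAME_REGEX=emp_data
ZOOKEEPER_QUORUM=hive-hbase-test-m
ZOOKEEPER_PORT=2181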

Step 3: Take snapshots of the HBase tables

Take table snapshots by executing the following command in the HBase shell from an on-premises edge node. For the lab, we use the table we just created on the Dataproc HBase cluster.

hbase> snapshot '<tableName>', '<snapshotName>'
hbase> snapshot 'emp_data', 'emp_data_snapshot'

Note: Ensure that the “hbase.snapshot.master.timeout.millis” and “hbase.snapshot.region.timeout” properties in hbase-site.xml are set to a sufficiently large number to avoid timeouts on taking the snapshot. If snapshots are regularly taken for backup purposes, these properties are expected to be tuned appropriately.
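
As a minimal sketch only, the two timeout properties might be raised along the following lines; the 600000 ms (10 minute) value is illustrative, and the fragment is written to a scratch file so it can be reviewed and merged into the <configuration> element of hbase-site.xml through whatever mechanism manages your cluster's configuration.

# Write an hbase-site.xml fragment with generous snapshot timeouts (values are illustrative).
cat <<'EOF' > hbase-snapshot-timeouts-snippet.xml
<property>
  <name>hbase.snapshot.master.timeout.millis</name>
  <value>600000</value>
</property>
<property>
  <name>hbase.snapshot.region.timeout</name>
  <value>600000</value>
</property>
EOF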

Step 4: Create hashes for validation

Next, create hashes to use for validation after the migration is complete. HashTable is a validation tool provided by HBase that computes hashes for row ranges and exports them to files. You can run a sync-table job on the destination table to match the hashes and gain confidence in the integrity of migrated data.

Run the following command for the table that we just exported:

hbase org.apache.hadoop.hbase.mapreduce.HashTable --batchsize=32000 --numhashfiles=20 \
emp_data /user/hbase/emp_data

Step 5: Install the GCS connector

This step is not required for this lab. However, for a typical migration scenario, install the GCS connector as described below.

The Cloud Storage connector library must be installed on the HBase cluster, along with some configuration changes. The following steps can be performed on the Hadoop cluster (e.g., an edge node) to configure access to Cloud Storage:

  1. Download the Cloud Storage connector (gcs-connector-hadoop2-2.1.3-shaded.jar). Make sure that the shaded jar has the -shaded.jar suffix.
  2. Create a parcel for the Cloud Storage connector JAR file and distribute the parcel to all hosts in the cluster.
  3. Create a service account (if not already created) and download the private key in JSON format (see the sketch after this list).
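
As a rough sketch of steps 1 and 3, the commands might look like the following. The service account name, key path, and bucket are placeholders, and the connector download URL is an assumption based on the Maven Central layout, so verify it against the current GCS connector releases.

# Hypothetical example: download the shaded GCS connector jar (URL/version are assumptions).
curl -fL -o /usr/lib/hadoop/lib/gcs-connector-hadoop2-2.1.3-shaded.jar \
  https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop2-2.1.3/gcs-connector-hadoop2-2.1.3-shaded.jar

# Hypothetical example: service account plus JSON key for the connector to use.
gcloud iam service-accounts create hbase-gcs-writer --display-name="HBase snapshot export"
gcloud iam service-accounts keys create /etc/security/keys/hbase-gcs-writer.json \
  --iam-account=hbase-gcs-writer@my-gcp-project.iam.gserviceaccount.com

# Grant the service account write access to the snapshot bucket (names are placeholders).
gsutil iam ch \
  serviceAccount:hbase-gcs-writer@my-gcp-project.iam.gserviceaccount.com:roles/storage.objectAdmin \
  gs://hbase-migration-lab-bucket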

Step 6: Set up the GCS connector

Add or modify the following properties in core-site.xml and distribute the file to all nodes in the cluster (example values are sketched after the list):

  • fs.AbstractFileSystem.gs.impl
  • fs.gs.project.id
  • fs.gs.auth.service.account.enable
  • google.cloud.auth.service.account.json.keyfile
  • fs.gs.http.transport.type
  • fs.gs.proxy.address (if required)
  • fs.gs.proxy.username (if required)
  • fs.gs.proxy.password (if required)
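
A minimal sketch of typical values for these properties is shown below, assuming service-account JSON key authentication; the project ID and keyfile path are placeholders. The fragment is written to a scratch file so it can be reviewed and merged into the <configuration> element of core-site.xml through your cluster's configuration management.

# Write a core-site.xml fragment with typical GCS connector settings (values are placeholders).
cat <<'EOF' > gcs-connector-core-site-snippet.xml
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
</property>
<property>
  <name>fs.gs.project.id</name>
  <value>my-gcp-project</value>
</property>
<property>
  <name>fs.gs.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>/etc/security/keys/hbase-gcs-writer.json</value>
</property>
<property>
  <name>fs.gs.http.transport.type</name>
  <value>JAVA_NET</value>
</property>
EOF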

Modify the Hadoop class path to point to the Cloud Storage connector jar file in the parcel.

export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:/usr/lib/hadoop/lib/<gcs-connector-jar-file>

Verify access to the Cloud Storage bucket.

hadoop fs -ls gs://<GCS-BUCKET>

Step 7: Run the HBase ExportSnapshot job

Execute the following command on an edge node on the Hadoop cluster to export the snapshot to the Cloud Storage bucket.

hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -snapshot 'emp_data_snapshot' -copy-to gs://<snapshot-bucket-name> -mappers 3 -bandwidth 40

  • Use the -mappers option to control the number of mappers in the export job
  • Use the -bandwidth option to limit the bandwidth (in MB/s) to be used
  • Alternatively, the number of mappers can also be controlled by using the -Dsnapshot.export.default.map.group property to assign a certain number of HFiles to each mapper (see the sketch below)
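
As a rough sketch, the property-based alternative from the last bullet might be invoked as follows; the value of 10 HFiles per mapper is illustrative, and the exact handling of -D options can vary between HBase versions.

# Illustrative only: group roughly 10 HFiles per mapper instead of fixing the mapper count.
hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot \
  -Dsnapshot.export.default.map.group=10 \
  -snapshot 'emp_data_snapshot' \
  -copy-to gs://<snapshot-bucket-name> \
  -bandwidth 40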

The time required to complete the table export depends on the network bandwidth and on the parallelism of the mappers copying the underlying HFiles to Cloud Storage.

Step 8: Provision a Bigtable instance from the console or the command line.
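
For reference, a hedged sketch of creating the instance from the command line is shown below; the instance ID, cluster ID, zone, and node count are placeholders rather than values from the lab, and gcloud flag names may differ between releases.

# Hypothetical example: a single-cluster Bigtable instance for the migration lab.
gcloud bigtable instances create my-bigtable-instance \
  --display-name="HBase migration lab" \
  --cluster-config=id=my-bt-cluster,zone=us-central1-b,nodes=1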

Step 9: Import Table Schema

We can use the Schema Translator tool, or use an alternate method of exporting and importing the schema file.

If your HBase master is in a private VPC or cannot connect to the internet, you can export the HBase schema to a file and use that file to create tables in Cloud Bigtable. Otherwise, we can use the Schema Translator tool directly.

Using Schema Translator:

On a host that can connect to HBase, define the export location for your schema file.

HBASE_EXPORT_PATH=gs://hbase_test_load/output/hbase-schema.json

Run the export tool from the host.

java \
-Dgoogle.bigtable.table.filter=emp_data \
-Dgoogle.bigtable.output.filepath=/home/aparnavittala/hbase-schema.json \
-Dhbase.zookeeper.quorum=hive-hbase-test-m:2181 \
-Dhbase.zookeeper.property.clientPort=2181 \
-jar bigtable-hbase-1.x-tools-2.0.0-jar-with-dependencies.jar

Copy the schema file to a host which can connect to Google Cloud.

gsutil cp /home/aparnavittala/hbase-schema.json gs://hbase_test_load/output/hbase-schema.json

Create tables in Cloud Bigtable using the schema file

gsutil cp gs://hbase_test_load/output/hbase-schema.json .
java \
-Dgoogle.bigtable.project.id=<PROJECT-ID> \
-Dgoogle.bigtable.instance.id=<BIGTABLE-INSTANCE-ID> \
-Dgoogle.bigtable.input.filepath=gs://hbase_test_load/output/hbase-schema.json \
-jar bigtable-hbase-1.x-tools-2.0.0-jar-with-dependencies.jar

Verify that the Schema Translator ran successfully by checking for the following two messages in the logs.

19:09:19.520 [main] INFO c.g.c.b.h.t.HBaseSchemaTranslator — Read schema with 1 tables.

19:09:23.533 [main] INFO c.g.c.b.h.t.HBaseSchemaTranslator — Created table emp_data in Bigtable.

Alternatively, we can follow this approach to export and import the schema.

Step 10: Run the Dataflow jobs

After you have a table ready to migrate your data to, you are ready to import and validate your data.

  • Bigtable import job

IMPORT_JAR="bigtable-beam-import-2.0.0-shaded.jar"
java -jar $IMPORT_JAR importsnapshot \
--runner=DataflowRunner \
--project=<PROJECT_ID> \
--bigtableInstanceId=<BIGTABLE-INSTANCE-ID> \
--bigtableTableId=emp_data \
--hbaseSnapshotSourceDir=gs://<HBASE-SNAPSHOT-BUCKET> \
--snapshotName=emp_data_snapshot \
--stagingLocation=gs://<HBASE-SNAPSHOT-STAGING-BUCKET>/staging \
--tempLocation=gs://<HBASE-SNAPSHOT-STAGING-BUCKET>/staging/temp \
--maxNumWorkers=3 \
--region=us-central1
Screenshot of Dataflow Load Job
  • Data validation job

java -jar bigtable-beam-import-2.0.0-shaded.jar sync-table \
--runner=dataflow \
--project=<PROJECT_ID> \
--bigtableInstanceId=<BIGTABLE-INSTANCE-ID> \
--bigtableTableId=emp_data \
--outputPrefix=gs://<HBASE-LOAD-BUCKET>/output-emp_data-$(date +"%s") \
--stagingLocation=gs://<HBASE-LOAD-BUCKET>/sync-table/sync-table/staging \
--hashTableOutputDir=gs://<HBASE-LOAD-BUCKET>/hashtable/emp_data \
--tempLocation=gs://<HBASE-LOAD-BUCKET>/temp \
--region=us-central1
Screenshot of Validation Job

After the data has been imported and validated, the initial load activities are complete.

Common Issues:

  1. GCS connector cannot be installed on the HBase cluster
    Mitigation:
    Install the GCS connector on the HBase cluster if at all possible; the push mechanism for HBase snapshot export is the most common and proven method. If there is a blocker, choose the pull mechanism from a Dataproc HBase cluster instead, which requires additional authentication and networking configuration.
  2. Export snapshot takes too long
    Mitigation:
    This step runs longer than the other operations, and many factors affect the time needed to export data from on-premises HBase to GCS. If needed, break the snapshots into multiple batches. Another mitigation for reducing the impact on the production system is to run this operation during off-peak hours.
  3. Data load from GCS to Bigtable takes too long
    Mitigation:
    If this happens, consider increasing either the Dataflow workers or the Bigtable nodes, depending on the bottleneck found in the monitoring (see the scaling sketch after this list).
  4. Data validation fails
    Mitigation:
    The validation job may fail if it is run immediately after the Bigtable load. Wait a few minutes (~5 minutes) before triggering the validation job. If the job still fails, please refer here.
  5. Table name differs in HBase vs Bigtable
    Mitigation:
    Sometimes we may want to change the table name on Bigtable. In this scenario, we can use this documentation and map the table names between HBase and Bigtable.
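
As a rough illustration of the scaling mitigation in issue 3, the Bigtable side can be scaled up from the command line along the following lines (the instance ID, cluster ID, and node count are placeholders), while the Dataflow side can simply be re-run with a higher --maxNumWorkers value.

# Hypothetical example: add Bigtable nodes to absorb the bulk load.
gcloud bigtable clusters update my-bt-cluster \
  --instance=my-bigtable-instance \
  --num-nodes=5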
