Cross-Cloud HBase/Phoenix Data Migration

Jins George
Mar 2

Do you run HBase with Phoenix? Have you ever run into challenges migrating data from one cluster to another? If yes, then we hope this post will be helpful to you. Aeris just completed a project to migrate HBase/Phoenix data from AWS to GCP (primarily for strategic reasons) and gained some valuable practical experience that we are now ready to share.

When we decided to migrate from AWS to GCP, one of the main challenges we faced was how to migrate non-relational databases, especially self-managed ones like HBase/Phoenix. Our first approach was to export a snapshot of all HBase tables from the AWS cluster to S3 and then import it into a new cluster in GCP. The export worked perfectly fine, but we ran into issues during the import. For example, the leading character was trimmed off the first column in some tables, compromising data integrity.

Our next, and successful, approach was to use the Phoenix Spark plugin, which let us read each whole table as a Spark DataFrame from the source cluster and write it to the same table in the destination cluster.

Prerequisites

To take advantage of the Phoenix Spark plugin, there are a few prerequisites:

  1. First, you will need a Spark cluster to execute the migration script. If you are in GCP, a Dataproc cluster is a natural choice.
  2. Second, you will need to make sure that your source and destination HBase/Zookeeper nodes are routable from the Spark cluster. In our case, we established VPN connectivity between our AWS account and GCP project (a quick connectivity check is sketched after this list).
  3. Third, you’ll need to create Phoenix tables in your destination cluster with the exact same schema as the source tables.
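
To verify the second prerequisite before launching any Spark jobs, it can help to open a plain TCP connection to each Zookeeper quorum member from a node in the Spark cluster. The snippet below is a minimal sketch; the hostnames are placeholders, not endpoints from our setup.

import socket

# Placeholder quorum members; replace with your own source and destination Zookeeper hosts.
zk_hosts = ["source-zk.example.internal", "destination-zk.example.internal"]
zk_port = 2181  # default Zookeeper client port

for host in zk_hosts:
    try:
        # Attempt a TCP connection with a short timeout.
        s = socket.create_connection((host, zk_port), timeout=5)
        s.close()
        print("OK: {}:{} is reachable".format(host, zk_port))
    except socket.error as err:
        print("FAIL: {}:{} is not reachable ({})".format(host, zk_port, err))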

Once you’ve established that you meet those three prerequisites, below are the steps to take next.

Step 1: Spark Script

First, you’ll need to write the migration script. Here is a sample Python script that reads each table from the source cluster as a Spark DataFrame and writes it to the destination cluster:

from pyspark.sql import SparkSession
import sys


def phoenix_data_export_import(spark, sourceZk, destinationZk, table):
    # Read the table from the source Phoenix cluster as a DataFrame.
    df = spark.read \
        .format("org.apache.phoenix.spark") \
        .option("table", table) \
        .option("zkUrl", sourceZk) \
        .load()

    # Write the DataFrame to the same table in the destination Phoenix cluster.
    df.write \
        .format("org.apache.phoenix.spark") \
        .mode("overwrite") \
        .option("table", table) \
        .option("zkUrl", destinationZk) \
        .save()


# sys.argv[1] - Source Zookeeper URL (HOST:PORT)
# sys.argv[2] - Destination Zookeeper URL (HOST:PORT)
# sys.argv[3] - File containing the list of tables, one table per line
if __name__ == "__main__":

    spark = SparkSession \
        .builder \
        .appName("Phoenix Data Export Import Job") \
        .getOrCreate()

    # Path of the file containing the list of tables.
    tableList = open(sys.argv[3], "r")
    for table in tableList:
        try:
            phoenix_data_export_import(spark, sys.argv[1], sys.argv[2], table.strip())
        except Exception as e:
            print("Exception in export/import of table {}: {}".format(table.strip(), e))
    tableList.close()
    spark.stop()

Step 2: Create Dataproc Cluster

Next, you will create a Dataproc cluster, which comes with Spark preinstalled. Thanks to the gcloud CLI, you can launch the cluster with a single command (for example, from Cloud Shell in the GCP console). Set the arguments according to your environment, and adjust the worker machine type to match the volume of your data and how many tables you want to load concurrently.

gcloud dataproc clusters create phoenix-data-migration-cluster \
--region <region> \
--subnet <subnet> \
--no-address \
--zone "" \
--master-machine-type n1-standard-4 \
--master-boot-disk-size 100 \
--num-workers 4 \
--worker-machine-type n1-highmem-8 \
--worker-boot-disk-size 100 \
--image-version 1.3-deb9 \
--scopes 'https://www.googleapis.com/auth/cloud-platform' \
--labels component=phoenix-data-migration-cluster \
--project <your project id>

Step 3: Submit Migration Job

Now, it’s time to see things in action! Submit the migration job to the cluster with the command below, changing the arguments according to your environment. Note that the Phoenix client jar is passed in as a dependency via the --jars option.

gcloud dataproc jobs submit pyspark \
--cluster phoenix-data-migration-cluster \
--region <region> \
phoenix-data-export-import.py \
--files=tables.txt \
--jars=lib/phoenix-client-4.14.1-HBase-1.2.jar \
-- <Source Zookeeper>:2181 \
<Destination Zookeeper>:2181 \
tables.txt
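
For reference, tables.txt is simply a plain text file with one Phoenix table name per line; the names below are made up for illustration:

CUSTOMER_EVENTS
DEVICE_STATUS
BILLING.USAGE_SUMMARY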

If you want to speed up the migration, create a cluster with more resources and submit multiple jobs in parallel.
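
One way to run jobs in parallel, sketched below under the assumption that the tables are independent of each other, is to split tables.txt into chunks and launch one Dataproc job per chunk. The chunking script and chunk file names are illustrative, not part of the original workflow; the gcloud arguments mirror the command shown above.

import subprocess

NUM_JOBS = 4  # number of concurrent Dataproc jobs to submit

# Read the full table list and split it into NUM_JOBS roughly equal chunks.
with open("tables.txt") as f:
    tables = [line.strip() for line in f if line.strip()]
chunks = [tables[i::NUM_JOBS] for i in range(NUM_JOBS)]

processes = []
for i, chunk in enumerate(chunks):
    if not chunk:
        continue
    chunk_file = "tables_chunk_{}.txt".format(i)
    with open(chunk_file, "w") as out:
        out.write("\n".join(chunk) + "\n")

    cmd = [
        "gcloud", "dataproc", "jobs", "submit", "pyspark",
        "--cluster", "phoenix-data-migration-cluster",
        "--region", "<region>",
        "phoenix-data-export-import.py",
        "--files={}".format(chunk_file),
        "--jars=lib/phoenix-client-4.14.1-HBase-1.2.jar",
        "--",
        "<Source Zookeeper>:2181",
        "<Destination Zookeeper>:2181",
        chunk_file,
    ]
    # Each gcloud submission runs as its own local process, so the jobs execute concurrently.
    processes.append(subprocess.Popen(cmd))

# Wait for every submission to finish.
for p in processes:
    p.wait()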

Step 4: Delete Dataproc Cluster

Assuming everything was successful, you no longer need the migration cluster. Delete it to avoid unnecessary billing.

gcloud dataproc clusters delete phoenix-data-migration-cluster \
--region <region>

Results & Conclusion

Using this method, we migrated roughly 70 tables containing 40GB of data in less than an hour. Not bad.

In summary, the Phoenix Spark plugin provides a simple way to handle migration use cases like this one. Coupled with how easily cloud environments let you provision Spark clusters, a Phoenix data migration can be a relatively painless undertaking.
