Zero Downtime Migration from HBase to Bigtable

Axatha Jalimarada · Box Tech Blog · Sep 24, 2021
Illustrated by Navied Mahdavian / Art directed by Sarah Kislak

Box has traditionally utilized compute services hosted in its own data centers while relying on cloud services mostly for geographically diverse storage. However, this is rapidly changing as Box migrates more of its core infrastructure to the cloud. One major aspect of this migration is the move from on-premises HBase to Google Cloud Bigtable (CBT) for several NoSQL use cases. We decided to go with Bigtable because it is a fully managed, cloud-based NoSQL database that also has an HBase-compatible client, so minimal application changes are required. We recently completed one of our first major migrations and wanted to share some of the lessons we learned along the way of that zero-downtime migration.

Every customer file uploaded to Box is encrypted using a unique data encryption key. The creation, retrieval, and storage of this key metadata are managed by an internal micro-service called the Key Encryption Decryption Service (KEDS). This service stores all of its data in a NoSQL table hosted on multiple HBase clusters in Box data centers. At the start of the migration there were approximately 200 billion keys in that data set, for a total table size in HBase of approximately 80TB. These fully replicated clusters are used in an active-active-active configuration and located in diverse geographic regions. Since the data being migrated is used in every operation that accesses storage content in Box, this metadata sees approximately 80,000 reads/sec and 20,000 writes/sec. All of this results in several project requirements that we could not compromise on.

  • First, the migration had to involve zero downtime of the metadata store or KEDS.
  • Second, neither the migration process nor the final state could dramatically change the service’s latency.
  • Finally, an incorrect or corrupted key is a failure and thus we require a strong guarantee of data integrity.

Cloud Bigtable Performance Validation

There is a big difference between a product specification and successfully using that product as part of a mission-critical application. The first step in our journey away from HBase was to validate that Cloud Bigtable would indeed work for our use cases and that our initial project assumptions were correct.

Let’s first quickly go over a few Bigtable configuration options. A Bigtable instance is a collection of one or more Bigtable clusters, with each cluster consisting of one or more nodes. Clusters can exist in independent regions and support automatic replication between them.

The routing of requests to the clusters within an instance is controlled by an “App Profile”. An app profile can be of two types:

  • Single-cluster routing: routes all requests to a single cluster in the instance. If that cluster becomes unavailable, you must manually fail over to another cluster in the instance.
  • Multi-cluster routing: automatically routes requests to the nearest cluster in the instance. If that cluster becomes unavailable, traffic automatically fails over to the nearest available cluster.
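
For reference, here is a minimal sketch of how the two routing modes can be created with the java-bigtable admin client; the project, instance, cluster, and profile IDs are illustrative placeholders rather than our actual configuration.

import com.google.cloud.bigtable.admin.v2.BigtableInstanceAdminClient;
import com.google.cloud.bigtable.admin.v2.models.AppProfile.MultiClusterRoutingPolicy;
import com.google.cloud.bigtable.admin.v2.models.AppProfile.SingleClusterRoutingPolicy;
import com.google.cloud.bigtable.admin.v2.models.CreateAppProfileRequest;

public class AppProfileSetup {
  public static void main(String[] args) throws Exception {
    try (BigtableInstanceAdminClient admin = BigtableInstanceAdminClient.create("my-project")) {
      // Single-cluster routing: pins traffic to one cluster and allows
      // single-row transactions such as check-and-put; failover is manual.
      admin.createAppProfile(
          CreateAppProfileRequest.of("keds-instance", "single-cluster-profile")
              .setRoutingPolicy(SingleClusterRoutingPolicy.of("cluster-us-east", true))
              .setDescription("Pinned routing; transactional writes allowed"));

      // Multi-cluster routing: automatic failover to the nearest available
      // cluster, but single-row transactions are not supported.
      admin.createAppProfile(
          CreateAppProfileRequest.of("keds-instance", "multi-cluster-profile")
              .setRoutingPolicy(MultiClusterRoutingPolicy.of())
              .setDescription("Automatic failover routing"));
    }
  }
}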

Table 1: A comparison of advertised per-node performance during our initial Bigtable validation.

An important choice to make when creating a Bigtable instance is whether to use hard-disk drives (HDD) or solid-state drives (SSD). As expected, an SSD-based instance offers higher performance, but with a commensurate increase in cost. In order to determine which storage type to use, we ran a synthetic workload based on our production system. What we quickly discovered was that an HDD instance could not meet the latency requirements of our interactive workload. However, since our dataset was relatively small, we were able to use an SSD instance while still staying within our target budget.

Key Learnings:

  • It is important to evaluate the HDD and SSD storage options against your workload's latency requirements, dataset size, and budget constraints.
  • Our synthetic workload was critical in allowing us to observe the performance we could realistically achieve at steady state, since each workload's unique mix of reads and writes and their interactions will impact the cluster's performance. For example, we were able to achieve reliable 70–80ms latencies for both writes and reads with SSDs, while we struggled to keep random reads below 2 seconds with HDDs.

Figure 1: A summary of our results when testing HDD-based Bigtable instances against SSD-based instances using our synthetic workload.

Proof of Concept

With initial validation complete, we proceeded with SSD-based clusters, since HDD clusters were not viable for our workload. The next step beyond our synthetic workload was to exercise and observe Bigtable using our application with a real workload in our development environment. We set up a Bigtable instance with two clusters and implemented dual-writes between HBase and Bigtable; the dual-writes were later validated against each other to ensure correctness. We also took a point-in-time snapshot of a small HBase cluster (15GB) in our development environment and imported it into the development Bigtable instance. In the course of validating our application against this development instance we made a number of interesting discoveries.

First, not all Bigtable features are supported in all of the multi-cluster configurations. For example, Bigtable does not support check-and-put operations with multi-cluster routing, which provides automatic failover. Since we already supported failover at the application level this was acceptable to us, but initial testing with the actual service and workload is critical in validating your end-to-end configuration.
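
To make the operation concrete, here is a minimal sketch of a conditional write through the HBase-compatible client; the column family and qualifier names are placeholders, not our actual schema.

import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public final class CheckAndPutExample {
  // Conditional write: only store the new key metadata if the cell is still
  // empty. checkAndPut is a single-row transaction, which is why Bigtable
  // requires single-cluster routing for it.
  public static boolean putIfAbsent(Connection conn, byte[] rowKey, byte[] value) throws Exception {
    try (Table table = conn.getTable(TableName.valueOf("keds_table_name"))) {
      Put put = new Put(rowKey).addColumn(Bytes.toBytes("cf"), Bytes.toBytes("key"), value);
      // A null expected value means "apply the put only if the cell does not exist yet".
      return table.checkAndPut(rowKey, Bytes.toBytes("cf"), Bytes.toBytes("key"), null, put);
    }
  }
}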

Second, we initially (and perhaps naively) assumed the dataset’s size would be approximately the same in HBase and Bigtable. However, much to our surprise, when initially imported, the dataset was approximately 50% larger in Bigtable, potentially due to additional versions of the rows that are copied during import. While Bigtable and HBase both support versioning, Bigtable has no practical limit on maxVersions. So, we set maxVersions to a reasonable value to avoid accumulating a lot of unwanted versions. A week after import, there was still a 20% increase in size compared to HBase. We have thus budgeted for 50% extra capacity during migration and 20% extra capacity for the long term.
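
As an illustration, a version cap can be expressed as a Bigtable garbage-collection rule; this is a minimal sketch using the java-bigtable admin client with placeholder project, instance, and family names (the same effect can also be achieved by setting maxVersions through the HBase-compatible admin API).

import com.google.cloud.bigtable.admin.v2.BigtableTableAdminClient;
import com.google.cloud.bigtable.admin.v2.models.GCRules;
import com.google.cloud.bigtable.admin.v2.models.ModifyColumnFamiliesRequest;

public class VersionGcPolicy {
  public static void main(String[] args) throws Exception {
    try (BigtableTableAdminClient admin =
        BigtableTableAdminClient.create("my-project", "keds-instance")) {
      // Keep at most one version per cell (an illustrative choice); older
      // versions become eligible for garbage collection instead of
      // accumulating after the import.
      admin.modifyFamilies(
          ModifyColumnFamiliesRequest.of("keds_table_name")
              .updateFamily("cf", GCRules.GCRULES.maxVersions(1)));
    }
  }
}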

Third, when importing an HBase snapshot into Bigtable, you run the risk of reviving rows that were deleted in Bigtable after the HBase snapshot was taken. This can be addressed in two primary ways: you can capture and replay the deletes in Bigtable based on timestamps after the migration, or you can temporarily disable deletes during the migration if your use case supports it. We were able to do the latter, which greatly simplified things.

Finally, we ran into protobuf errors due to incompatibilities between the Cloudera HBase client and Bigtable's HBase-compatible client. To solve this, we had to exclude the Apache HBase and Hadoop packages being pulled in by the Bigtable client. This advance look at the library conflicts we would have run into during the production migration helped us avoid a number of unwelcome surprises.

Migration Plan

With performance validated and initial testing of our application and all of its libraries behind us, it was time to develop a plan to migrate our production data set.

Our plan was essentially a phased approach in which we begin with dual operations to HBase and Bigtable, with the Bigtable operations asynchronous; synchronous operations must succeed, while asynchronous operations are allowed to fail. This allowed us to canary Bigtable in production and at scale. As we built confidence in Bigtable, we transitioned from asynchronous access, to synchronous access, to complete reliance on Bigtable.

Figure 2: An overview of our migration plan showing the transitions between synchronous and asynchronous dual writes.

  1. We start by doing asynchronous dual writes/reads to Bigtable. These Bigtable writes/reads are best-effort, done after every operation to HBase.
  2. Next, we switch to doing synchronous dual writes/reads to maintain data consistency between HBase and Bigtable. At this point, we rely on Bigtable critically for service availability as these operations must succeed.
  3. We then take a snapshot of the table in HBase and import it to Bigtable to backfill old data. This must be done after the synchronous operations are enabled in step 2 in order to ensure that all records exist in both HBase and Bigtable.
  4. Then we perform various validation steps to ensure that data between HBase and Bigtable is actually consistent.
  5. Next, we start using Bigtable as the source of truth while still continuing to write to both HBase and Bigtable. This allows us to fallback to HBase if any issues occur with Bigtable.
  6. After safely running in the previous state for several weeks, we cut ties with HBase.

Dual Writes

With the above plan in place, we began by implementing asynchronous, best-effort reads and writes from our application to Bigtable. We then rolled this out to production, where our Bigtable instance consists of two SSD-based clusters. This step was critical, as it allowed us to monitor the Bigtable metrics and stabilize the clusters without impacting our clients. This phase proved invaluable, as we encountered two interesting scenarios.
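
As a rough illustration of this phase, here is a minimal sketch of the best-effort dual-write path, assuming two HBase-API Table handles (one backed by HBase, one by the Bigtable HBase-compatible client) and a simple thread pool; it is a sketch of the idea, not the actual KEDS implementation.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;

public final class DualWriter {
  private final Table hbaseTable;     // source of truth during this phase
  private final Table bigtableTable;  // best-effort shadow copy
  private final ExecutorService asyncPool = Executors.newFixedThreadPool(8);

  public DualWriter(Table hbaseTable, Table bigtableTable) {
    this.hbaseTable = hbaseTable;
    this.bigtableTable = bigtableTable;
  }

  public void write(Put put) throws Exception {
    // Synchronous write: a failure here fails the client request.
    hbaseTable.put(put);

    // Asynchronous, best-effort write to Bigtable: failures are only logged
    // (and counted in metrics) so they never impact service availability.
    asyncPool.submit(() -> {
      try {
        bigtableTable.put(put);
      } catch (Exception e) {
        System.err.println("Best-effort Bigtable write failed: " + e);
      }
    });
  }
}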

  1. DEADLINE_EXCEEDED Errors:

Soon after deploying the asynchronous dual writes/reads, we started seeing sporadic spikes in DEADLINE_EXCEEDED errors from Bigtable. These would cause an error rate of 0.4–1% for the KEDS service. They would occur every few hours or a few times a day and last anywhere from a few seconds to a few minutes. Although our clients did retry their requests, these sudden spikes in failures were not acceptable to us. This observation kicked off an in-depth investigation between Box and Google engineers spanning several weeks and involving teams as diverse as storage, networking, and edge within both companies.

Figure 3: Example of a spike in DEADLINE_EXCEEDED errors. Note that there is no obvious build-up before the spike in errors.

Root Cause and Remediations:

  • The root cause was found to be a network connection bug in the Google Front End (GFE) proxy, the layer that fronts all Google Cloud services. GFE nodes would periodically be in the process of shutting down (as part of a normal weekly restart or due to failures). In such a scenario, the KEDS service, with open connections to these dying GFE nodes, would never receive a GOAWAY signal to indicate that the connection should no longer be used. As a result, our traffic on those connections was dropped and timed out. The GFE team fixed this by sending the proper shutdown signal.
  • In order to improve our resilience in the face of such errors, we added logic to automatically fall back to our second Bigtable cluster (sketched below). This also helped improve the overall error rate.
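
A minimal sketch of that fallback idea, assuming one HBase-API Table handle per Bigtable cluster (each behind its own single-cluster app profile); the actual KEDS failover logic is more involved.

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;

public final class FailoverReader {
  private final Table primaryCluster;    // preferred (usually nearest) cluster
  private final Table secondaryCluster;  // fallback cluster in another region

  public FailoverReader(Table primaryCluster, Table secondaryCluster) {
    this.primaryCluster = primaryCluster;
    this.secondaryCluster = secondaryCluster;
  }

  public Result read(Get get) throws Exception {
    try {
      return primaryCluster.get(get);
    } catch (Exception e) {
      // On errors such as DEADLINE_EXCEEDED, retry once against the other
      // cluster instead of surfacing the failure to the caller.
      return secondaryCluster.get(get);
    }
  }
}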

2. CPU Spikes:

Most of the traffic to the KEDS service comes from uploads to and downloads from Box, but there are occasional ad hoc workloads that can lead to unexpected read/write traffic to Bigtable. Our Bigtable clusters were provisioned to handle more than 4X this traffic based on the documented throughput of SSD-based nodes (up to 10,000 reads/sec or writes/sec per node). However, we noticed CPU spikes in both the average cluster CPU and the hottest node, beyond the recommended 50% threshold. While we were within the documented limits for general operations, we speculate that our check-and-put operations were more expensive than we anticipated. This experience suggests that a cluster's performance may vary widely based on your specific workloads and operation types.

Figure 4: Illustration of spikes seen in average cluster CPU and hottest node CPU on the primary Bigtable cluster during an ad hoc workload on the service.

Remediations:

In order to insulate our most critical requests while still supporting ad hoc requests in a graceful manner, we created an additional fleet of KEDS instances and designated the two as front and back:

  • The “front” fleet serves the main upload and download traffic. These are our most critical requests with the tightest latency requirements. By default this fleet favors the Bigtable cluster with the best performance. This is generally the cluster that is geographically closest to our main application.
  • The “back” fleet serves ad hoc and async traffic that does not have strict latency requirements. This fleet is configured to favor the other Bigtable cluster.

This approach helped split the CPU load between the two Bigtable clusters. It also allowed us to independently scale up the back Bigtable cluster whenever a large ad hoc workload needed to be serviced.
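
To make this concrete, here is a hedged sketch of how each fleet might pin its Bigtable connection to a different single-cluster app profile via the HBase-compatible client; the project, instance, and profile IDs are illustrative, not our exact configuration.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Connection;
import com.google.cloud.bigtable.hbase.BigtableConfiguration;
import com.google.cloud.bigtable.hbase.BigtableOptionsFactory;

public final class FleetConnections {
  // Each fleet connects with a different app profile, so its traffic is
  // routed to (and accounted against) a different Bigtable cluster.
  public static Connection connect(String appProfileId) {
    Configuration config = BigtableConfiguration.configure("my-project", "keds-instance");
    config.set(BigtableOptionsFactory.APP_PROFILE_ID_KEY, appProfileId);
    return BigtableConfiguration.connect(config);
  }

  public static void main(String[] args) throws Exception {
    try (Connection front = connect("front-profile");   // upload/download traffic
         Connection back = connect("back-profile")) {   // ad hoc and async traffic
      // ... hand the connections to the respective request handlers ...
    }
  }
}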

Figure 5: Overview of the two pools of services, the workloads that each handles and their relation to the replicated BigTable clusters.

Exporting and Importing the Snapshots

After we addressed the above issues and stabilized the Bigtable metrics, we deployed synchronous dual-writes. In this step, every write request to the KEDS service must be written to both HBase and Bigtable before a success is returned to the client. This ensures that HBase and Bigtable are consistent from the point at which the synchronous dual-writes are deployed.

The next step is to backfill the old data in Bigtable from HBase through the use of HBase snapshots.

Figure 6: Overview of snapshot export process from HBase cluster and into Bigtable via Google Cloud Storage (GCS) bucket and Dataproc job

The snapshot process involved three main steps: producing the snapshot from HBase, exporting the snapshot to Google Cloud, and importing the snapshot into Bigtable.

  • The first step, obtaining the snapshot, is relatively straight-forward; it uses the standard HBase snapshot command. For us this was:
snapshot 'keds_table_name', 'keds_snapshot'
  • The second step involved exporting the snapshot to a Google Cloud Storage (GCS) bucket using the HBase ExportSnapshot job. The ExportSnapshot job requires the destination to be a file system; in order to treat the GCS bucket as a file system, we needed the GCS Connector, an open-source Java library. We downloaded the connector jar and copied it to all of the data nodes on the HBase cluster. We then created a Google Cloud service account with permission to write objects to our GCS bucket, downloaded its credentials as a JSON file, and copied that file to all of the data nodes. Finally, we added the following properties to the core-site.xml file on all of the HBase data nodes.
<property>
  <name>fs.AbstractFileSystem.gs.impl</name>
  <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
</property>
<property>
  <name>fs.gs.project.id</name>
  <value>google_cloud_project_id</value>
</property>
<property>
  <name>fs.gs.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>google.cloud.auth.service.account.json.keyfile</name>
  <value>path_to_service_account_credentials_json_file</value>
</property>

With all of that complete, the actual export snapshot command is as follows:

hbase org.apache.hadoop.hbase.snapshot.ExportSnapshot -libjars /usr/lib/hadoop/lib/gcs-connector-hadoop2-2.1.6-shaded.jar -snapshot keds_snapshot -copy-to gs://bucket_name/hbase_snapshot_home
  • The third and final step was importing the snapshot into the Bigtable instance by running a MapReduce import job, adapted from a publicly available example, in Google Cloud Dataproc. During this job, we scaled up our Bigtable instance to roughly one Bigtable node per mapper task. This ensured that the additional write traffic did not disrupt the incoming customer traffic from the synchronous dual-writes.

After the import, when we scaled down the Bigtable cluster, we noticed higher latencies for a few minutes, potentially due to data being rebalanced between the nodes. We found that scaling down at a rate of 50 nodes every 30 minutes was gradual enough to avoid any large latency spikes.
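
As an illustration of that gradual scale-down, here is a minimal sketch using the java-bigtable instance admin client; the project, instance, and cluster IDs and the node counts are placeholders, and only the 50-nodes-per-30-minutes pacing comes from our experience above.

import java.util.concurrent.TimeUnit;
import com.google.cloud.bigtable.admin.v2.BigtableInstanceAdminClient;

public class GradualScaleDown {
  public static void main(String[] args) throws Exception {
    try (BigtableInstanceAdminClient admin = BigtableInstanceAdminClient.create("my-project")) {
      int currentNodes = 300;  // hypothetical node count used during the import
      int targetNodes = 50;    // hypothetical steady-state node count

      // Shrink by 50 nodes every 30 minutes so that rebalancing between the
      // remaining nodes does not cause large latency spikes.
      while (currentNodes > targetNodes) {
        currentNodes = Math.max(targetNodes, currentNodes - 50);
        admin.resizeCluster("keds-instance", "cluster-us-east", currentNodes);
        TimeUnit.MINUTES.sleep(30);
      }
    }
  }
}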

Validation

Once the snapshot imports were complete, the next step was to validate that the data in Bigtable and HBase was consistent. Specifically, we wanted to validate both the new, incoming traffic and the older data.

Validating Ongoing Traffic:

We added logic to asynchronously compare the result of every read request from HBase with the result from Bigtable. For any mismatches found, we categorized them into critical and non-critical types and emitted metrics and logs. This let us track data mismatches for all new read traffic on an ongoing basis.
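
A minimal sketch of this comparison idea, assuming HBase-API Result objects from the two stores and using stderr as a stand-in for our real metrics and logging pipeline; the actual categorization logic is more detailed.

import java.util.concurrent.CompletableFuture;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public final class ReadValidator {
  // Asynchronously compare the value served from HBase with the value read
  // from Bigtable for the same row, and flag any mismatch.
  public static void compareAsync(Result hbaseResult, Result bigtableResult,
                                  byte[] family, byte[] qualifier) {
    CompletableFuture.runAsync(() -> {
      byte[] hbaseValue = hbaseResult.getValue(family, qualifier);
      byte[] bigtableValue = bigtableResult.getValue(family, qualifier);
      if (!Bytes.equals(hbaseValue, bigtableValue)) {
        // Stand-in for emitting a mismatch metric and a categorized log entry.
        System.err.println("Mismatch for row " + Bytes.toStringBinary(hbaseResult.getRow()));
      }
    });
  }
}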

Validating Old Data:

In order to sample and test older data, we made use of an internal service which iterates over all the encryption keys and calls the KEDS service read APIs. This again triggers the async comparison logic mentioned above to log any mismatches.

From these validation steps, we found that most of the mismatches were expected or explainable and gave us sufficient confidence in the consistency of data between HBase and Bigtable.

Committing to Bigtable!

The final step was to flip to using Bigtable as the source of truth while still continuing to write to both HBase and Bigtable. We let the system bake in this state for a couple of weeks before finally disabling all writes to HBase.

We have been running solely on Bigtable for the past few months and we have not encountered any issues so far. :)

Parting Thoughts and Key Lessons

Performance testing is worth it

  • Every workload is different, and the throughput numbers advertised by any cloud provider are highly variable based on the workload type. So it was very useful to find this out early on and plan for the appropriate Bigtable storage type to achieve our performance SLAs.

Check-and-put affects everything

  • Having check-and-put operations meant that we couldn’t use the multi-cluster routing and had to implement our own failover logic between clusters. Additionally, check-and-put operations potentially reduce the throughput you can derive from the clusters.

Metrics are amazing!

  • Having lots of fine-grained metrics to track everything in HBase and Bigtable helped us compare the two systems when we were doing dual-reads and writes.
  • We also imported the Google Cloud Monitoring metrics into our dashboard to have a side-by-side comparison of our client side metrics and Bigtable server side metrics. This helps isolate issues between our stack and Bigtable Service.

Good debug tools are indispensable

  • Through this journey, we discovered a few helpful GCP tools while debugging some of the issues described.
  • Stackdriver Tracing, an instrumentation feature for Google Cloud Services, helped us track how the Bigtable requests propagated through our application and gave detailed latency insights at every step of the request in Google Cloud Console.
  • Key Visualizer, a diagnostic feature available on Bigtable Console, helped isolate hotspots in our key space.

Special thanks to Mark Storer, Senior Software Development Manager, Box, for being immensely helpful with editing this blog!

Interested in learning more about Box? Check out our careers page!
