Bigtable backup for disaster recovery

Maciej Sieczka
Egnyte engineering
Jan 30, 2020

At Egnyte, we use Google Cloud services a lot. Having these relatively easy-to-grasp, easy-to-integrate, reliable, highly scalable tools at hand has definitely boosted the company’s productivity.

However, as is usually the case with ready-made products, you need to be prepared to adapt and complement them. Here’s a short story about one such case we encountered.

Google Cloud Bigtable backup

In Protect, the project I do the operations for at Egnyte, we rely heavily on Bigtable to store file content classification results. Protect is Egnyte’s data governance solution, which enables users to keep their business content secure and compliant with the latest industry standards. Content classification’s main job is to scan and classify customers’ cloud and on-premises file repositories to identify private and sensitive information, such as postal addresses, dates of birth, and financial and health-related data. It can use pre-defined criteria and custom keywords, as well as templates matching customers’ enforcement jurisdiction for industry-specific regulations (such as PCI DSS, HIPAA, GLBA, and GDPR). Please visit the Protect website for more details.

By its nature, Bigtable is highly resilient. In Google’s own words: “Google uses proprietary storage methods to achieve data durability above and beyond what’s provided by standard HDFS three-way replication. In addition, we create backups of your data to protect against catastrophic events and provide for disaster recovery.”

But better safe than sorry. We can’t take chances with our customers’ data. An automated backup, under our control at all times and preferably external to Bigtable itself, became a must when we started using Bigtable in Protect for mission-critical data several months ago. Since then, our production Bigtable instance has grown 10 times in size. I’m pretty sure I would have slept much worse this year if we hadn’t had the backup in place…

Unfortunately, Google didn’t have a backup solution for Bigtable in their offering when we needed it. There was only an alpha-grade snapshot functionality in cbt, which Google used to enable individually for selected customers on demand. We approached the Google support team only to learn they had already stopped offering it. Besides, a snapshot only stores metadata so that you can go back in time; it’s by no means a backup for disaster recovery.

We considered Bigtable replication as a temporary mitigation for our backup needs, but:

  • It’s not really a backup either. We need to be able to recover historical data in case the working copy gets corrupted, while replication would immediately propagate any errors.
  • It’s expensive. You pay twice what you pay for the original data, plus the network traffic between the replicas.
  • It’s just another Bigtable cluster. It would be located in a different region, which reduces the chance of a disaster, but we also wanted to protect the Protect against a total Bigtable outage or an accidental cluster deletion.

Consequently, we had to come up with something on our own. The most feasible option proved to be dumping the tables’ contents into a GCS bucket as a series of Hadoop sequence files, using the Java Bigtable HBase client, as per Google’s instructions at the time.

When I started working on this, Google hadn’t yet provided (or at least hadn’t documented) their Dataflow template for Bigtable-to-GCS export (which uses the same Java Bigtable HBase library under the hood). Even if we had wanted to take that approach, we would still have had to implement iterating over the input tables, as well as storing their names and column families, to be able to use them in the restore process. The reason is that either the Hadoop sequence files the Java Bigtable HBase client writes to GCS don’t contain that metadata, or the Java Bigtable HBase client doesn’t use it during import. One way or the other, a destination table with the correct column families has to exist in Bigtable in advance for the table import to succeed, at least as of jar version 1.13.0 and earlier.

So in the end, not being a Java speaker myself, I wrapped the executable shaded jar version of the Java Bigtable HBase library in Python, with the help of Google’s Bigtable and Storage Python libraries, to iterate over the tables and save their names and column family names in GCS, along with the actual table contents. The script was deployed as a Kubernetes cronjob. The end result is a set of periodic Dataflow jobs that export the tables and their metadata into a GCS bucket.
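To make the approach more concrete, here is a minimal sketch of that export flow, assuming the google-cloud-bigtable and google-cloud-storage Python packages. It is not the actual Egnyte script: the project, instance, bucket, and jar names are placeholders, the success-marker blob name is purely illustrative, and the jar’s export flags follow Google’s instructions from that period, so they may differ for other jar versions.

# export_sketch.py - a minimal sketch of the export flow described above;
# not the actual Egnyte script. All IDs, paths and the marker blob name are
# placeholders.
import datetime
import json
import subprocess

from google.cloud import bigtable, storage

GCP_PROJECT = "my-gcp-project"                            # placeholder
BIGTABLE_INSTANCE_ID = "my-bigtable-instance"             # placeholder
BUCKET_NAME = "my-bigtable-backups"                       # placeholder
BEAM_JAR_PATH = "bigtable-beam-import-1.13.0-shaded.jar"  # placeholder

# Name the output directory after the export start time, `YYYY-mm-dd-HH-MM-SS`.
backup_dir = datetime.datetime.utcnow().strftime("%Y-%m-%d-%H-%M-%S")

bt_client = bigtable.Client(project=GCP_PROJECT, admin=True)
instance = bt_client.instance(BIGTABLE_INSTANCE_ID)
bucket = storage.Client(project=GCP_PROJECT).bucket(BUCKET_NAME)

for table in instance.list_tables():
    # Save the table's column family names next to the dump, so the restore
    # script can re-create the table before importing the sequence files.
    families = sorted(table.list_column_families().keys())
    blob = bucket.blob("{}/{}/column_families.json".format(backup_dir, table.table_id))
    blob.upload_from_string(json.dumps(families))

    # Launch a Dataflow export job via the shaded Bigtable HBase jar. The flags
    # mirror Google's export instructions of that time; adjust to your jar version.
    subprocess.check_call([
        "java", "-jar", BEAM_JAR_PATH, "export",
        "--runner=dataflow",
        "--project=" + GCP_PROJECT,
        "--bigtableInstanceId=" + BIGTABLE_INSTANCE_ID,
        "--bigtableTableId=" + table.table_id,
        "--destinationPath=gs://{}/{}/{}".format(BUCKET_NAME, backup_dir, table.table_id),
        "--tempLocation=gs://{}/dataflow-temp".format(BUCKET_NAME),
    ])

# Drop a marker blob so the import script can tell the backup completed
# successfully (the marker's name here is purely illustrative).
bucket.blob("{}/_SUCCESS".format(backup_dir)).upload_from_string("")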

To illustrate the backup script’s functionality, here’s its command-line help:

./bigtable_export.py --help
usage: bigtable_export.py [-h] --beam_jar_path BEAM_JAR_PATH
                          --gcp_project GCP_PROJECT
                          --bigtable_instance_id BIGTABLE_INSTANCE_ID
                          --bigtable_cluster_id BIGTABLE_CLUSTER_ID
                          --bucket_name BUCKET_NAME
                          [--table_id_prefix TABLE_ID_PREFIX]

Dump all tables in a given Bigtable instance to a GCS bucket, as a series of
Hadoop sequence files.

optional arguments:
  -h, --help            show this help message and exit
  --table_id_prefix TABLE_ID_PREFIX
                        Backup only the tables with this prefix in their ID.

required arguments:
  --beam_jar_path BEAM_JAR_PATH
                        Path to the Bigtable HBase client jar file.
  --gcp_project GCP_PROJECT
                        ID of the Bigtable instance parent GCP project.
  --bigtable_instance_id BIGTABLE_INSTANCE_ID
                        ID of the Bigtable instance.
  --bigtable_cluster_id BIGTABLE_CLUSTER_ID
                        ID of the cluster in the Bigtable instance.
  --bucket_name BUCKET_NAME
                        GCS bucket name to dump the Bigtable tables into. The
                        output directory is named after the export start time,
                        in `YYYY-mm-dd-HH-MM-SS` format. Input tables are
                        saved as series of Hadoop sequence files in its
                        subdirectories named after the table names.
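For reference, a typical invocation, with placeholder jar path, project, instance, cluster, and bucket names, would look something like this:

./bigtable_export.py \
  --beam_jar_path /opt/bigtable-beam-import-1.13.0-shaded.jar \
  --gcp_project my-gcp-project \
  --bigtable_instance_id my-bigtable-instance \
  --bigtable_cluster_id my-bigtable-cluster \
  --bucket_name my-bigtable-backups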

The restore procedure relies on the same Bigtable HBase shaded jar and Google Python libraries to read the Hadoop sequence files and their metadata back from GCS, and to apply them to a destination Bigtable cluster in a series of Dataflow jobs. The import script’s functionality in detail is as follows:

./bigtable_import.py --help
usage: bigtable_import.py [-h] --beam_jar_path BEAM_JAR_PATH
                          --gcp_project GCP_PROJECT
                          --bigtable_instance_id BIGTABLE_INSTANCE_ID
                          --bigtable_cluster_id BIGTABLE_CLUSTER_ID
                          --bucket_name BUCKET_NAME
                          --backup_gcs_dir BACKUP_GCS_DIR [--force]

Restore series of Hadoop sequence files in a GCS bucket as Bigtable tables. If
a destination table already exists, it's skipped.

optional arguments:
  -h, --help            show this help message and exit
  --force               Proceed with import even if there's no successful
                        backup marker blob at the GCS location indicated by
                        `--backup_gcs_dir`.

required arguments:
  --beam_jar_path BEAM_JAR_PATH
                        Path to the Bigtable HBase client jar file.
  --gcp_project GCP_PROJECT
                        ID of the Bigtable instance parent GCP project.
  --bigtable_instance_id BIGTABLE_INSTANCE_ID
                        ID of the Bigtable instance.
  --bigtable_cluster_id BIGTABLE_CLUSTER_ID
                        ID of the cluster in the Bigtable instance.
  --bucket_name BUCKET_NAME
                        GCS bucket name to fetch the Bigtable dumps from.
  --backup_gcs_dir BACKUP_GCS_DIR
                        Bucket directory with subdirectories containing the
                        Hadoop sequence files to be imported.
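Again purely as an illustration, and not the actual Egnyte script, a minimal restore sketch could look like the following. It assumes the layout produced by the export sketch above (a per-table subdirectory with a column_families.json blob next to the Hadoop sequence files); the IDs and paths are placeholders, the jar’s import flags follow Google’s instructions of that time, and for brevity it skips checking the success-marker blob that the real script’s --force flag refers to.

# import_sketch.py - a minimal sketch of the restore flow; not the actual
# Egnyte script. Assumes the layout written by the export sketch above.
import json
import subprocess

from google.cloud import bigtable, storage

GCP_PROJECT = "my-gcp-project"                            # placeholder
BIGTABLE_INSTANCE_ID = "my-bigtable-instance"             # placeholder
BUCKET_NAME = "my-bigtable-backups"                       # placeholder
BACKUP_GCS_DIR = "2020-01-30-01-00-00"                    # placeholder
BEAM_JAR_PATH = "bigtable-beam-import-1.13.0-shaded.jar"  # placeholder

bt_client = bigtable.Client(project=GCP_PROJECT, admin=True)
instance = bt_client.instance(BIGTABLE_INSTANCE_ID)
bucket = storage.Client(project=GCP_PROJECT).bucket(BUCKET_NAME)
existing = {t.table_id for t in instance.list_tables()}

# Each per-table column_families.json blob written at backup time marks one
# table dump to restore.
for blob in bucket.list_blobs(prefix=BACKUP_GCS_DIR + "/"):
    if not blob.name.endswith("/column_families.json"):
        continue
    table_id = blob.name.split("/")[-2]
    if table_id in existing:
        continue  # never overwrite a table that already exists

    # The import can only write into column families that already exist, so
    # re-create the destination table with the families recorded at backup time.
    table = instance.table(table_id)
    table.create()
    for family in json.loads(blob.download_as_string()):
        table.column_family(family).create()

    # Launch a Dataflow import job via the shaded Bigtable HBase jar. The flags
    # mirror Google's import instructions of that time; adjust to your jar version.
    subprocess.check_call([
        "java", "-jar", BEAM_JAR_PATH, "import",
        "--runner=dataflow",
        "--project=" + GCP_PROJECT,
        "--bigtableInstanceId=" + BIGTABLE_INSTANCE_ID,
        "--bigtableTableId=" + table_id,
        "--sourcePattern=gs://{}/{}/{}/part-*".format(BUCKET_NAME, BACKUP_GCS_DIR, table_id),
        "--tempLocation=gs://{}/dataflow-temp".format(BUCKET_NAME),
    ])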

The script doesn’t overwrite tables that already exist on the destination Bigtable cluster. If that’s required, they have to be removed from there first. Again, better safe than sorry ;).

The retention policy for the backup data is applied using the bucket’s lifecycle management feature. We store the most recent table dumps (i.e., those most likely to ever be recovered) in the Regional storage class and move them into the Coldline class after some time, to reduce cost while still being able to recover or inspect the old data if needed. The respective Terraform code looks something like this:

lifecycle_rule {
  action {
    type          = "SetStorageClass"
    storage_class = "COLDLINE"
  }

  condition {
    age        = <no of days>
    with_state = "ANY"
  }
}

lifecycle_rule {
  action {
    type = "Delete"
  }

  condition {
    age                   = <no of days>
    with_state            = "ANY"
    matches_storage_class = ["COLDLINE"]
  }
}

That’s it, in a nutshell. If you have any comments or would like to know more, please leave your feedback below, and we’ll be happy to dive deeper as needed.

We’ll see whether this tooling remains useful as a second line of defense against Bigtable disasters, user errors, and application bugs once a managed service for Bigtable backups becomes available. In any case, it’s been doing its job very well for the past several months. And I sleep at least as well :).

UPDATE: On Thu, Feb 20, I published the scripts on Egnyte’s GitHub, under the MIT license.
