How to update row keys in Google Big Table

Philipp Schmiedel
5 min read · Feb 5, 2019

Life can only be understood backwards, but it must be lived forwards.

Choosing a correct row key format for Google Cloud Big Table is one of the most important decisions you have to make when using this technology. Not only do you need to understand what the data you want to store in Big Table looks like, you should also already have an idea about which queries you will need later to solve your particular business cases. And if that’s not complicated enough already, common best practices like using a random hash as a prefix for your row key might not be the best idea anymore in 2019:

In the past, this [GCP documentation] page recommended using a hash of the string identifier, rather than the actual string identifier. We no longer recommend using a hash.

Ok, no problem, right? If it turns out your row key was not chosen wisely, let’s just change it… until you realize some choices in life are irreversible, like row keys in Big Table.

If we can’t update, migrate

So if we can’t update the row key, and just inserting all upcoming new items with a changed row key format doesn’t help either (row keys are lexically ordered, so the old and new formats would simply end up interleaved in the same table), the only solution left is to create a new table in our cluster. In this new table, we can start to insert new data using our changed row key design. Once all processes writing into the legacy table have been stopped, we can continue with migrating the old data into the new table via the following steps:

  1. Export the Big Table content into Cloud Storage
  2. Transform the row keys into the new format
  3. Import the adjusted data into the new Big Table

As your Big Table probably contains a huge amount of data, this is a job for Google’s Dataflow service. Luckily, Dataflow already comes with built-in templates for exporting and importing Google Big Table data to and from Cloud Storage in a compact binary format called Avro.

Dataflow Bigtable to Avro Files Template

All you have to do is create a new Dataflow job, select the template, fill in all needed parameters, and start the execution. Note that Google recommends using 2–3 Dataflow cores per node in your Big Table cluster.

Data transformation using custom templates in Dataflow

What is left is the data transformation process itself. This is the point where you get in touch with creating your own Dataflow template and with the Apache Beam framework. If you have never worked with Apache Beam before, the documentation can be overwhelming. I hope the following snippets will save you some time creating your own transformation logic. Let’s start coding…

Note: I used the Apache Beam Python SDK just for convenience reasons; you might get better results using the Java SDK. Feel free to post a Java version in the comments section.

A custom template basically consists of two files:

  • A metadata file describing the template and its parameters; the filename MUST end with _metadata (without a file extension!)
  • The template code itself containing an Apache Beam pipeline

For the meta description of our transformation template, we define that we need the location of the exported Avro files and a destination Cloud Storage location:
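
A minimal sketch of such a metadata file, assuming the two parameters are simply called input and output (both names are placeholders for this example):

    {
      "name": "Big Table row key transformation",
      "description": "Rewrites the row keys of exported Big Table Avro files and writes new Avro files.",
      "parameters": [
        {
          "name": "input",
          "label": "Input Avro files",
          "helpText": "Cloud Storage pattern of the exported Avro files, e.g. gs://my-bucket/export/*"
        },
        {
          "name": "output",
          "label": "Output location",
          "helpText": "Cloud Storage prefix for the transformed Avro files, e.g. gs://my-bucket/transformed/part"
        }
      ]
    }

The file is stored in the same Cloud Storage folder as the staged template itself, named for example transform_rowkey_metadata.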

To create a Google Cloud compatible Apache Beam pipeline, we install the pip package apache-beam[gcp] and need to implement:

  • Logic to parse the parameters we just defined in the metadata file as runtime parameters, using add_value_provider_argument() (a rough sketch follows this list)
  • A definition of a Big Table compatible Avro schema
  • An extension of the beam.DoFn class containing the transformation logic to change the BigTable row key
  • A pipeline definition gluing everything together, including reading and writing the Avro files
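
As a rough idea of the first bullet point, a custom options class can be sketched like this, reusing the placeholder parameter names input and output from the metadata sketch above:

    from apache_beam.options.pipeline_options import PipelineOptions


    class TransformOptions(PipelineOptions):
        """Runtime parameters matching the names declared in the *_metadata file."""

        @classmethod
        def _add_argparse_args(cls, parser):
            # add_value_provider_argument() makes the values available at
            # execution time of the template, not at template creation time.
            parser.add_value_provider_argument(
                '--input',
                help='Cloud Storage pattern of the exported Avro files')
            parser.add_value_provider_argument(
                '--output',
                help='Cloud Storage prefix for the transformed Avro files')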

To not overload this blog post with too much code, you’ll find the full version of the first two bullet points as part of the complete example linked at the end of this section. The more important part is how we transform the row key and why this format is chosen.

In this example, the old row key follows Google’s outdated suggestion to prefix the row key with a random hash: <randomHash>_YYYYMMDD_HHmmSS. This design is especially problematic if we need to extract, let’s say, the last month of data, because Big Table needs to scan the complete table for this. However, letting the row key start with a date is not a good solution either, as this monotonically increasing key would lead to hotspotting. To solve this problem we choose a random but still guessable prefix: a single leading random digit, with as many possible values as there are (planned) nodes in our cluster.

The Python code for the Apache Beam transformation logic can be sketched as follows; the complete version is part of the repository linked below.
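
The sketch assumes that the exported Avro records expose the row key as a bytes-valued key field and that the cluster is planned with four nodes:

    import random

    import apache_beam as beam

    # Number of (planned) Big Table nodes; the leading digit is drawn from this range.
    NUM_NODES = 4


    class TransformRowKeyFn(beam.DoFn):
        """Rewrites '<randomHash>_YYYYMMDD_HHmmSS...' row keys to
        '<digit>_YYYYMMDD_HHmmSS..._<shortHash>'."""

        def process(self, element):
            old_key = element['key']
            if isinstance(old_key, bytes):
                old_key = old_key.decode('utf-8')

            # Split off the old random hash prefix and keep the timestamp part.
            old_hash, _, timestamp_part = old_key.partition('_')

            # New key: a single digit in [0, NUM_NODES) to avoid hotspotting,
            # followed by the timestamp and a shortened hash suffix for uniqueness.
            new_key = '{}_{}_{}'.format(
                random.randint(0, NUM_NODES - 1), timestamp_part, old_hash[:8])

            element['key'] = new_key.encode('utf-8')
            yield element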

This transformation logic is used in our final Apache Beam pipeline:
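
In outline, the glue code can look like this; it reuses the hypothetical TransformOptions class from above and assumes a BIGTABLE_AVRO_SCHEMA constant holding the Big Table compatible Avro schema:

    import apache_beam as beam
    from apache_beam.io.avroio import ReadFromAvro, WriteToAvro
    from apache_beam.options.pipeline_options import PipelineOptions


    def run():
        pipeline_options = PipelineOptions()
        options = pipeline_options.view_as(TransformOptions)

        with beam.Pipeline(options=pipeline_options) as p:
            (p
             | 'ReadAvroFiles' >> ReadFromAvro(options.input)
             | 'TransformRowKey' >> beam.ParDo(TransformRowKeyFn())
             | 'WriteAvroFiles' >> WriteToAvro(options.output,
                                               schema=BIGTABLE_AVRO_SCHEMA))


    if __name__ == '__main__':
        run()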

The complete code can be found in my GitHub repository.

Start the engines

The pipeline metadata and the pipeline code itself need to be deployed into a Cloud Storage bucket as described in the README in my repository. If everything worked as expected, you should now be able to select the custom pipeline template in Google Dataflow. Select the exported Big Table Avro files as input location and a file pattern as output location, and you’re good to go:

Google Dataflow dialog with selected custom template

Finally, we use the Google Dataflow template Avro Files on Cloud Storage to Cloud Bigtable to import our transformed data into a new Google Bigtable. You can use the cbt command-line tool to check that the inserted items use our new row key. For the provided example pipeline we’ll find something like the following pattern: 0_20190126_131844_680_49da57bf, which is our expected outcome.
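
A quick check with cbt can look like this (project, instance, and table names are placeholders):

    # Read a few rows whose keys start with the new prefix digit 0
    cbt -project my-project -instance my-instance read my-new-table prefix=0_ count=3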

Same same, but different

The data in our new Bigtable is of course the same; however, the new row key structure has a significant impact on query performance. Attaching the table as an external BigQuery data source and querying for a certain range of time was previously only possible with a SQL pattern like … WHERE rowkey LIKE '%_201901%'. This forced BigTable to do a full table scan, which can take hours for huge amounts of data. The new structure of the row key allows us to search by prefix using a SQL pattern like … WHERE rowkey LIKE '0_201901%' OR rowkey LIKE '1_201901%' OR rowkey LIKE '2_201901%' OR rowkey LIKE '3_201901%'. BigTable will jump directly to the needed data and stop searching when the lexically ordered row key for the next month (*_201902*) is reached.

Update 2019–04–25

Actually, it looks like BigQuery is not smart enough to make full use of the optimized row key when using LIKE (see this Stack Overflow question). However, it works if we use a query like this: WHERE rowkey > "0_20190425" AND rowkey < "0_20190426". While a query for 300k rows using the LIKE syntax took 10 minutes to execute, this syntax took 2 seconds.
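
Covering all four prefix digits, the full BigQuery query over the external table can then look like this (the table name is a placeholder):

    SELECT *
    FROM `my_dataset.my_bigtable_table`
    WHERE (rowkey > "0_20190425" AND rowkey < "0_20190426")
       OR (rowkey > "1_20190425" AND rowkey < "1_20190426")
       OR (rowkey > "2_20190425" AND rowkey < "2_20190426")
       OR (rowkey > "3_20190425" AND rowkey < "3_20190426")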

This row key design is especially helpful if you need to extract data by timespan. For your particular business case, another design might make the difference between query runtimes of hours or just minutes. If you want to dig deeper into Apache Beam, I can recommend their extended Beam Programming Guide. Happy coding!
