From a performance perspective, one of the most important things to consider while designing your Bigtable Schema is how to properly design your index. And our friends at Urban Fitness Forge ran into this headlong recently.
Urban Fitness Forge is one of the latest in awesome real time fitness tracker products. Their advanced clothing line carries over 50 sensor measurements for the human body, each sampled 10 times a second. After their initial series A funding, they saw a huge boom in usage for users as they were picked up by a local clothing company. The result, too much data for their existing back-ends, which prompted them to move to the ever powerful Cloud Bigtable.
However, after their migration, they suddenly noticed that it was taking a lot of memory and time to fetch / sort results for their real time updating of statistics. By the time I got involved, their engineering team was in a Code Red to find the problem.
Finding the problem
UFF’s usage pattern for their data dominantly queried lat/long values to find specific sensors within an area for analysis. Or rather, given how often they ran this query, it was way too slow for their needs.
Their schema looked similar to what you’d expect. Their key was comprised of long/lat values for a sample inserted with a delimiter in the middle (e.g. “33.749#-84.388”) :
Performance wise, this simple design was creating a massive bottleneck for them.
While looking at some logging data, we quickly realized that our scans for data points near Portland, Oregon (USA) ended up also scanning and filtering through data points in Milan & Venice Italy. After some digging, researching, and asking around, we realized that there were three problems working in unison to cause this performance issue:
- These cities are very close to each other in terms of Latitude:
- Bigtable stories entries in their shards lexicographically by their index
- Bigtable doesn’t have explicit support for secondary indexing.
Let’s break this down a little more.
Bigtable stories entries in their shards lexicographically by their index, meaning that entries with lexicographically similar keys will be closer to each other in the tablets. Queries that access a single row, or a contiguous range of rows, execute quickly and efficiently. However if need something like a secondary index (like the “LONG” in a “LAT/LONG” pair) then your queries end up in a full table scan, which will be far slower. A full table scan is exactly what it sounds like — every row of your table is examined in turn, which, as your table sizes grow, your performance suffers more.
This is where point #1 above comes into play. Portland is very close to two other large cities in terms of latitude:
The above image shows Portland, Milan and Venice overlaid on a map, you can see the full interactive map here.
As such, their storage in rows tends to be very close to each other, despite being geographically very far away. This means a full table scan involves walking through all the table entries that share a latitude value across the globe — A much larger scan size than we were hoping for.
Building the right index
This continues to highlight an important aspect of Bigtable schema design: choosing a row key that facilitates common queries is of paramount importance to the overall performance of the system.
For Urban Fitness Forge, that meant trying to create an index such that entities close to each other physically were sorted close to each other in table space, in order to reduce the scan size.
Thankfully, this has been solved before. A space filling curve is a way to organize a 2D space such that adjacent indexes are pseudo similar in physical space (I like this explanation if you are looking for more).
For example, if we stored points in a 2D set using a raster index (index = Y*WIDTH+X) we’d end up with something that looked like this:
Meanwhile, a space filling curve does this:
You can see that in the first example (aka RASTER order) that while 6 looks physically close to 2, it’s a whole row away from it, in terms of index. However, in the Space Filling Curve, the closest index values to 6 are physically close as well.
The result is a modified index for entities which use a space-filling curve integrated with a timestamp value to create a single, integer based index.
For Urban Fitness Forge, this meant a massive performance improvement for queries that sampled row data via lat/long, as it allowed items which were geographically close to each other, to also be adjacent to each other in the table.
Your mileage may vary
Now, obviously using a Hilbert space filling curve is not the solution to every Cloud Bigtable index option, but it does highlight the important part: to get the best performance out of Cloud Bigtable, it’s essential to think carefully about how you compose your row key. By choosing the correct row key template, you can avoid a painful data migration process later.
To do this, start by asking how you’ll use the data that you plan to store. For example:
- User information: Do you need quick access to information about connections between users (for example, whether user A follows user B)?
- User-generated content: If you show users a sample of social-style content, such as status updates, how will you decide which status updates to display to a given user?
- Time series data: Will you often need to retrieve the most recent N records, or records that fall within a certain time range? If you’re storing data for several kinds of events, will you need to filter based on the type of event?
By understanding your needs up front, you can ensure that your row key template, and your overall schema design, provide enough flexibility to query your data efficiently.