Migration to Google Cloud Platform — New Geostore

Miguel Mendez
Yik Yak Engineering
Mar 13, 2017

Yik Yak’s primary use case is to quickly return the most recent N messages posted, relative to a user’s current location, as bounded by either a circle of arbitrary radius or an arbitrary polygon. In this post we discuss how we built a geostore (i.e., a spatiotemporal database) which was able to satisfy this use case using Google Bigtable and the Google S2 Library and share some of the lessons learned along the way.

Why the Google S2 Library?

We wanted to be able to perform efficient queries of geotagged data, specifically to return the most recent N messages posted, relative to a user’s current location, as bounded by a shape. Our previous geostore divided the world into a single fixed-size grid keyed by a geohash. It would start from the user’s current location and spiral out over the grid, aggregating the latest N messages to return. Even though this system worked, it did not handle general polygons well, it slowed down significantly as the size of the circle or polygon grew, and it was very expensive to operate. We needed an indexing strategy that could efficiently support the types of queries that we wanted to handle.

Fortunately, around 2005, Google engineer Eric Veach realized that if you were to place the Earth inside of a cube, you could use a Hilbert curve, a continuous, fractal, space-filling curve, to map the 2D space of each cube face onto a 1D line and then project that mapping back onto the surface of the Earth. This mapping produces a set of 64-bit integers that represent a hierarchical division of the Earth’s surface. These 64-bit integers are called S2 cell IDs, and the regions they identify are S2 cells. We will take a practical approach to them, but if you want to go deeper into the mathematics behind this you should check out this presentation, or this post.

S2 Cells

There are 31 S2 cell levels (0 through 30) that form a hierarchical division of the Earth. At level 0, the largest cell level, each cell covers a surface area of approximately 85 million square kilometers. At level 30, the smallest level, each cell covers approximately one square centimeter! And if you want to check whether one cell is contained in another, all you need is a simple range check on the cells’ 64-bit IDs. Let’s look at a couple of examples to see these traits in practice.

The image below shows a level 10 cell covering an 80 square kilometer portion of Atlanta whose cell ID is 0x88f50f0000000000. Notice that it is not a perfect rectangle due to distortions caused by the cube-to-sphere mapping.

Now this level 10 cell can be further divided into smaller level 11 cells as shown in the image below.

Starting from the top left and moving in a clockwise pattern, the level 11 S2 cells are 0x88f50e4000000000, 0x88f50ec000000000, 0x88f50f4000000000 and 0x88f50fc000000000 and have an average area of about 20 square kilometers.

Because of the way these S2 cells are computed, any S2 cell whose cell ID is in the range [0x88f50e0000000001, 0x88f50fffffffffff] is contained within the cell whose ID is 0x88f50f0000000000.

You can use this property of S2 cells to efficiently index your geo-tagged data by the S2 cell IDs corresponding to the data’s location and desired resolution (cell level).
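To make that concrete, here is a minimal sketch using the Go port of the S2 library (github.com/golang/geo/s2). The coordinates are just illustrative, and this is not our production code; it simply shows the containment-by-range-check idea.

```go
package main

import (
	"fmt"

	"github.com/golang/geo/s2"
)

func main() {
	// A point in Atlanta; the exact coordinates are just for illustration.
	ll := s2.LatLngFromDegrees(33.848572, -84.373556)

	// Leaf cell (level 30) containing the point, and its level 10 ancestor.
	leaf := s2.CellIDFromLatLng(ll)
	level10 := leaf.Parent(10)

	fmt.Printf("leaf cell:     %#x (level %d)\n", uint64(leaf), leaf.Level())
	fmt.Printf("level 10 cell: %#x (level %d)\n", uint64(level10), level10.Level())

	// Containment is just a range check on the 64-bit IDs: every descendant
	// of the level 10 cell falls within [RangeMin, RangeMax].
	inRange := leaf >= level10.RangeMin() && leaf <= level10.RangeMax()
	fmt.Println("range check says contained:", inRange)

	// The library exposes the same test directly.
	fmt.Println("CellID.Contains says contained:", level10.Contains(leaf))
}
```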

Coverings

Coverings are an S2 concept that lets you approximate an arbitrary region using a set of S2 cells whose resolution (cell level) you control. Let us assume that we are given a purple circle centered at latitude 33.848572 and longitude -84.373556 with a radius of 5,000 meters.

If we use the S2 Library to approximate a covering using only level 11 and level 12 cells, it would look as follows.

If we wanted a tighter covering, we could use higher resolution level 12 through level 16 cells and it would appear as follows.

Now, if you indexed your geo-tagged data by S2 cell IDs whose levels match those used to create the covering, the resulting set of covering S2 cells can be used to generate a set of queries to retrieve the data. Any results that fall in the approximate covering but not actually inside the bounding shape can be removed with a little bit of post-processing.
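Here is a minimal sketch of how a covering like the one above can be generated with the Go S2 port; the level range and MaxCells value are illustrative, not the exact settings we use.

```go
package main

import (
	"fmt"

	"github.com/golang/geo/s1"
	"github.com/golang/geo/s2"
)

const earthRadiusMeters = 6371010.0

func main() {
	// The purple circle from above: 5,000 meters around (33.848572, -84.373556).
	center := s2.PointFromLatLng(s2.LatLngFromDegrees(33.848572, -84.373556))
	radius := s1.Angle(5000.0 / earthRadiusMeters) // radius expressed as an angle in radians
	circle := s2.CapFromCenterAngle(center, radius)

	// Approximate the circle with cells between levels 12 and 16.
	coverer := &s2.RegionCoverer{MinLevel: 12, MaxLevel: 16, MaxCells: 32}
	for _, cellID := range coverer.Covering(circle) {
		fmt.Printf("%#x (level %d)\n", uint64(cellID), cellID.Level())
	}
}
```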

Google Bigtable

Now that we had an efficient way to index geo-tagged data and perform region-bounded queries against it, we needed to pick a persistent store. We wanted one that was horizontally scalable, had low operational costs, and could support efficient queries for small or large bounding regions. Additionally, we did not want to have to run the underlying datastore ourselves. We are big fans of Google Bigtable, which can handle roughly 30,000 QPS for ~$1,500 per month, so it was a pretty easy decision to use it as our persistent store.

Pairing S2 and Bigtable was actually very straightforward. We simply designed our row keys to include the S2 cell ID at the levels matching the resolution we wanted, together with a reverse timestamp. This combination produces the most recent messages within a given S2 cell straight out of the datastore.
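As a rough sketch of the idea (the key format below is made up for illustration and is not our actual schema), a row key can combine the cell ID at the desired level with a reverse timestamp so that newer posts sort first within a cell:

```go
package main

import (
	"fmt"
	"math"
	"time"

	"github.com/golang/geo/s2"
)

// rowKey builds an index row key of the form <cell ID hex>#<reverse timestamp>
// so that, within a cell, the newest messages sort first.
func rowKey(lat, lng float64, level int, postedAt time.Time) string {
	cell := s2.CellIDFromLatLng(s2.LatLngFromDegrees(lat, lng)).Parent(level)
	reverseTS := math.MaxInt64 - postedAt.UnixNano()
	return fmt.Sprintf("%016x#%020d", uint64(cell), reverseTS)
}

func main() {
	fmt.Println(rowKey(33.848572, -84.373556, 12, time.Now()))
}
```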

Putting It All Together

When a user submits a post, we write several index entries corresponding to the S2 levels that our system supports. When a user wants to get their message feed, we take their current location, compute a radius that will give them an interesting number of messages, compute the set of S2 cells that covers that radius, issue parallel queries against Bigtable for the covering cells, and finally perform some post-processing of the results to customize the feed.
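A hedged sketch of what that read-side fan-out can look like with the Go S2 and Bigtable clients is shown below; the levels, limits, and key prefix format are illustrative rather than our exact production code.

```go
package geostore

import (
	"context"
	"fmt"
	"log"
	"sync"

	"cloud.google.com/go/bigtable"
	"github.com/golang/geo/s1"
	"github.com/golang/geo/s2"
)

const earthRadiusMeters = 6371010.0

// fetchFeed issues one Bigtable scan per covering cell, in parallel, and
// collects the raw rows for post-processing.
func fetchFeed(ctx context.Context, tbl *bigtable.Table, lat, lng, radiusMeters float64) ([]bigtable.Row, error) {
	center := s2.PointFromLatLng(s2.LatLngFromDegrees(lat, lng))
	circle := s2.CapFromCenterAngle(center, s1.Angle(radiusMeters/earthRadiusMeters))
	coverer := &s2.RegionCoverer{MinLevel: 12, MaxLevel: 16, MaxCells: 16}

	var (
		mu   sync.Mutex
		rows []bigtable.Row
		wg   sync.WaitGroup
	)
	for _, cell := range coverer.Covering(circle) {
		wg.Add(1)
		go func(cell s2.CellID) {
			defer wg.Done()
			// Row keys are prefixed with the cell ID (see the key sketch above),
			// so a prefix scan returns that cell's messages, newest first.
			prefix := fmt.Sprintf("%016x#", uint64(cell))
			err := tbl.ReadRows(ctx, bigtable.PrefixRange(prefix),
				func(r bigtable.Row) bool {
					mu.Lock()
					rows = append(rows, r)
					mu.Unlock()
					return true
				},
				bigtable.RowFilter(bigtable.LatestNFilter(1)),
				bigtable.LimitRows(100))
			if err != nil {
				log.Printf("read for cell %v failed: %v", cell, err)
			}
		}(cell)
	}
	wg.Wait()
	// Post-processing (dropping results outside the exact shape, merging and
	// ranking across cells) would happen here.
	return rows, nil
}
```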

Picking the right tools really does make life easier!

Was it all Peaches and Cream?

No, we hit some gotchas along the way, so let’s share those as well.

S2 Gotchas

Even though we talk about S2 as if it were a single implementation, it isn’t. S2 was originally released as a C++ library https://code.google.com/archive/p/s2-geometry-library/. It has since been ported to Go, Node, and other languages. The issue is that not all of the ports have the full capability of the original C++ code. The Go version of the library currently claims to be about 40% complete…

If you plan to use S2, take the time to understand it and to verify that the port you plan to use supports all the features that you need.

Google Bigtable Gotchas

Even though we are big fans of Google Bigtable, we did hit some gotchas which caused us some grief.

For every unique column name within a column family, Bigtable tracks the values written to it as versions of a cell (not to be confused with an S2 cell). Depending on the GCPolicy of the column family, Bigtable may keep only the latest value written, the most recent N values written, or every value ever written to that cell. If you create your column family using the HBase shell, it defaults to a GCPolicy that keeps only the latest value written to a cell. However, if you create the column family using the AdminClient and you don’t specify a GCPolicy, the default behavior is to keep all values ever written to the cell.
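If you create families through the AdminClient, it is worth setting the GCPolicy explicitly. Here is a minimal sketch with the Go Bigtable client; the project, instance, table, and family names are placeholders.

```go
package main

import (
	"context"
	"log"

	"cloud.google.com/go/bigtable"
)

func main() {
	ctx := context.Background()

	admin, err := bigtable.NewAdminClient(ctx, "my-project", "my-instance")
	if err != nil {
		log.Fatal(err)
	}
	defer admin.Close()

	if err := admin.CreateColumnFamily(ctx, "messages", "m"); err != nil {
		log.Fatal(err)
	}

	// Without an explicit policy, the family keeps every version ever written.
	// MaxVersionsPolicy(1) keeps only the latest value per cell.
	if err := admin.SetGCPolicy(ctx, "messages", "m", bigtable.MaxVersionsPolicy(1)); err != nil {
		log.Fatal(err)
	}
}
```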

Things get more complicated because when you read a row and a set of column families out of Bigtable, it defaults to returning every value it has for each cell, as constrained by the column family’s GCPolicy, from newest to oldest. If you process the returned values in order, the latest cell values may be “overridden” by earlier values. You can control the number of versions returned by specifying a LatestNFilter when reading a row, but since whether you need to do so depends on how the column family was created, this tends to cause subtle bugs, especially as people move from prototypes to QA and production environments.
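To illustrate the shadowing behavior (this is an illustrative sketch, not our code), compare a naive copy of a row into a map with one that keeps only the newest value per column:

```go
package geostore

import (
	"cloud.google.com/go/bigtable"
)

// latestValuesNaive copies a row into a map in the order returned. Because
// values arrive newest to oldest for each column, assigning in order means
// older versions overwrite newer ones: the "overridden" behavior described above.
func latestValuesNaive(row bigtable.Row) map[string][]byte {
	out := make(map[string][]byte)
	for _, items := range row {
		for _, item := range items {
			out[item.Column] = item.Value // last assignment wins, i.e. the oldest version
		}
	}
	return out
}

// latestValues keeps only the first (newest) value seen per column, which is
// what you would also get by reading with bigtable.RowFilter(bigtable.LatestNFilter(1)).
func latestValues(row bigtable.Row) map[string][]byte {
	out := make(map[string][]byte)
	for _, items := range row {
		for _, item := range items {
			if _, seen := out[item.Column]; !seen {
				out[item.Column] = item.Value
			}
		}
	}
	return out
}
```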

If you do have a need to keep multiple versions of each cell, be very careful about how you compose your read options. We ran into an issue where our Bigtable CPU levels showed a steady ramp-up over several days, approaching the maximum recommended level even though the request volume was more or less steady. It was simply getting more and more expensive to process the same request volume…

Here is the code that has the bug. Can you see it?

Here is the actual fix and a new graph showing the resulting CPU load improvement.

The bug was that we were applying the LatestNFilter constraint as the last step in a chain of filters. This meant that all versions of the cells were processed by the server, and only then was the latest version kept. The longer the system ran, the more versions built up, and therefore the more expensive it was to process each request. After fixing the code to apply the LatestNFilter(1) first, the CPU consumption dropped back down to expected levels and remained there.
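Since the original screenshots are not reproduced here, the following is a hedged reconstruction of the shape of the problem using the Go Bigtable client; the family and column names are made up.

```go
package geostore

import (
	"context"

	"cloud.google.com/go/bigtable"
)

// readBuggy applies LatestNFilter as the last link in the chain, so the server
// walks every stored version of each cell before finally keeping the newest one.
func readBuggy(ctx context.Context, tbl *bigtable.Table, rr bigtable.RowSet, fn func(bigtable.Row) bool) error {
	filter := bigtable.ChainFilters(
		bigtable.FamilyFilter("m"),       // family name is a placeholder
		bigtable.ColumnFilter("payload"), // column name is a placeholder
		bigtable.LatestNFilter(1),
	)
	return tbl.ReadRows(ctx, rr, fn, bigtable.RowFilter(filter))
}

// readFixed applies LatestNFilter first, so the rest of the chain only ever
// sees one version per cell and the CPU cost stays flat over time.
func readFixed(ctx context.Context, tbl *bigtable.Table, rr bigtable.RowSet, fn func(bigtable.Row) bool) error {
	filter := bigtable.ChainFilters(
		bigtable.LatestNFilter(1),
		bigtable.FamilyFilter("m"),
		bigtable.ColumnFilter("payload"),
	)
	return tbl.ReadRows(ctx, rr, fn, bigtable.RowFilter(filter))
}
```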

Conclusion

There are several techniques for efficiently indexing geo-tagged data, but S2’s approach on top of Bigtable was a great combination for us. If you do use Bigtable, be mindful of your GCPolicy needs.

In the next post, we will discuss the actual migration.
