Lightning-fast GeoSpatial Queries using H3

How to cut down the response time for a slow lookup API by 80%

Dakshin Karthikeyan
inspiringbrilliance
5 min read · Oct 23, 2023


Geographical data can be a rich source of information about the world around us. Its practical use cases are never-ending; nearly every popular platform uses geographical data in some form or another — Google Search, Amazon, Twitter/X, the list goes on. Most geospatial queries, however, are fairly expensive to compute. So when you’re working with large volumes of location data, the ability to execute queries on it quickly can be critical for user experience.

For today’s example, let’s work with the data provided by OpenStreetMap for India, as it is simple to write queries on top of. Consider the following scenario: you have the location of every building and power station in India saved in your database. You now need to map each building to its closest power station.

A first pass of the solution is fairly easy to do if you have the data indexed in a database like MongoDB; we can use the geospatial operators provided by it to find matching entries between the building and power collections.
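Here is roughly what that first pass could look like. This is a sketch, not the article's actual query: it assumes a pymongo `db` handle, a `power_stations` collection with a `2dsphere` index on a GeoJSON `location` field, and `[lng, lat]` coordinate order — all illustrative names.

```python
# Sketch of the first-pass lookup using MongoDB's geospatial operators.
# Assumes a 2dsphere index on "location" and GeoJSON [lng, lat] coordinates.

def nearest_station_query(building_coords, max_distance_m=50_000):
    """Build a $nearSphere filter; MongoDB returns matches sorted by distance."""
    return {
        "location": {
            "$nearSphere": {
                "$geometry": {"type": "Point", "coordinates": building_coords},
                "$maxDistance": max_distance_m,  # metres
            }
        }
    }

def find_nearest_station(db, building):
    # Because $nearSphere sorts ascending by distance,
    # find_one returns the closest station.
    coords = building["location"]["coordinates"]
    return db.power_stations.find_one(nearest_station_query(coords))
```

Running this once per building is exactly the pattern that turns out to be slow at scale, as the next section explains.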

The problem? This query can be agonizingly slow! It took over 30 seconds when operating on a small subset of the data (more details on the query and performance results are available at the end of this article).

The time taken will also grow rapidly as we add more data, since the database needs to compute the distance between each building in our inventory and every power station — the work scales with the product of the two counts. This matters because data volumes will only keep growing with time, especially as we expand the platform into more markets.

Note: If you would like to follow along with this article and play around with the OpenStreetMap data, this lists the setup needed to download the data and import it into MongoDB.

Why is the existing solution slow?

Before diving into a new solution or attempting optimizations, it is important to understand the problem itself: why does the current query take so long?

The answer: Sheer data volumes.

To put things into perspective, the OpenStreetMap data has over 10 million buildings to be compared against 1.5 million power stations. That’s 10M*1.5M = 1.5 * 10¹³ comparisons to be done by the database.

Therefore, the most effective way to speed this up is to cut down on the number of comparisons that need to be made. For example, we don’t need to calculate the distance between a building in New York and a power station in London since the matching ones will obviously be in New York. Generalising that a little further: let’s first find all the locations that might possibly be a match by using a fast approximation method, then we can do the slow and precise calculations on the smaller subsets.

So how do we do the ‘fast approximation’?

Introducing Geo-indices!

A geo-index is essentially a way to map an area on the globe to a string. The mapping is designed so that the entire planet is covered by shapes of roughly uniform area. The shapes come in various sizes, so we can choose one that is best suited to our needs. As a result, to find all the buildings within some area on the map, we can do a simple string comparison between their known geo-index values and the ones we need to query for.

There are quite a few options available for a geo-index, the most notable being GeoHash, Google’s S2 and Uber’s H3. If you would like to understand these in more detail, Uber’s introductory post on H3 should help.

For our use case, we went ahead with H3, mainly because the library provides a grid disk traversal method that makes it very easy to identify all the shapes within a certain distance from another shape. Take a look at how that works — given the location of any shape, the library can help find the neighbouring shapes to any depth:

A representation of grid disk traversal for a H3 shape. Image source: https://www.uber.com/en-IN/blog/h3/

So how exactly do we use this?

Now that we have all this knowledge, here’s a summary of the changes that we make to map each building to its closest power station using H3:

First, we need to ensure that the H3 index is available for each document in the collection. We can calculate the H3 value for each document at our chosen precision level once and include it as an attribute in the database.

Once that’s done, here’s how we find the nearest power station for a specific building:

  • Find the H3 index for that building
  • Use the H3 Grid-Disk method on this index value to find all the shapes surrounding it (the depth to which we traverse the grid would depend on the size of the polygons in your chosen precision level and the maximum distance up to which you need to search for the closest power station).
  • Look up the power stations which fall in one of the above shapes and get their exact coordinates.
  • If there’s more than one power station in the vicinity of that building, fall back to a slow, precise distance calculation to find the closest one (note that we now do this only for the power stations already known to be close to the building).

Something to note here is that with this change, we are no longer dependent on the geospatial query operators supported by MongoDB. Instead, we mostly use the database to look up buildings or power stations by their H3 index. We could therefore move the data out of MongoDB into a key/value store like Redis (where the key is an H3 index, mapping to the set of all power stations located in that cell) for even faster lookups.

Profit!

With the method described above, the lookup time no longer grows with the total number of locations in the database; instead, it is roughly linear in the number of power stations near a specific location. We can reduce it further by tuning the H3 precision level, and even use H3’s built-in distance approximation methods if we don’t really need very precise distances.

For our use case in production, we cut the time taken to calculate the nearest location for around 45,000 points from 3.5 minutes using Mongo queries to just 15 seconds using H3 indices stored in Redis.

I hope this article has been a good insight into the power of geo-indices. If you have any thoughts or questions, do share in the comments!

Appendix

Details on the MongoDB geospatial queries used to run the benchmark tests and the total time taken as the volumes go up are documented below:
