A Unified Geo Data Layer, Part One
Simplifying geography with hexagons
Geospatial data is really hard to work with. It requires specialized libraries and file formats as well as an understanding of the arcane art of map projections. At Vrbo™ (part of Expedia Group™), geography is very important to the business. Geography is the foundation of our search product. Geography defines market competitors and influences optimal pricing. Regulations and tax laws operate within geographic boundaries. Property owners ask questions like:
How often do people book properties in my neighborhood?
Setting aside the challenge of defining a neighborhood (is it a block? a street?) and assuming the neighborhood is represented as a polygon, we’d need to figure out which properties are inside that polygon and count their bookings. This requires a point-in-polygon join: we must take every property in the system and check whether it falls inside the polygon. There are software libraries that do this pretty quickly, and if there’s just one polygon, it’s not awful. But what if we want to answer this for every owner? That means polygons for all neighborhoods, not just the one around a single property owner. For the United States, one version of “neighborhoods” (built by a specialized, custom variant of HDBSCAN) looks like this:
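To make the cost of the naive approach concrete, here’s a minimal point-in-polygon check using the standard ray-casting test (in practice you’d use a geospatial library such as Shapely; the polygon and points below are illustrative):

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: a ray from (x, y) that crosses the
    polygon's edges an odd number of times starts inside it."""
    inside = False
    j = len(polygon) - 1
    for i in range(len(polygon)):
        xi, yi = polygon[i]
        xj, yj = polygon[j]
        if (yi > y) != (yj > y) and x < (xj - xi) * (y - yi) / (yj - yi) + xi:
            inside = not inside
        j = i
    return inside

# the naive join: test every property against the polygon
neighborhood = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
properties = [(0.5, 0.5), (0.2, 0.9), (2.0, 2.0)]
inside_count = sum(point_in_polygon(x, y, neighborhood) for x, y in properties)
```

With thousands of polygons and millions of properties, this nested loop is exactly the cost the rest of the post tries to avoid.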
That’s a lot of polygons, and we need to ask each one which points it contains. It’s not pretty. Elasticsearch and PostGIS use specialized spatial data structures that quickly index the polygons by bounding box to reduce the search space considerably, but that adds yet another specialized tool just to answer this simple question.
Another way to do it, avoiding both the point-in-polygon join and the difficult definition of a neighborhood, is to use a smoothing technique. If we smooth the bookings made by neighbors, we can just check the value of the smoothed function at the unit’s location and get a representation of bookings (though not necessarily an easily interpretable one). One way we did this in the past was kernel density estimation (KDE), in which we drop a smoother on top of each point and sum up all the contributions according to this equation:
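The equation referenced here is the standard kernel density estimate: given n observed locations $x_i$, a kernel $K$, and a bandwidth $h$, the smoothed value at a query point $x$ is

```latex
\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)
```

where each data point contributes through the kernel, and the contribution shrinks as the point gets farther from $x$.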
The kicker is that these smoothers usually drop as a function of distance, so we need a way to calculate distance quickly. Turns out that’s not trivial.
There are a couple of ways to calculate this distance, and the reason it’s hard is that the earth is not flat. The first is the haversine, or great-circle, distance. This metric assumes the earth is a sphere (it’s not), and it’s really slow to compute if you have to repeat it a bunch of times (we do). The other way is to use a map projection that preserves distance. There are tons of map projections, but the simplest one for general-purpose local distance calculations is transverse Mercator. It flattens the earth along a specified central meridian over the ellipsoid, so it doesn’t assume a sphere. Once the coordinates are projected, you can just use regular planar distance and everything is okay. The downside is that the farther you get from the central meridian, the less accurate the distances become, so it won’t work for large shapes. That said, we processed local areas pretty quickly with these projections and got something that looks like this:
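For reference, a minimal haversine implementation, assuming a spherical earth with mean radius 6371 km:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two lat/lon points (degrees),
    assuming a spherical earth."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))
```

The chain of trig calls per pair of points is what makes this expensive when repeated millions of times.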
This is a heat map and it looks nice, but the estimator powering it is hard to work with. Once the estimator is “trained”, it can’t just give us all the points on earth and their scores; we have to ask it for specific points. This makes precomputation nearly impossible — to serve this online we’d have to calculate the scores on the fly, which, as shown above, requires a sum for each prediction. That sum takes longer the more data the estimator has, so anything on the scale of properties in the US would be prohibitively slow.
To precompute these scores we could drop a grid of points down and snap any point in our dataset (new or otherwise) to that grid. Then we can retrieve the snapped, precomputed value instead of calculating it. If the model or product can tolerate the loss of precision, we just saved a bunch of time. So why bother with exact points in the first place? If we could tile the world and compute the score for each tile, we’d keep the same precision and might avoid the distance and point-in-polygon calculations entirely.
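A toy version of this snapping, using a plain lat/lon grid (the cell size and coordinates here are illustrative):

```python
import math

def snap(lat, lon, cell_deg=0.01):
    """Map a point to the integer (row, col) of its grid cell;
    every point in the same cell gets the same key."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))

# two nearby points land in the same cell,
# so they share one precomputed score
key_a = snap(30.2672, -97.7431)
key_b = snap(30.2699, -97.7405)
```

The (row, col) pair acts as a crude geographic hash: look it up in a precomputed table instead of re-running the estimator.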
This is not a new idea. What I just described above is called a geographic hash, and it is exactly what it sounds like — a hash function for geographic points.
Each string represents some geometry, and every point within that geometry gets the same hash. There’s also a way to select the size of the tiles, so the precision can be tuned. This technique is perfect for this kind of application because what matters is retrieving the data easily, not making distance calculations fast. Our application is different from cartography: we don’t need a precise boundary; we need an easy way to get the data for analysis and exploitation.
There are several systems that hash geographies:
Geohash is the best known: it takes the latitude/longitude pair and hashes it to a preconfigured number of bits, represented as a string. It has some flaws, namely that the cells are rectangles and cover different physical areas depending on latitude, which would introduce a very nasty bias if we viewed data this way. S2 corrects for this by using geodesic squares, so that cells cover nearly the same area. If we want to smooth these cells, though, we need to retrieve each cell’s neighbors, and a geodesic square has two kinds: edge neighbors (up, down, left, right) and diagonal neighbors, which sit at a different distance from the center cell than the edge neighbors do. It’s workable but inconvenient. It turns out a square isn’t the best shape for this work; we need more sides so there’s only one kind of neighbor. This is why Uber built H3, a hashing function that maps points to hexagons.
A hexagon’s center is the same distance from the centers of all six of its neighbors.
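That equidistance property is easy to verify with axial hex-grid coordinates (a planar sketch, not H3’s actual spherical layout):

```python
import math

# the six axial-coordinate offsets of a hexagon's neighbors
DIRECTIONS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]

def hex_center(q, r, size=1.0):
    """Planar center of axial cell (q, r) on a pointy-top hex grid."""
    x = size * (math.sqrt(3) * q + math.sqrt(3) / 2 * r)
    y = size * 1.5 * r
    return x, y

origin = hex_center(0, 0)
distances = [math.dist(origin, hex_center(dq, dr)) for dq, dr in DIRECTIONS]
# all six distances come out to sqrt(3) * size
```

Compare that with a square grid, where the four diagonal neighbors are √2 times farther than the four edge neighbors.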
They’re also hierarchical, though not as strictly as the squares and geohashes: a child hexagon isn’t perfectly contained in its parent, so points near a cell’s edge may map to a parent cell that doesn’t quite contain them.
This blog post outlines the technical details of the H3 system (and is the source of the above images) — so definitely check that out if you’re interested in learning more. The focus of this series is to demonstrate what we can do when we use H3 to index our geospatial data.
So let’s revisit the question:
How many bookings are in our neighborhood?
Assuming we select an appropriate hexagon size (referred to as resolution), this becomes very simple: hash the unit locations, group by the hash, and add the bookings.
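A sketch of that aggregation. The cell IDs below are made up for illustration; in practice each one would come from the h3 library’s point-to-cell function (whose exact name varies by library version), applied to the unit’s latitude and longitude at a chosen resolution:

```python
from collections import defaultdict

# (precomputed H3 cell ID, bookings) per property -- illustrative IDs
properties = [
    ("8a2a1072b59ffff", 12),
    ("8a2a1072b59ffff", 7),
    ("8a2a1072b5bffff", 3),
]

# group by the hash and add the bookings
bookings_per_cell = defaultdict(int)
for cell, bookings in properties:
    bookings_per_cell[cell] += bookings
```

Answering the question for a new unit is then a single dictionary lookup on its hashed location.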
Now we can answer that question not only for every existing unit in the system but for new units too. Just hash the new unit’s location and look up the hash. No specialized database required.
We’ve been using H3 in production at Vrbo for months. Its primary use cases fall into two broad categories:
- Easy aggregation of geospatial data at large scales, and
- Fast serving of geospatial data for online use.
We have one model in production that answers the question above about booking densities for specific places. Previous versions of this model required small-scale market-by-market analysis because it was based on kernel density estimators. With H3, the aggregation and smoothing can now run in Spark as a Pandas UDF, and takes only minutes each day.
When a property owner starts the process of listing their property on our site, they’re greeted with an estimate of the rent they could earn.
This estimate is powered by a machine learning model combined with a geospatial search. We recently upgraded this model to use H3 for the geospatial search (previously we used Elasticsearch). This allowed us to precompute more up front by predefining the search boundaries, and it changed two spatial Elasticsearch queries into a single lookup.
One non-machine learning use case is our beach front filter:
The first version of this filter, which launched in an A/B test, was computed on a SQL server and required several days of manually executing geospatial joins over different regions of the earth. With H3, we were able to precompute which hexagons contain coastline worldwide. Combined with precomputed hexagon IDs for all the properties in our system, what was once a geospatial join became a simple string join that takes just minutes to run on a distributed compute engine like Presto.
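Once both sides are precomputed, the join reduces to set membership on cell IDs (again with illustrative IDs):

```python
# precomputed worldwide: hexagons that contain coastline
coastal_cells = {"8a2a1072b59ffff", "8a2a1072b5bffff"}

# precomputed per property: the hexagon each listing falls in
property_cells = {
    "listing-1": "8a2a1072b59ffff",
    "listing-2": "8a2a10730a1ffff",
}

# a plain string join replaces the geospatial join
beach_front = {p for p, cell in property_cells.items()
               if cell in coastal_cells}
```

On Presto the same idea is an ordinary equi-join between two tables of strings, which distributed engines handle trivially.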
This was a light introduction to why H3 exists and what we’re doing with it here at Vrbo and Expedia Group. The next two parts of the series (by my colleague and H3 contributor Kurt Smith) will be a deeper dive into the coastline work mentioned above as well as advanced data science and analysis techniques.