EXPEDIA GROUP TECHNOLOGY — DATA SCIENCE

Hexagonalizing the World

A unified geo data layer, part two

Kurt Smith
Expedia Group Technology

--

St. Petersburg, Port Aransas, and Galveston coastlines, with hexagon overlay. Points are a sample of US real-estate locations.

In part 1 of this series, we shared why we used the H3 library at Vrbo™ (part of Expedia Group™) and the datasets we are building on top of it. Now we’ll focus on one application of H3 to coastline modeling.

Our primary goal with this project was to put H3 to the test at a global scale. We learned several things about coastlines along the way.

In short, we can use H3 to represent all 150 million square kilometers of land efficiently at kilometer-level resolution. With this, we can transform an expensive point-in-polygon spatial join that takes days into an inner join that takes minutes. And the inner join is on 100X the data.

Coastline modeling

Often the simplest questions can be the hardest to answer. Ever since starting in Geo Data Science at Vrbo, I’ve wanted to know how certain simple physical geographic features impact our business. For instance:

How close is a property to the coast?

Distance to the coast influences a property’s desirability. How can we simply and efficiently measure the distances of properties to the coast, at scale?

From a data science perspective, if we can’t do that efficiently for the millions of properties we’re interested in, then we can’t do our job.

What makes this difficult to compute is:

  1. The problem size is large: about 150 million square kilometers of land — 64 million square kilometers of it habitable — split among tens of thousands of land masses.
  2. Coastlines are fractal in nature — leading to the surprising coastline paradox — so there’s always more detail than we can capture at any resolution.
  3. Travelers to coastal destinations really do want to be right on the coastline, so we need a high-fidelity model.
  4. We want to determine this for all properties of interest, so scaling is a definite factor. Spatial queries with complex geometries can be computationally intensive (see part 1); doing it millions of times can take days.

From polygons to hexagons

To approach this, our starting point is a set of polygons representing worldwide oceans, land masses, and coastlines, provided by OpenStreetMap.

This dataset has about 700k incredibly detailed polygons that tile all land masses on the planet. The larger land masses are broken up into many smaller polygons, since large polygons with a lot of detail are quite difficult to work with. This means that any given polygon may have both interior edges and true coastline edges, and we have to distinguish between the two.

One approach to our overall problem would be a spatial join of these 700k polygons against our millions of properties. But we’d first have to resolve the interior vs. true coastline edge issue. Once we did that join, we’d then have to determine the distance of each contained property to the true coastline. Even with specialized spatial databases, this approach would be challenging.

Large polygons with a lot of detail are quite difficult to work with

Instead, let’s reformulate the problem, using H3, to convert the complex and expensive spatial join into a simple SQL inner join. We’ll have to prepare our data first, but the payoff is significant. The sketch of the algorithm looks something like the following:

  1. Project the OpenStreetMap polygons from latitude and longitude coordinates to locally Cartesian coordinates, using the transverse Mercator projection.
  2. Buffer each polygon by a small distance to smooth very detailed edges, and transform the buffered polygons back to latitude/longitude coordinates.
  3. Use H3’s polyfill operation to transform each buffered polygon to a set of H3 hexagons at a specified resolution.
  4. Join the collection of H3 sets into one large set, removing any duplicates from overlapping edges.
  5. Compact the large set of H3 hexagons, since we don’t need a high level of detail in the interior.
  6. For each hexagon, query its neighbors; if all are present, label the hexagon as interior. If any are absent, label it as boundary.

This is the basic algorithm. H3 does the heavy lifting for us, and the output is a compact, discretized representation of all land masses, planet-wide. We also label which hexagons are boundary and which are interior.
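To make the steps concrete, here is a minimal sketch in Python, assuming shapely, pyproj, and the v3 H3 Python bindings. The function names, the buffer distance, and the choice to label boundaries before compacting (so the neighbor test stays at a single resolution) are our illustration, not necessarily the production pipeline.

```python
import h3                    # H3 v3 Python bindings
import pyproj
from shapely.geometry import mapping
from shapely.ops import transform

RESOLUTION = 8       # ~0.74 km^2 cells
BUFFER_METERS = 100  # illustrative smoothing distance

def polygon_to_hexes(polygon):
    """Steps 1-3 for one land polygon (shapely geometry, lon/lat coords)."""
    # 1. Project to a transverse Mercator frame centered on the polygon,
    #    so buffering happens in locally Cartesian meters.
    lon, lat = polygon.centroid.x, polygon.centroid.y
    tmerc = f"+proj=tmerc +lat_0={lat} +lon_0={lon}"
    fwd = pyproj.Transformer.from_crs("EPSG:4326", tmerc, always_xy=True).transform
    inv = pyproj.Transformer.from_crs(tmerc, "EPSG:4326", always_xy=True).transform

    # 2. Buffer to smooth very detailed edges, then project back to lon/lat.
    buffered = transform(inv, transform(fwd, polygon).buffer(BUFFER_METERS))

    # 3. Polyfill the buffered polygon with resolution-8 cells.
    return h3.polyfill(mapping(buffered), RESOLUTION, geo_json_conformant=True)

def land_to_h3(polygons):
    # 4. Union the per-polygon sets; a Python set removes the duplicates
    #    that come from overlapping edges.
    hexes = set()
    for polygon in polygons:
        hexes |= polygon_to_hexes(polygon)

    # 6. A cell whose full neighbor ring is present in the set is interior;
    #    any missing neighbor means the cell touches the coastline.
    boundary = {h for h in hexes if not h3.k_ring(h, 1) <= hexes}

    # 5. Compact only the interior, where we don't need fine detail;
    #    boundary cells stay at resolution 8 to preserve the coastline.
    return {"boundary": boundary, "interior": h3.compact(hexes - boundary)}
```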


We use the compact representation to significantly compress the H3 model of land masses. Without compacting, our dataset would require nearly 700M hexagons at resolution 8 to represent the entire land area of the planet. Compacting reduces this by a factor of over 100, which is one significant advantage of H3 over simpler raster (or geographic grid) formats.
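To see the effect on a small scale, consider a single resolution-5 cell (the coordinates below, near Galveston, are illustrative): uncompacting it to resolution 8 produces 343 cells, and compact merges them straight back into one.

```python
import h3

cell = h3.geo_to_h3(29.3, -94.8, 5)       # a resolution-5 cell

children = h3.uncompact({cell}, 8)
print(len(children))                      # 343 == 7**3, three levels down

# compact() recognizes complete sets of children and merges them back up.
print(set(h3.compact(children)) == {cell})  # True
```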

H3 does the heavy lifting for us

We can further refine the boundary hexagons over multiple passes, which lets us keep track of how far inland each hexagon is. This adds some extra steps to our algorithm that we won’t fully describe here; a rough sketch of one possible approach follows.
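The sketch below (our illustration, not the exact production algorithm) sweeps inland from the boundary cells ring by ring, adding roughly one cell’s center-to-center spacing to the recorded distance on each pass.

```python
import math
import h3

def distance_inland(hexes, boundary, max_km=2.5):
    """Tag land cells with an approximate distance to the coast, per pass."""
    # Adjacent resolution-8 cells sit roughly sqrt(3) * edge_length apart.
    step_km = h3.edge_length(8, unit="km") * math.sqrt(3)  # ~0.8 km
    distance = {h: 0.0 for h in boundary}
    frontier, km = set(boundary), 0.0
    while frontier and km < max_km:
        km += step_km
        # Expand one ring inland: land cells we haven't tagged yet.
        frontier = {n for h in frontier for n in h3.k_ring(h, 1)
                    if n in hexes and n not in distance}
        distance.update({h: km for h in frontier})
    return distance
```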

The end result:

  • 5M hexagons at resolutions ranging from 0 to 8.
  • Each hexagon labeled as interior or boundary.
  • Each boundary hexagon tagged with its distance to the coast, up to 2.5 km inland.
  • 80% of the total hexagon count (4M) is used to resolve the coastline at resolution 8. Only 20% (1M) of the hexagons are required to model the interior of all land masses, at resolutions ranging from 0 through 7.

From coastlines to properties

Now that we have all land masses and coastlines represented in H3, the power of this representation becomes clear.

For example, let’s find an approximate distribution of US real estate along the Gulf Coast, using our dataset.

Our geographers have specified a large polygon that defines the Gulf Coast touristic region. We can follow the steps above to generate an H3 compact representation of this polygon for use below.

Normally we would do a point-in-polygon operation, using an ST_Contains predicate in Presto or PostGIS, for example.

We can replace that spatial join with what you might call a hex-join. The steps look like the following:

  1. Filter real estate listings located in Texas, Louisiana, Alabama, Mississippi, or Florida.
  2. For each listing, generate the nested set of H3 hexagons it is located in, from resolution 8 down through 0. This expands each listing in our dataset into 9 rows, one for each H3 level in the hierarchy, and is necessary for the next step to work correctly.
  3. Inner join our expanded listing table to the compact H3 representation of the Gulf Coast region. This is a plain SQL inner join, and is efficient. Because we are joining against a compact representation of the Gulf Coast polygon, at most one hexagon ID for each listing will remain in the result. The whole pipeline is sketched in code below.

This gives us a subset of properties that are within the Gulf Coast region.
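Here is a sketch of this hex-join in PySpark. The table and column names (listings, gulf_coast_h3_compact, lat, lng, state) are hypothetical, and it assumes the h3 package is installed on the workers.

```python
import h3
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()

# Step 2: the nested set of cells a point falls in, resolution 8 down to 0.
@F.udf(returnType=ArrayType(StringType()))
def h3_hierarchy(lat, lng):
    cell = h3.geo_to_h3(lat, lng, 8)
    return [h3.h3_to_parent(cell, res) for res in range(8, -1, -1)]

listings = (
    spark.table("listings")                                     # hypothetical table
    .where(F.col("state").isin("TX", "LA", "AL", "MS", "FL"))   # step 1
    .withColumn("h3_id", F.explode(h3_hierarchy("lat", "lng"))) # 9 rows per listing
)

# Step 3: a plain inner join against the compact Gulf Coast representation
# (one hexagon ID per row); compactness guarantees at most one match each.
gulf_hexes = spark.table("gulf_coast_h3_compact")
gulf_listings = listings.join(gulf_hexes, on="h3_id", how="inner")
```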

With this Gulf Coast subset, we can do another hex-join to the coastline H3 dataset above.

The final dataset has these columns:

  • Property
  • State
  • H3 hexagon ID
  • Coastline boundary or interior indicator
  • If on boundary, distance to coast in km; otherwise null

The overall runtime in Spark is on the order of a few minutes on a small cluster, with a final size of approximately 6M real estate properties when limiting to the Gulf Coast.

Wait, there’s more

If we want to know the coastal distribution for all 170M properties in our dataset, we can simply remove the Gulf Coast hex-join, and do just one hex-join against our H3 coastline data. This takes about 5 minutes in total on a small (5–10 node) Spark cluster.

Compare this to our previous version, which ran a spatial join on just a few million properties in a spatial database: due to the complexity of the worldwide coastline polygon, the same operation took several days, even after distributing the problem across a large cluster!

Using the hexagon representation has some tradeoffs:

  • The H3 representation of a polygon is not perfect. We can control the precision via the H3 resolution used in the polyfill operation. For this example we use resolution 8, so we see massive performance gains at a spatial resolution of about half a kilometer (see the H3 resolution table, and the snippet after this list). Going to resolution 9 or 10 would increase the size of our coastline representation and the runtime of the hex-join, but not significantly.
  • The hex-join we describe here requires some upfront preprocessing of the point data: a call into the H3 library to compute each point’s hexagon IDs. There are other ways to approach this that avoid that call, but they would increase the runtime of the join.
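For reference, the Python bindings expose the average cell sizes behind this resolution choice:

```python
import h3

for res in (8, 9, 10):
    print(res,
          round(h3.hex_area(res, unit="km^2"), 4),   # average cell area
          round(h3.edge_length(res, unit="km"), 4))  # average edge length
# 8  0.7373  0.4614   <- roughly half-kilometer cells
# 9  0.1053  0.1744
# 10 0.015   0.0659
```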

The bigger picture

This is just one example of what H3 enables. We are using it at Vrbo to help power our beachfront filters. Beyond coastline modeling, we can bring in other geospatial raster datasets with H3, overlay them, aggregate them, and use them for downstream analyses and as features for models:

  • Lake coastlines
  • Terrain modeling
  • Urban/non-urban modeling

Once each is represented in H3, we can easily co-locate disparate features into a local geographic region.

None of this is new — GIS systems have supported raster data ever since the dawn of computing. But the other features of H3, such as uniform areal units, compact and uncompact operations, and fast nearest-neighbor lookup functionality, make it a powerful system for these sorts of representations.

Next steps

This was part 2 in a series. In future installments, we will describe geographic statistical and machine learning models we have built leveraging H3 at Vrbo and Expedia Group.
