Hexagonalizing the World

A unified geo data layer, part two

Kurt Smith
Jan 30 · 8 min read
St. Petersburg, Port Aransas, and Galveston coastlines, with hexagon overlay. Points are a sample of US real-estate locations.

Coastline modeling

Often the simplest questions can be the hardest to answer. Ever since starting in Geo Data Science at Vrbo, I’ve wanted to know how certain simple physical geographic features impact our business. For instance:

How close is a property to the coast?

Distance to the coast influences a property’s desirability. How can we simply and efficiently measure the distances of properties to the coast, at scale?

  1. Coastlines are fractal in nature — leading to the surprising coastline paradox — so there’s always more detail than we can capture at any resolution.
  2. Travelers to coastal destinations really do want to be right on the coastline, so we have to have a high-fidelity model.
  3. We want to determine this for all properties of interest, so scale matters. Spatial queries with complex geometries can be computationally intensive (see part 1); running them millions of times can take days.

From polygons to hexagons

To approach this, we start from polygons representing worldwide oceans, land masses, and coastlines, provided by OpenStreetMap.

Large polygons with a lot of details are quite difficult to work with

Instead, let’s reformulate the problem, using H3, to convert the complex and expensive spatial join into a simple SQL inner join. We’ll have to prepare our data first, but the payoff is significant. The sketch of the algorithm looks something like the following:

  1. Buffer each polygon by a small distance to smooth very detailed edges, and transform the buffered polygons back to latitude/longitude coordinates.
  2. Use H3’s polyfill operation to transform each buffered polygon to a set of H3 hexagons at a specified resolution.
  3. Join the collection of H3 sets into one large set, removing any duplicates from overlapping edges.
  4. Compact the large set of H3 hexagons, since we don’t need a high level of detail in the interior.
  5. For each hexagon, query its neighbors; if all are present, label the hexagon as interior. If any are absent, label it as boundary.
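
To make step 5 concrete, here is a minimal sketch of the neighbor test. It is not our production code: it uses a toy hexagonal grid in axial coordinates as a stand-in for H3, so it runs without any library, and the names `neighbors`, `label_cells`, and `hex_disk` are hypothetical. It also assumes all cells share one resolution; on a compacted set, the lookup would additionally have to check whether a coarser ancestor of the neighbor is present.

```python
# Axial-coordinate offsets of the six neighbors of a hexagon.
# Assumption: this toy grid stands in for H3; with H3, the neighbors
# would come from the library's ring query instead.
NEIGHBOR_OFFSETS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]


def neighbors(cell):
    """Return the six cells adjacent to `cell` (a (q, r) axial pair)."""
    q, r = cell
    return [(q + dq, r + dr) for dq, dr in NEIGHBOR_OFFSETS]


def label_cells(cells):
    """Label a cell 'interior' if all six neighbors are present, else 'boundary'."""
    cells = set(cells)
    return {
        cell: "interior" if all(n in cells for n in neighbors(cell)) else "boundary"
        for cell in cells
    }


def hex_disk(radius):
    """All cells within `radius` steps of the origin: a hexagon-shaped 'land mass'."""
    return {
        (q, r)
        for q in range(-radius, radius + 1)
        for r in range(-radius, radius + 1)
        if max(abs(q), abs(r), abs(q + r)) <= radius
    }


land = hex_disk(2)          # 19 cells: the center plus two rings
labels = label_cells(land)  # the outer ring is boundary, the rest interior
```

Because the test only asks whether neighbor IDs are members of a set, it stays cheap even for millions of cells.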

H3 does the heavy lifting for us

We can go further and resolve additional layers of boundary hexagons in repeated passes, which lets us keep track of how far inland each hexagon is. This adds some extra steps to our algorithm, which we don’t describe here.
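
Since those extra steps are not spelled out, the following is only one plausible way to do it, not a description of our pipeline. It treats the repeated passes as a breadth-first search that starts from the outermost boundary cells and records, for each cell, the pass in which it was reached. As an assumption for the sake of a runnable example, it again uses a toy axial-coordinate grid instead of H3, and `inland_depth` and `max_depth` are hypothetical names. Converting a pass count to kilometres depends on the cell size at the chosen resolution and is left out.

```python
from collections import deque

# Assumption: a toy axial-coordinate hex grid stands in for H3 cells.
NEIGHBOR_OFFSETS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]


def neighbors(cell):
    q, r = cell
    return [(q + dq, r + dr) for dq, dr in NEIGHBOR_OFFSETS]


def inland_depth(cells, max_depth=None):
    """Map each reached cell to the pass in which it became boundary.

    Pass 0 holds cells with at least one missing neighbor. Each later pass
    holds the cells adjacent to the previous pass. Cells deeper than
    `max_depth` are left out of the result.
    """
    cells = set(cells)
    depth = {c: 0 for c in cells if any(n not in cells for n in neighbors(c))}
    queue = deque(depth)
    while queue:
        cell = queue.popleft()
        if max_depth is not None and depth[cell] >= max_depth:
            continue  # do not expand past the requested number of passes
        for n in neighbors(cell):
            if n in cells and n not in depth:
                depth[n] = depth[cell] + 1
                queue.append(n)
    return depth


def hex_disk(radius):
    """All cells within `radius` steps of the origin."""
    return {
        (q, r)
        for q in range(-radius, radius + 1)
        for r in range(-radius, radius + 1)
        if max(abs(q), abs(r), abs(q + r)) <= radius
    }


land = hex_disk(3)
full = inland_depth(land)                  # every cell gets a depth
shallow = inland_depth(land, max_depth=1)  # only the two outermost layers
```

Capping the number of passes mirrors the idea of tagging distance only up to a fixed limit and treating everything deeper as plain interior.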

  • Each hexagon is labeled as interior or boundary.
  • Each boundary hexagon is tagged with its distance to the coast up to 2.5 km away.
  • About 80% of the hexagons (4M) are used to resolve the coastline at resolution 8. Only 20% (1M) are required to model the interior of all land masses, at resolutions ranging from 0 through 7.
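
The reason so few hexagons cover the interior is the H3 hierarchy itself: each hexagonal cell has seven children at the next finer resolution, so every step up in coarseness replaces seven cells with one. The short sketch below just works through that arithmetic; `fine_cells_replaced` is a hypothetical helper name, and the count ignores H3's twelve pentagon cells per resolution, which have fewer children.

```python
H3_APERTURE = 7   # children per hexagonal cell, one resolution finer
FINEST_RES = 8    # the resolution used for the coastline in this article


def fine_cells_replaced(coarse_res, fine_res=FINEST_RES):
    """Number of `fine_res` cells that one `coarse_res` cell stands in for."""
    return H3_APERTURE ** (fine_res - coarse_res)


for res in range(FINEST_RES, -1, -1):
    print(f"resolution {res}: 1 cell replaces {fine_cells_replaced(res):>9,} resolution-8 cells")
```

A single resolution-0 cell therefore stands in for 7^8 = 5,764,801 resolution-8 cells, which is why compaction pays off so heavily away from the coast.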

From coastlines to properties

Now that we have all landmasses and coastlines represented in H3, the power of this representation becomes clear.

  1. For each listing, generate the nested set of H3 hexagons it is located in, from resolution 8 through 0. This duplicates each listing in our dataset 9 times, one for each H3 level in the hierarchy, and is necessary for the following steps to work correctly.
  2. Inner join our expanded listing table to a compact H3 representation of the Gulf Coast region. This is a plain SQL inner join, and is efficient. Because we are joining against a compact representation of the Gulf Coast polygon, at most one hexagon ID for each listing will remain in the result.
  • State
  • H3 hexagon ID
  • coastline boundary or interior indicator
  • If on boundary, distance to coast in km, otherwise null
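
The two steps above can be sketched in plain Python. This is an illustration under stated assumptions, not our Spark job. The parent lookup is a pure-Python stand-in for the H3 library's parent function (`cell_to_parent` in h3-py v4, `h3_to_parent` in v3), based on the documented H3 index bit layout; it does no validation of the index. The listing IDs, cell IDs, labels, and the 0.5 km distance are made-up example data, and `expand_listings` and `hex_join` are hypothetical names.

```python
MAX_H3_RES = 15
RES_SHIFT = 52                 # the resolution is stored in bits 52-55
RES_MASK = 0xF << RES_SHIFT


def get_resolution(cell):
    return (cell & RES_MASK) >> RES_SHIFT


def cell_to_parent(cell, parent_res):
    """Ancestor of `cell` at `parent_res`: rewrite the resolution bits and
    set every finer 3-bit digit to 7 (the 'unused' marker)."""
    res = get_resolution(cell)
    if not 0 <= parent_res <= res:
        raise ValueError("parent_res must be between 0 and the cell's resolution")
    parent = (cell & ~RES_MASK) | (parent_res << RES_SHIFT)
    for r in range(parent_res + 1, res + 1):
        parent |= 0b111 << ((MAX_H3_RES - r) * 3)
    return parent


def expand_listings(listings, finest_res=8):
    """Step 1: one row per (listing, ancestor cell), from finest_res down to 0."""
    return [
        (listing_id, cell_to_parent(cell, res))
        for listing_id, cell in listings
        for res in range(finest_res, -1, -1)
    ]


def hex_join(expanded, coastline):
    """Step 2: inner join on cell ID. `coastline` maps a cell ID in the
    compact set to a (label, distance_km) pair."""
    return [
        (listing_id, cell, *coastline[cell])
        for listing_id, cell in expanded
        if cell in coastline
    ]


# Example data (illustrative only): three listings with resolution-8 cells.
listings = [
    ("A", 0x8828308281FFFFF),
    ("B", 0x8828347001FFFFF),
    ("C", 0x8828308287FFFFF),
]
# Compact coastline set: one fine boundary cell and one coarse interior cell.
coastline = {
    0x8828308281FFFFF: ("boundary", 0.5),
    0x85283473FFFFFFF: ("interior", None),
}

expanded = expand_listings(listings)     # 3 listings x 9 resolutions
result = hex_join(expanded, coastline)   # A and B match once each, C drops out
```

Listing A matches at resolution 8, listing B matches through its resolution-5 ancestor, and listing C matches nothing. Because a compact set never contains both a cell and one of its ancestors, each listing can match at most once, which is what lets an ordinary equality join replace the spatial join.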

Wait, there’s more

If we want to know the coastal distribution for all 170M properties in our dataset, we can simply remove the Gulf Coast hex-join, and do just one hex-join against our H3 coastline data. This takes about 5 minutes in total on a small (5–10 node) Spark cluster.

  • The hex-join we describe here requires preprocessing the point data before joining, which means calling into the H3 library. Other approaches avoid that call, but they would increase the runtime of the join.

The bigger picture

This is just one example of what H3 enables. We are using it at Vrbo to help power our beachfront filters. Beyond coastline modeling, we can bring in other geospatial raster data sets with H3, overlay them, aggregate them, and use them for downstream analyses and features for models:

  • Terrain modeling
  • Urban/non-urban modeling

Next steps

This was part 2 in a series. In future installments, we will describe geographic statistical and machine learning models we have built leveraging H3 at Vrbo and Expedia Group.

Expedia Group Technology

Stories from the Expedia Group Technology teams

Thanks to Russell Brown
