Computational infrastructure behind Habidatum Location Risk Score

Nikita Pestrov
Habidatum
Published in
7 min readJan 5, 2022

Over the last few years, Habidatum has been working to make real estate property location a measurable and manageable asset. Habidatum’s Location Risk Score is the core metric that supports it and decouples real estate properties from their locations.

Location Risk Score calculation is global and real time which naturally brings its technological challenges. To overcome them, we created a set of data aggregation, processing and analysis technologies that we’d like to explain to you in a series of publications.

We will begin by describing the data storage and aggregation infrastructure that enables the Location Risk Score calculations.

Comparing locations in space and time

Location Risk Score is a relative metric comparing myriads of locations within a city (MSA) or region, country or the globe. To create it, we compare multiple locations in space and over time. This requires the score and underlying metrics to have spatial and temporal dimensions, for which we have an underlying database supporting space and time indexing.

Chronotope Grid is the database allowing to calculate the Location Risk Scores consistently through space and time, globally and across a variety of time slices

Chronotope Grid database unit: space-time cell

Chronotope Grid is built by Habidatum on top of a columnar database engine (Clickhouse). We utilize a unique spatio-temporal data storage schema that relies on quadkeys to generate a regular spatial grid covering the whole globe. This allows us to merge any dataset (like points of interest and mobile phone data for the Location Risk Score) spatially and calculate the datasets’ derivatives at a granular grid level, comparable across the country or globe.

Quadkeys are used by many companies in the spatial data industry — by Microsoft in the Bing Maps, or by Here maps for location specification. Our top collaborator, Mastercard, uses quadkeys for the GeoInsights product, providing accurate aggregated consumer spending data at a detailed spatial level.

What makes Chronotope Grid special is its support of the temporal data dimension together with spatial indexing — the ability to quickly slice and dice a location and retrieve its performance across multiple indicators and time periods.

This, along with the ability to host worldwide data, while zooming in to a particular location, made the Chronotope Grid the basis for most of our projects: the examples include exploring the patterns of behavior at Heathrow airport to guide its commercial expansion, understanding the beats of culture in Denver for its premiere arts complex redevelopment, or studying the mobility around university campuses in Florida for better transit connections development.

Chronotope Grid also supported to the creation of Inclusive Growth Score, a product built by Mastercard Center for Inclusive Growth and powered by Habidatum’s technologies. Internally, it houses over 25 million records for 18 metrics across all the 74+ thousand census tracts in the US and is open to the public. Every census tract is benchmarked against all other tracts in the state, nation, or tracts with the same level of urbanization across these metrics and 5 years.

For Location Risk Score, we are using Chronotope Grid to turn two continuous datasets — mobile phone activity (mobility patterns) and locations of points of interest (commercial density and diversity) — into a discrete and operational set of metrics that feed into the scoring algorithm. This way, the input data points, containing billions of signals and POI locations, turn into the aggregate values at over 550 million grid cells covering the US, with multiple time periods spanning dozens of months.

Let us cover the quadkey indexing in more detail.

Quadkeys as a spatial data indexing strategy

Quadkey is a spatial data indexing technique that allows quick retrieval of data for a particular location. The idea is to split the whole world (under the Mercator projection) into a set of tiles and store the tile identifier for each data point.

The main advantage of the quadkey indexing is that it provides a way of hierarchical binning: depending on the use case, you can retrieve the data for a 5x5 km tile or a 10x10 meter tile, as an example.

To explain the hierarchical binning, let’s introduce the concept of zoom levels: at zoom level 1, the whole world is divided into 4 tiles: denoted by 4 letters A, B, C and D (image on the left). To move to zoom level 2, we further divide each tile into 4 tiles, producing 16 tiles, each represented by 2 letters (AB, CD, BA, etc). For zoom level 3, the world map looks like what is shown in the image on the right, and each tile is represented by 3 letters (ACD, DBA, etc.). The number of tiles at each zoom level is 4 to the power of zoom level (64 for zoom level 3 and 4 294 967 296 for zoom level 16, representing a tile of ~ 450 meters side).

World split to quadkeys of zoom level 1 vs World split to quadkeys of zoom level 3

What is important is that tile ACD is located within the tile AC, which in turn is a part of tile A. This allows us to aggregate data at any zoom level just by truncating the quadkey index of a point stored in the database to the number of characters, equal to the zoom level. So, to get all the points located in the same tile of zoom 5, we would group by the first 5 characters of the quadkeys.

It is also important to note that there are multiple ways to represent a quadkey:

  • The 2-bit interleaved address (a string, with ABCD or 0123 base characters) representation, where each 2-bit character stands for the corresponding zoom level component. This is the most intuitive form.
  • A variation of this address representation is binary form of 01 00 11 10 10, where each 2 bits stand for a zoom level.
  • An unsigned integer representation (64-bit integer is sufficient to store up to 31 zoom levels of the quadkey). This is the most efficient form.
  • The form of 2-d ranged coordinates (x, y) — pixel coordinates, indicating the position of the quaditle on the map.

If you are interested in exploring quadkey indexing for your spatial data studies, there are a few open-source libraries that provide a functionality of turning any point to quadkey and additional spatial operations.

A pyquadkey library for Python might be a good starting point — it implements a version of quadkeys proposed by Microsoft and provides a variety of methods to interact with a quadkey and convert in and out of the quadkeys.

Let’s code: using quadkey to find the top locations by the amount of facilities in NYC

Here are the steps you can take to utilize the quadkeys library to index the NYC facilities database and find the locations with the highest counts of facilities located within ~100-meter grids. We will be using Pandas to read the csv file into the dataframe.

First, you will need to identify the appropriate zoom level of the quadkey to match the desired 100 meter cell size. The size of the quadkey tile decreases with the latitude of the point, so while the side of the 18 grid tile at the equator is equal to 75 meters, in NYC it is around 116 meters. The code snippet below loops over 23 zoom levels to find the one that matches the desired 100 meter cell side the best.

Turns out, the zoom level 18 is the best match for the desired cell side, providing 116-meter grids in NYC.

Next, let’s add the quadkey column to the dataframe, converting the latitude and longitude within the 18 zoom level:

The dataframe with the quadkey column — the first 9 characters are the same

Now we can group the rows by the quadkey and sort the results to find the locations with the most facilities in them.

We can see that the highest-count grid cell is the one with quadkey of 032010110132023002. We can visualize it on the map using the function to_geo().

Left: The top 5 quadkeys with the corresponding number of facilities in them. Right: The top quadkey on the map

We can also check if the surrounding grid cells have the same high numbers of facilities or whether this was an outlier by using the near() function. Turns out, this cell has way more facilities than the ones surrounding it.

This example demonstrates one of many ways you can utilize the quadkey concept to answer some of the spatial analysis questions.

What lies ahead

In this publication, we explained the basic term of the quadkey, which is an essential component of the Location Risk Score.

Quadkey is what also allows us to make data-driven spatial clusters from the grid: by finding the neighboring grids using the indexing system and checking if your grid cell is a local maximum. The following image schematically displays the algorithm of local maximum detection.

Turning quadkey grids into location clusters

Taking the best practices of spatial indexing and applying them to new location analysis challenges make the on-the-fly calculations of the location scores possible at national and worldwide scales.

There are other components of making it possible in Habidatum’s work: Chronotope Platform that runs metric calculations on top of the database, the data processing pipelines and the novel ways to normalize the highly dynamic mobility data to make it comparable in space and time. We will cover these components and more in the upcoming publications. Stay tuned!

--

--

Nikita Pestrov
Habidatum

Data Science Lead at Habidatum, ex easyten co-founder, Skoltech MSc graduate, MIPT graduate