Using Uber H3 spatial indexing to get city-wide POI data from Google Places API

Hisham Sajid
12 min read · Jun 2, 2024


Places POI data layered over H3 cells at resolution 6

A POI, or Point of Interest, is essentially the location and other details of any place on a map that someone might find useful or interesting, such as businesses, landmarks, and amenities. In the context of Google Maps, when you look up your favourite restaurant, you see important details about it, such as its location, name, address, phone number, website, and opening and closing times. One such establishment and its accompanying data is considered a single Point of Interest. Google stores, categorizes, and updates data for millions of such locations, or POI, worldwide.

TL;DR: If you know what you’re doing, all code and data are available here.

As the title states, this entry is about developing a technique that leverages Uber's H3 spatial indexing to extract Point of Interest (POI) data for large geographies, such as entire cities, from the Google Places API.

Before we begin, two very important questions need to be addressed:

a) Why is this (POI data) important?
b) Why do we need this (algorithm)?

Why is this Important?

To answer why this is important, we need to understand the value of POI data. Let's consider two different examples.

  1. I am a Consumer Packaged Goods (CPG) company that focuses on marketing and selling tea and herbal infusions (hint hint) and I am looking to expand my B2B or Out of Home offering. That is, I want to expand the part of my business where I sell directly to schools, hospitals, hotels, restaurant chains etc. Imagine going into a new city and already having detailed information about all the top places you can sell to in the relevant categories, along with their phone numbers, addresses, and websites.
  2. I am a research institute or think tank, and am interested in measuring access to health care in rural Sindh (a province in Pakistan). For this I need access to all hospitals and clinics in the province, so that I can run some form of analysis on accessibility. As good quality data is rare in developing countries, I am forced to look for alternate data sources. Again, I require high quality POI data to run my analysis.

In summary, POI data is of value both to commercial entities selling soap and to non-profits doing policy research. Any method that can optimize the cost and time required to extract this data is therefore valuable.

Why do we need this?

If you're not familiar with it, I recommend spending a few minutes on the documentation for the Google Places API here:

Once you go through the docs a couple of times, you will notice a key limitation: unlike OSM (a different topic for a different day), there is no API call that returns all POI within a given polygon, such as an entire city or district. Google does provide a Nearby Search option that lets you look for places of a given POI type within a defined radius in meters, but you cannot get data for more than 60 POI per query (20 results per page, up to 3 pages per query). So if you want data for larger geographies, you need to find a creative solution.
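To make the limit concrete, here is a minimal, self-contained sketch of what a single Nearby Search query looks like and why it tops out at 60 places. The endpoint and parameters are the documented ones; replace YOUR_API_KEY with your own key.

import requests
import time

url = "https://maps.googleapis.com/maps/api/place/nearbysearch/json"
params = {'location': '24.8607,67.0011',  # lat,lng of the search centre
          'radius': 2000,                 # search radius in meters
          'type': 'cafe',
          'key': 'YOUR_API_KEY'}

results = requests.get(url, params=params).json()
places = results['results']               # first page: at most 20 places

# The API exposes at most two more pages via next_page_token,
# so the hard ceiling is 3 pages x 20 results = 60 places per query.
while 'next_page_token' in results:
    time.sleep(2)                         # the token takes a moment to become valid
    results = requests.get(url, params={'pagetoken': results['next_page_token'],
                                        'key': 'YOUR_API_KEY'}).json()
    places.extend(results['results'])

print(len(places))                        # never more than 60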

The Algorithm

Of course, you can default to a brute-force approach and call the API every 100 meters across your target area, but the Places API is not cheap at $32 per 1,000 requests. Costs stack up quickly, especially when you're searching across big cities, where some areas are far more densely populated (and have far more POI) than others.

The alternative? We use a spatial index to 'saturate' the Places API at a coarse resolution first: if a query hits the 60 POI limit, we drop down to the next, finer resolution of the index and search again. To explain what that means, let's first look at Uber's H3 spatial indexing.

H3 divides the world into hexagonal cells at different resolutions, allowing for scalable and hierarchical spatial indexing. Each hexagon can be recursively subdivided into smaller hexagons, providing multiple levels of resolution. A separate essay could be written on why they use hexagons and not squares, or triangles, or any other shape, but that is not within the scope of this post. If you want to find out more about H3 you can read this post by Uber.

Subdivision into smaller and smaller hexagons. Source: Uber
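A quick way to get a feel for this parent/child hierarchy is to index a single point and walk up and down the resolutions. This is a minimal sketch using the h3-py v3 API that the rest of this post relies on; the coordinates are just an arbitrary point in Karachi.

import h3

# Index an arbitrary point in Karachi at resolution 6
cell = h3.geo_to_h3(24.8607, 67.0011, 6)

# Every cell has exactly one parent at a coarser resolution...
parent = h3.h3_to_parent(cell, 5)

# ...and roughly seven children at the next finer resolution
children = h3.h3_to_children(cell, 7)

print(cell, parent, len(children))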

With this cursory understanding of H3 indexing, we can begin to see how the algorithm will work.

  1. Create a polygon over the area for which you want POI data, and generate a grid of H3 hexagons over it at a coarse resolution.
  2. Call the API once per hexagon at that coarse resolution, with the search radius equal to (or slightly larger than) the radius of the hexagon.
  3. If the API returns 60 results for a hexagon, move to the next, finer resolution of cells within that hexagon, i.e. its children cells. The assumption is that if you're hitting the limit, there are probably more data points you're not getting.
  4. For each of these child cells, call the API again with the radius equal to the radius of the child cell.
  5. Repeat this process (steps 3 and 4) until you have parsed all the finest-resolution cells in your grid.
  6. Use the place_id attribute from the Places API to deduplicate your data.

Setting up the Places API

The very first thing you need to do is set up a Google Cloud Platform account if you don't have one already. You will need a debit or credit card to do this; however, Google gives you 200 USD worth of credit every month, and this experiment stays well within that free limit. You will then need to enable the Places API and get an API key that will be sent with each request.

I will not go into these steps in depth, since there is already excellent documentation from Google (if you're not already familiar with how it works). You can follow these posts, in order:

If you follow these steps correctly, you should end up with an API key.
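One small habit worth adopting before writing any code: rather than pasting the key into your notebook, read it from an environment variable. A minimal sketch, where the variable name GOOGLE_PLACES_API_KEY is just an example:

import os

# export GOOGLE_PLACES_API_KEY='...' in your shell before launching the notebook
API_KEY = os.environ.get('GOOGLE_PLACES_API_KEY')
assert API_KEY, 'Set the GOOGLE_PLACES_API_KEY environment variable first'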

Code

Now let’s try to understand what all this will look like in code.

First things first, we import all the packages we will need.

import pandas as pd
import geopandas as gpd
import numpy as np
import folium
import h3
import tqdm
import requests
import json
import time
import os
import math
import ast

from shapely import geometry
from shapely.ops import unary_union, transform

pd.set_option('display.max_columns', None)

# Please do not keep the API key in the open like this in
# an actual experiment, please.
API_KEY = 'API_KEY_HERE'

Next, we select the area we want to pull data from. Let’s pick Karachi, the biggest city in Pakistan with a population between 18 and 25 million (depending on who you ask).

# Shapefiles will be provided in the github repository
districts_gpd = gpd.read_file('../data/admin_boundaries/Adminbdy Shapefile/District_Boundary.shp')

karachi_districts = ['KARACHI CENTRAL', 'KARACHI WEST', 'MALIR CANTONMENT',
                     'KORANGI CREEK CANTONMENT', 'MANORA CANTONMENT',
                     'CLIFTON CANTONMENT', 'KARACHI CANTONMENT', 'FAISAL CANTONMENT',
                     'KARACHI SOUTH', 'MALIR', 'KORANGI', 'KARACHI EAST']

# Unary union to combine all districts into one polygon object
khi_boundaries = unary_union(districts_gpd[districts_gpd.DISTRICT.isin(karachi_districts)]['geometry'])

# Small buffer to smoothen the edges
khi_boundaries = khi_boundaries.buffer(0.05)
khi_boundaries
khi_boundaries polygon

Next, we get the H3 cells within the bounding polygon (Karachi city) at resolution 6.

# Flip the coordinate sequence from longitude, latitude
# to latitude, longitude, as this is how the h3 API reads it
def flip(x, y):
    return y, x

khi_boundaries = transform(flip, khi_boundaries)

# Convert to a polygon geojson object
khi_boundaries_gejson = gpd.GeoSeries([khi_boundaries])\
    .__geo_interface__['features'][0]['geometry']

# Get h3 cell IDs for cells within the bounding polygon / geojson
res6_khi_hex = h3.polyfill(khi_boundaries_gejson, 6)

The res6_khi_hex variable now holds the cell IDs that combine to create a hexagonal grid over Karachi. What does this 'hex grid' look like?

def func_visualize_hexagons(hexagons, color="red", folium_map=None):
    """
    original source: https://nbviewer.org/github/uber/h3-py-notebooks/blob/master/notebooks/usage.ipynb

    hexagons is a list of hexcluster. Each hexcluster is a list of hexagons.
    eg. [[hex1, hex2], [hex3, hex4]]
    """
    polylines = []
    lat = []
    lng = []
    for hex in hexagons:
        polygons = h3.h3_set_to_multi_polygon([hex], geo_json=False)
        # Flatten polygons into loops.
        outlines = [loop for polygon in polygons for loop in polygon]
        polyline = [outline + [outline[0]] for outline in outlines][0]
        lat.extend(map(lambda v: v[0], polyline))
        lng.extend(map(lambda v: v[1], polyline))
        polylines.append(polyline)

    if folium_map is None:
        m = folium.Map(location=[sum(lat)/len(lat), sum(lng)/len(lng)],
                       zoom_start=9,
                       tiles='cartodbpositron')
    else:
        m = folium_map

    for polyline in polylines:
        my_PolyLine = folium.PolyLine(locations=polyline, weight=4, color=color)
        m.add_child(my_PolyLine)

    return m

m = func_visualize_hexagons(hexagons=res6_khi_hex)
display(m)
Hex grid for Karachi at resolution 6

Before we move forward — a word on H3 resolutions:

H3 contains a total of 16 resolutions, as described in the table below. Each resolution covers the entire earth with a layer of hexagons, ranging from 122 hexagons at the coarsest resolution (0) to roughly 570 trillion at the finest (15). Each successive layer consists of more granular hexagons, and every hexagon at every resolution has its own unique ID.

Source: https://h3geo.org/docs/core-library/restable/

This hierarchical nature of the spatial index makes it an ideal fit for our purpose. We can choose the coarsest resolution we want to 'drill down' from; in this experiment, I've picked resolution 6 based on trial and error. The hexagon IDs look something like this: 864219207ffffff, 86421920fffffff, 86421921fffffff, 86421922fffffff, 864219247ffffff.
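If you want to sanity-check the cell sizes for yourself, h3-py (v3) exposes the numbers behind the table above. The short loop below prints the average edge length and average cell area for the resolutions this post drills through (6 to 9):

import h3

for res in range(6, 10):
    print(res,
          round(h3.edge_length(res, unit='m')),     # average edge length in meters
          round(h3.hex_area(res, unit='km^2'), 2))  # average cell area in square km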

Now that we have a basic understanding of H3 spatial indexing and what we mean when we say H3 cell resolution, let’s write a function that calls the Places API. This will be called multiple times when parsing over the city hex grid.

def func_get_places_poi(lat, lng, resolution, type, api_key):

    places_df = []

    # Nearby Search API URL
    url = "https://maps.googleapis.com/maps/api/place/nearbysearch/json"
    params = {
        'location': f'{lat},{lng}',
        'radius': resolution,
        'type': type,
        'key': api_key}

    response = requests.get(url, params=params)
    results = json.loads(response.content)
    places_df.append(results['results'])

    # Keep requesting the next page (up to 3 pages / 60 results in total)
    while 'next_page_token' in results:

        time.sleep(1)
        params['pagetoken'] = results['next_page_token']
        response = requests.get(url, params=params)
        results = json.loads(response.content)
        places_df.append(results['results'])

    results_df = pd.concat([pd.DataFrame(df) for df in places_df])
    results_df = results_df.reset_index(drop=True)

    return results_df

# Location for Karachi DHA Phase 6, Khayaban-e-Sehar
lat, lng = 24.794386, 67.048928

# See all POI types for the Places API here: https://developers.google.com/maps/documentation/places/web-service/supported_types
results_df = func_get_places_poi(lat=lat, lng=lng,
                                 resolution=2000,
                                 type='cafe',
                                 api_key=API_KEY)

results_df[['place_id','name','types','user_ratings_total','vicinity']].head()

The code above uses the Nearby Search API call, passing parameters such as latitude, longitude, POI type, and resolution in meters, i.e. the radius to search within from the origin coordinates. It parses the API response into a neat DataFrame that we conveniently call results_df.

This is what part of it looks like.

If we run a simple len(results_df) command, we get a count of 60, which means we have most likely saturated the API results and there are probably more locations within the given radius.
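That count of 60 is exactly the trigger the rest of the algorithm hinges on. Here is a minimal sketch of what we will do with it, using the same h3 v3 calls introduced earlier (the drill-down itself is implemented properly further below):

# Resolution-6 cell that contains the test point
parent_cell = h3.geo_to_h3(lat, lng, 6)

if len(results_df) >= 60:
    # The query is saturated, so the next step is to re-search the seven
    # resolution-7 children of this cell, each with a smaller radius
    child_cells = h3.h3_to_children(parent_cell, 7)
    print(f'Saturated: will re-search {len(child_cells)} child cells')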

Next, we need to create a grid with multiple resolutions; this will be the canvas over which the API calls will be made. At the same time, we calculate the radius of each hexagon, with a bit of buffer for good measure, and compute its centroid. The radius will be passed into the resolution parameter of the func_get_places_poi function we wrote above.

res6_khi_hex = pd.DataFrame({'hex_id': list(res6_khi_hex), 'hex_resolution': 6})

res7_khi_hex = h3.polyfill(khi_boundaries_gejson, 7)
res7_khi_hex = pd.DataFrame({'hex_id': list(res7_khi_hex), 'hex_resolution': 7})

res8_khi_hex = h3.polyfill(khi_boundaries_gejson, 8)
res8_khi_hex = pd.DataFrame({'hex_id': list(res8_khi_hex), 'hex_resolution': 8})

res9_khi_hex = h3.polyfill(khi_boundaries_gejson, 9)
res9_khi_hex = pd.DataFrame({'hex_id': list(res9_khi_hex), 'hex_resolution': 9})

khi_hex_df = pd.concat([res6_khi_hex, res7_khi_hex, res8_khi_hex, res9_khi_hex])
khi_hex_df.reset_index(drop=True, inplace=True)

def func_get_hex_radius(row, BUFFER):
    # Approximate the hexagon's radius with its average edge length, plus a buffer
    edge_length = h3.edge_length(row.hex_resolution, 'm')
    radius = edge_length + edge_length * BUFFER
    return radius

khi_hex_df['hex_area'] = khi_hex_df.hex_id.apply(lambda x: h3.cell_area(x, unit='m^2'))
khi_hex_df['hex_radius_places_api'] = khi_hex_df.apply(lambda row: func_get_hex_radius(row, 0.15), axis=1)
khi_hex_df['centroid'] = khi_hex_df.hex_id.apply(lambda x: h3.h3_to_geo(x))
khi_hex_df.head()

You should end up with a khi_hex_df DataFrame that looks like this

Now we have all the pieces of the puzzle in place to run the core of our algorithm: starting at the coarsest resolution, 6, where the search radius is around 3,700 m, and iteratively moving to the children of a cell whenever the number of results reaches 60.

# res 6 ~3700m
# Get the coarsest resolution cells (resolution 6)
max_khi_hex_df = khi_hex_df[khi_hex_df.hex_resolution == khi_hex_df.hex_resolution.min()]

# We created a separate folder to store all output; in case the code is interrupted we can use
# this list to make sure we're not repeating any cells.
searched_cells = os.listdir(f'../data/h3cell_output_cafe/')
searched_cells = set([x.replace('.csv', '') for x in searched_cells])

# Setting POI type to cafe.
type = 'cafe'

for ind, row in tqdm.tqdm(max_khi_hex_df.iterrows(),
                          total=max_khi_hex_df.shape[0]):

    # Parse the coarsest resolution cells and save each output in a csv file
    h3_cell = row.hex_id
    if(h3_cell not in searched_cells):
        lat, lng = row.centroid[0], row.centroid[1]
        resolution = row.hex_radius_places_api
        # Call the Places API using the function we wrote earlier
        df = func_get_places_poi(lat=lat, lng=lng, resolution=resolution, type=type, api_key=API_KEY)
        df['hex_id'] = h3_cell
        cell_save_path = f'../data/h3cell_output_{type}/{h3_cell}.csv'
        df.to_csv(cell_save_path, index=False)

        # Key part of the algorithm: if the API returns 60 or more results,
        # drill down into the children of this cell
        if(len(df) >= 60):
            cell_children = h3.h3_to_children(df['hex_id'][0])
            child_df = khi_hex_df[khi_hex_df.hex_id.isin(cell_children)]

            # res 7 ~1400m
            for ind, row in child_df.iterrows():
                h3_cell = row.hex_id
                lat, lng = row.centroid[0], row.centroid[1]
                resolution = row.hex_radius_places_api
                df = func_get_places_poi(lat=lat, lng=lng, resolution=resolution, type=type, api_key=API_KEY)
                df['hex_id'] = h3_cell
                cell_save_path = f'../data/h3cell_output_{type}/{h3_cell}.csv'
                df.to_csv(cell_save_path, index=False)

                # Repeat the same logic as above at the next resolution
                if(len(df) >= 60):
                    cell_children = h3.h3_to_children(df['hex_id'][0])
                    child_df = khi_hex_df[khi_hex_df.hex_id.isin(cell_children)]

                    # res 8 ~500m
                    for ind, row in child_df.iterrows():
                        h3_cell = row.hex_id
                        lat, lng = row.centroid[0], row.centroid[1]
                        resolution = row.hex_radius_places_api
                        df = func_get_places_poi(lat=lat, lng=lng, resolution=resolution, type=type, api_key=API_KEY)
                        df['hex_id'] = h3_cell
                        cell_save_path = f'../data/h3cell_output_{type}/{h3_cell}.csv'
                        df.to_csv(cell_save_path, index=False)

# You can go as deep as you want, down to res 9 and res 10 in the same way.
# Yes, this should be a recursive function, but I'll leave that to you :)

This should take some time to execute, as the Places API is called and the results processed at different locations and resolutions. The progress bar from tqdm tracks progress over the coarsest-resolution cells and should be a good enough estimate of how much work is left.
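As the comment at the end of the loop suggests, the three nested blocks can be collapsed into a single recursive function. Here is a minimal sketch of that refactor, reusing the helpers and the khi_hex_df grid defined above; the helper name func_search_cell is mine, and I have not run this version against a live key, so treat it as a starting point rather than a drop-in replacement.

def func_search_cell(row, type, api_key, max_resolution=9):
    """Search one hexagon and recurse into its children if the query saturates."""
    h3_cell = row.hex_id
    df = func_get_places_poi(lat=row.centroid[0], lng=row.centroid[1],
                             resolution=row.hex_radius_places_api,
                             type=type, api_key=api_key)
    df['hex_id'] = h3_cell
    df.to_csv(f'../data/h3cell_output_{type}/{h3_cell}.csv', index=False)

    # Stop when the query is not saturated or the finest grid resolution is reached
    if len(df) < 60 or row.hex_resolution >= max_resolution:
        return

    # Otherwise, repeat the search on each child cell of this hexagon
    child_cells = h3.h3_to_children(h3_cell)
    child_df = khi_hex_df[khi_hex_df.hex_id.isin(child_cells)]
    for _, child_row in child_df.iterrows():
        func_search_cell(child_row, type, api_key, max_resolution)

for _, row in tqdm.tqdm(max_khi_hex_df.iterrows(), total=max_khi_hex_df.shape[0]):
    if row.hex_id not in searched_cells:
        func_search_cell(row, type='cafe', api_key=API_KEY)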

After the code above finishes execution, we compile all our data and figure out whether any of this works or not.

# We can see that the algorithm searched 34 finer-resolution hexagons,
# i.e. cells whose parents saturated the API at the coarser resolution.
paths = os.listdir('../data/h3cell_output_cafe')
paths = ['../data/h3cell_output_cafe/' + path for path in paths]
n_df = pd.concat([pd.read_csv(path) for path in paths])

khi_hex_df[khi_hex_df.hex_id.isin(n_df.hex_id.unique())]\
    .hex_resolution.value_counts()

You will see that there are 56 cells at resolution 6 containing data, and 34 cells at resolution 7 with data, meaning that for those 34 cells the parent cell returned 60 results and the algorithm went on to search its child cells / hexagons.

All that looks fine in theory, but what does it look like on a map? Let's add a new layer to the original folium map we created to visualize the hex grid at resolution 6.

# Important: we first de-duplicate our data using place_id,
# which is unique for each POI on Google Maps.
n_df = n_df.drop_duplicates(subset='place_id')
n_df = n_df.reset_index(drop=True)

# The geometry column was serialized to strings when saved to csv,
# so parse it back into dicts to pull out the coordinates
n_df['location'] = n_df.geometry.apply(lambda x: ast.literal_eval(x)['location'])
n_df['lat'] = n_df['location'].apply(lambda x: x['lat'])
n_df['lng'] = n_df['location'].apply(lambda x: x['lng'])

for ind, row in n_df.iterrows():
    lat = row.lat
    lng = row.lng
    folium.CircleMarker(location=(lat, lng),
                        radius=5,
                        color='blue').add_to(m)
m

This is what the resulting map looks like, zoom adjusted.

POI data layered over H3 cells at resolution 6

Each circle marker you see above is one cafe, which is a pretty good visual confirmation that our code works as intended. You may even notice that the more central locations in Karachi, where common sense dictates there would be more cafes, easily have more than 60 markers!

Does this help you with whatever project, hustle, or paper you are working on? Do you think there's a better way to pull POI data? I am eager to hear your thoughts in the comments section.

All code and data are available in this GitHub repository:


Hisham Sajid

Data science, public policy, and everything geospatial. Can be reached at twitter.com/hishamsajid