Handling Geospatial Data and Mapping in Python

Alaa Khamis
Published in AI4SM
Aug 31, 2023

Geography is the study of places and the relationships between people and their environments [National Geographic]. It’s the science of space and place [American Association of Geographers (AAG)]. Geographers ask:

  • where things are located on the surface of the earth,
  • why they are located where they are,
  • how places are different from one another, and
  • how people interact with the environment.

Common tools used by Geographers include the following:

  • Remote Sensing: Remote sensing is the measurement of an object’s characteristics from a distance using reflected or emitted electromagnetic energy.
  • Cartography: Cartography is the science and art of making maps. A map is a means of communicating geographic information.
  • Geographic Information Systems (GIS): GIS is a computerized framework for capturing, storing, retrieving, manipulating, analyzing, and displaying spatial and geographic data and information. GIS mainly answers two categories of questions: where (i.e., the absolute and relative location of features) and what (i.e., the properties and attributes of those features). However, the use of GIS can be extended to other questions such as why (i.e., why features are located where they are) and how (i.e., how these features differ from one another and how they interact with each other).
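The distinction between where and what maps directly onto the GeoJSON feature model used later in this article: geometry holds the location, properties hold the attributes. A minimal sketch (the coordinates and attribute values are illustrative assumptions, not queried data):

```python
import json

# A GeoJSON feature pairs "where" (geometry) with "what" (properties).
# The coordinates and attribute values below are illustrative only.
feature = {
    "type": "Feature",
    "geometry": {"type": "Point", "coordinates": [-79.3957, 43.6629]},        # where
    "properties": {"name": "University of Toronto", "amenity": "university"},  # what
}
print(json.dumps(feature, indent=2))
```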

Geospatial data

Geospatial data is any data related to or containing information about a specific location on the Earth’s surface. Examples include, but are not limited to:

  • Locations of people, businesses, assets, natural resources, new developments, services and other built infrastructure.
  • Spatially distributed variables such as traffic, health statistics, demographics and weather.
  • Data related to environmental change, such as ecology, sea level rise, pollution, and temperature.
  • Data related to coordinating responses to emergencies and to natural and man-made disasters, such as floods, epidemics, and terrorism.

Spatiotemporal Analytics

Spatiotemporal analysis facilitates converting geospatial data into valuable information along the 4 lines of sight shown in the following figure. As explained in my previous article, descriptive analytics, diagnostic analytics, predictive analytics, and prescriptive analytics are different forms of analytics. Descriptive data analytics provides insight into the past and the present while diagnostic analytics provides root-cause analysis. Predictive analytics forecasts the future and prescriptive analytics advises on possible outcomes and their anticipated impacts.

The 4 lines of sight [Image by the author]

Consider the specifications, locations, and CO2 emissions of Internal Combustion Engine (ICE) vehicles. While this data is straightforward and manageable, it doesn’t offer significant value or competitive advantage by itself. However, when we employ descriptive analytics, we can derive deeper insights, such as identifying CO2 emission hotspots, determining emissions on a per-passenger basis, and assessing the daily commute’s carbon footprint. Diagnostic analytics can further elucidate the underlying causes of these emission hotspots, such as traffic jams involving ICE vehicles, deadheading, or proximity to industrial zones. Even more, predictive analytics equips us with the capability to forecast city-wide and global CO2 emission trends, understanding their relationship with climate change and public health over the long run. Ultimately, prescriptive analytics aids decision-makers by offering actionable recommendations, such as phasing out ICE vehicles or introducing incentives for electric vehicles, all aiming towards carbon neutrality or even a carbon-negative future.

Getting geospatial data from open sources

Different open-source repositories and tools enable querying geospatial data, and several Python libraries are available for handling it. The following Colab notebook provides examples of how to handle geospatial data.

Getting data using Overpass API

Overpass API is OpenStreetMap’s query API. It is powerful in that it can return queried features very quickly and allows filtering by location, tags, proximity, and more. Let’s use it to query points of interest (POIs): restaurants near the University of Toronto.

# !pip install overpass
import overpass

api = overpass.API()

# We're looking for restaurants within 1000m of a given point
overpass_query = """
(node["amenity"="restaurant"](around:1000,43.66, -79.39);
way["amenity"="restaurant"](around:1000,43.66, -79.39);
rel["amenity"="restaurant"](around:1000,43.66, -79.39);
);
out center;
"""

restaurants = api.get(overpass_query)

The example above uses the overpass package, which by default returns results in GeoJSON format. See the overpass documentation for more information.

Next, let’s extract some data about each restaurant and plot all of them on a map. This time, we’ll use a plotly Scattermapbox trace, which uses tiles from Mapbox; refer to plotly’s documentation for details. Each POI on the map has a tooltip that shows the restaurant’s name when hovered.

import plotly.graph_objects as obj

# Extract the lon, lat and name of each restaurant
coords = []
text = []
for elem in restaurants['features']:
    latlon = elem['geometry']['coordinates']
    if latlon == []:
        continue
    coords.append(latlon)
    if 'name' not in elem['properties']:
        text.append('NONAME')
    else:
        text.append(elem['properties']['name'])

# Convert into a dictionary for plotly (GeoJSON coordinates are [lon, lat])
restaurant_dict = dict(type='scattermapbox',
                       lat=[x[1] for x in coords],
                       lon=[x[0] for x in coords],
                       mode='markers',
                       text=text,
                       marker=dict(size=8, color='blue'),
                       hoverinfo='text',
                       showlegend=False)


# plotting restaurants' locations around University of Toronto
center=(43.662643, -79.395689) # UofT main building

fig = obj.Figure(obj.Scattermapbox(restaurant_dict))

# defining plot layout (the original "stamen-terrain" style relied on Stamen
# tiles, which have since been retired; "open-street-map" needs no token)
fig.update_layout(mapbox_style="open-street-map")
fig.update_layout(margin={"r": 0, "t": 0, "l": 0, "b": 0},
                  mapbox={'center': {'lat': center[0], 'lon': center[1]}, 'zoom': 13})
fig.show()
Restaurants’ locations around University of Toronto [Image by the author]

You can compile your own dataset using the Overpass QL language, which runs on Overpass turbo. You can use this query language to mine OpenStreetMap data, filter it, and get it ready to be used by osmnx or any library that parses .osm files. Overpass API is the official API for reading data from OpenStreetMap servers, and most online routing services and software use it. Additionally, we will usually use Nominatim for geocoding and reverse geocoding, i.e., translating addresses to and from latitude-longitude coordinates.
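As a rough sketch of what a Nominatim geocoding request looks like (the query string here is an assumption for illustration; in practice you would typically use a client such as geopy or osmnx rather than building URLs by hand):

```python
from urllib.parse import urlencode

# Nominatim's search endpoint turns a free-form address into coordinates.
# The address below is an illustrative example, not a tested endpoint call.
base = "https://nominatim.openstreetmap.org/search"
params = {"q": "University of Toronto", "format": "json", "limit": 1}
geocode_url = f"{base}?{urlencode(params)}"
print(geocode_url)
```

Reverse geocoding works the same way against the /reverse endpoint, with lat and lon parameters instead of a free-form query.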

Getting data using Overpass turbo’s Wizard

Overpass turbo’s Wizard provides an easy way to auto-generate Overpass QL queries. Wizard syntax is similar to that of a search engine. An example of Wizard syntax is amenity=hospital that generates an Overpass QL query to find all the hospitals in a certain region of interest. Hospital locations will be visualized on the map and can be downloaded/copied using the “Export” button. The data can be exported as GeoJSON, GPX, KML, raw OSM data, or raw data directly from Overpass API. You can then use osmnx to read .osm files with osmnx.graph_from_xml.
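For reference, the QL the Wizard generates for amenity=hospital looks roughly like the following (the area name "Toronto" is an assumed region of interest, and the exact query text the Wizard emits may differ):

```python
# A sketch of the kind of Overpass QL the Wizard generates for "amenity=hospital".
# The area name "Toronto" is an assumption for illustration.
hospital_query = """
area["name"="Toronto"]->.searchArea;
(
  node["amenity"="hospital"](area.searchArea);
  way["amenity"="hospital"](area.searchArea);
  relation["amenity"="hospital"](area.searchArea);
);
out center;
"""
print(hospital_query)
```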

Fetching OpenStreetMap data using osmnx

osmnx is a Python library developed to ease the process of retrieving and manipulating data from OpenStreetMap (OSM). It can download (filtered) data from OSM and return the street network as a networkx graph. The following code snippet shows how to fetch OSM data using osmnx.

import osmnx as ox
import matplotlib.pyplot as plt

place_name = "Old Toronto, Toronto"

# fetch OSM street network (drive, walk, private, etc.) from the location
graph = ox.graph_from_address(place_name, network_type='walk')
fig, ax = ox.plot_graph(graph)
Old Toronto, Toronto, Ontario [Image by the author]

You can extract the nodes and edges of the graph as separate structures.

nodes, edges = ox.graph_to_gdfs(graph)

Street types can also be retrieved for the graph as follows:

print(edges['highway'].value_counts())

The following line shows basic network statistics:

ox.basic_stats(graph)

You can visualize points of interest (POIs) in the city using folium. For example, let’s assume we’d like to visualize the Financial District, Yonge-Dundas Square, and the St. George campus of the University of Toronto as POIs.

import folium

# Financial District, Toronto, ON
source_point = (43.6479, -79.3818)

# Yonge-Dundas Square
center=(43.6561, -79.3802)

# University of Toronto
destination_point = (43.6634, -79.3960)

m = folium.Map(location=center, zoom_start=15, scrollWheelZoom=False, dragging=False)
folium.Marker(location=source_point, icon=folium.Icon(color='red', icon='camera', prefix='fa')).add_to(m)
folium.Marker(location=center, icon=folium.Icon(color='blue', icon='shopping-cart', prefix='fa')).add_to(m)
folium.Marker(location=destination_point, icon=folium.Icon(color='green', icon='graduation-cap', prefix='fa')).add_to(m)
m
Financial District, Yonge-Dundas Square and St. George campus of University of Toronto visualized as POIs [Image by the author]

Now, let’s find the shortest path from the Financial District to the St. George campus of the University of Toronto. To calculate the shortest path, we first need to find the network nodes closest to our starting and ending locations.

import geopandas
import networkx

X = [source_point[1], destination_point[1]]
Y = [source_point[0], destination_point[0]]
closest_nodes = ox.distance.nearest_nodes(graph, X, Y)

# Get the rows from the Node GeoDataFrame
closest_rows = nodes.loc[closest_nodes]

# Put the two nodes into a GeoDataFrame
od_nodes = geopandas.GeoDataFrame(closest_rows, geometry='geometry', crs=nodes.crs)

shortest_route = networkx.shortest_path(G=graph, source=closest_nodes[0],
                                        target=closest_nodes[1], weight='length')
print(shortest_route)
Shortest path from the Financial District to the St. George campus of the University of Toronto [Image by the author]

Let’s make a map that shows the above route, with both starting and ending nodes shown as markers using draw_route implemented as part of our Python package optalgotools.

!pip install optalgotools
from optalgotools import routing

routing.draw_route(graph, shortest_route)
Shortest path from the Financial District to the St. George campus visualized on a map [Image by the author]

Just like our graph above, we can also retrieve all the building footprints of a named place.

# Retrieve the building footprints, project them to match our previous graph, and plot them.
# (In newer osmnx releases the geometries module was removed; use ox.features_from_address.)
buildings = ox.geometries.geometries_from_address(place_name, tags={'building': True}, dist=300)
buildings = buildings.to_crs(edges.crs)
ox.plot_footprints(buildings)

Now that we have the building footprints of old Toronto, let’s plot that shortest route again. First, get the nodes from the shortest route, create a geometry from it, and then visualize building footprints, street network, and shortest route all on one plot.

from shapely.geometry import LineString

# Nodes of our shortest path
route_nodes = nodes.loc[shortest_route]

# Convert the nodes into a line geometry
route_line = LineString(list(route_nodes.geometry.values))

# Create a GeoDataFrame from the line
route_geom = geopandas.GeoDataFrame([[route_line]], geometry='geometry', crs=edges.crs, columns=['geometry'])

# Plot edges and nodes
ax = edges.plot(linewidth=0.75, color='gray', figsize=(15,15))
ax = nodes.plot(ax=ax, markersize=2, color='gray')

# Add building footprints
ax = buildings.plot(ax=ax, facecolor='khaki', alpha=0.7)

# Add the shortest route
ax = route_geom.plot(ax=ax, linewidth=2, linestyle='--', color='red')

# Highlight the starting and ending nodes
ax = od_nodes.plot(ax=ax, markersize=100, color='green')
Shortest path from the Financial District to the St. George campus with building footprints [Image by the author]

Getting data from open data repositories

This is an example of how to read data from a URL. The Bike Share Toronto Ridership data contains anonymized trip data, including trip start day and time, trip end day and time, trip duration, trip start station, trip end station, and user type. The dataset is published by the Toronto Parking Authority on https://open.toronto.ca/dataset/bike-share-toronto-ridership-data/. The code below fetches the network’s live station information feed and maps the stations.

import requests
import json
import matplotlib.pyplot as plt
import folium

# URL of the Bike Share dataset file
url = 'https://tor.publicbikesystem.net/ube/gbfs/v1/en/station_information'

# Fetch the contents of the URL
response = requests.get(url)

# Extract the JSON content from the response
json_data = response.json()

# Extract the station information
stations = json_data.get("data", {}).get("stations", [])
extracted_data = []
for station in stations:
    extracted_data.append({"station_id": station.get("station_id"),
                           "name": station.get("name"),
                           "lat": station.get("lat"),
                           "lon": station.get("lon"),
                           "address": station.get("address"),
                           "capacity": station.get("capacity")})

# Create a folium map centered around Toronto
toronto_map = folium.Map(location=[43.651070, -79.347015], zoom_start=12)

# Add markers for each bike share station
for station in extracted_data:
    lat = station["lat"]
    lon = station["lon"]
    id = station["station_id"]
    name = station["name"]
    capacity = station["capacity"]

    # Create a marker and add it to the map
    folium.Marker(
        location=[lat, lon],
        popup=f"Id: {id}<br>Name: {name}<br>Capacity: {capacity}",
        icon=folium.Icon(color="blue", icon="info-sign"),
    ).add_to(toronto_map)

# Display the map
toronto_map
Bike share stations in the City of Toronto [Image by the author]

Let’s visualize the traffic statistics provided by Statistics Canada. We start by loading the data from the Optimization Algorithms GitHub repo as follows:

import geopandas as gpd
import pandas

# Import polygons for Canadian provinces and concatenate to traffic statistics
# Define the raw URL
url_1 = "https://raw.githubusercontent.com/Optimization-Algorithms-Book/Code-Listings/05766c64c5e83dcd6788cc4415b462e2f82e0ccf/Appendix%20B/data/CanadaTraffic/canada.geojson"
url_2 = "https://raw.githubusercontent.com/Optimization-Algorithms-Book/Code-Listings/05766c64c5e83dcd6788cc4415b462e2f82e0ccf/Appendix%20B/data/CanadaTraffic/Canada_traffic.csv"

# Read the GeoJSON into a GeoDataFrame
canada = gpd.read_file(url_1)
traffic_accidents = pandas.read_csv(url_2)

dataset = pandas.concat([canada, traffic_accidents], axis=1).reindex(canada.index)
dataset = dataset.drop(columns=['Province', 'created_at', 'updated_at'])

There are different ways to visualize this data on a map. Below are a few examples of common visualization types.

Choropleth Map: A choropleth map shades each region in proportion to a statistical variable; here, motor vehicle injuries per 100,000 population by province.

m = folium.Map(location=[58.4052172,-109.6062729],zoom_start=3, scrollWheelZoom=False, dragging=True)

# Setup binning for legend
bins = list(dataset['Injuries per 100,000 population'].quantile([0,0.25,0.5,0.75,1]))

# Main map init
folium.Choropleth(
    geo_data=dataset,
    name='choropleth',
    data=dataset,
    columns=['name', 'Injuries per 100,000 population'],
    key_on='feature.properties.name',
    fill_color='BuPu',
    fill_opacity=0.7,
    line_opacity=0.5,
    legend_name='Injuries per 100,000 population',
    bins=bins,
    reset=True
).add_to(m)
folium.LayerControl().add_to(m)
folium.LayerControl().add_to(m)

# Tooltip styling
style_function = lambda x: {'fillColor': '#ffffff',
                            'color': '#000000',
                            'fillOpacity': 0.1,
                            'weight': 0.1}
highlight_function = lambda x: {'fillColor': '#000000',
                                'color': '#000000',
                                'fillOpacity': 0.50,
                                'weight': 0.1}

# Tooltip
NIL = folium.features.GeoJson(
    dataset,
    style_function=style_function,
    control=False,
    highlight_function=highlight_function,
    tooltip=folium.features.GeoJsonTooltip(
        fields=['name', 'Fatalities per 100,000 population', 'Injuries per 100,000 population',
                'Fatalities per Billion vehicles-kilometres', 'Injuries per Billion vehicles-kilometres',
                'Fatalities per 100,000 licensed drivers', 'Injuries per 100,000 licensed drivers'],
        aliases=['Province', 'Fatalities per 100,000 population', 'Injuries per 100,000 population',
                 'Fatalities per Billion vehicles-kilometres', 'Injuries per Billion vehicles-kilometres',
                 'Fatalities per 100,000 licensed drivers', 'Injuries per 100,000 licensed drivers'],
        style=("background-color: white; color: #333333; font-family: arial; font-size: 12px; padding: 10px;")
    )
)
m.add_child(NIL)
m.keep_in_front(NIL)
m
Statistics Canada’s dataset for Motor Vehicle Collisions in 2018 [Image by the author]

Cartogram Map: Cartogram maps distort the area or distance scale of a map to reflect a chosen variable. In this case, we can distort the area of each individual province to scale to its share of motor vehicle injuries in 2018. In this cartogram map, we can see that Manitoba has a disproportionate share of the motor vehicle injuries.

!pip install geoplot
import geoplot

ax = geoplot.cartogram(
    dataset,
    scale='Injuries per 100,000 population',
    legend=True,
    hue='Injuries per 100,000 population',
    legend_var='hue',
    cmap='BuPu',
    limits=(0.15, 2),
    projection=geoplot.crs.WebMercator(),
    figsize=(15, 15))
geoplot.polyplot(dataset, facecolor='lightgray', edgecolor='white', ax=ax)
Statistics Canada’s dataset for Motor Vehicle Collisions in 2018 [Image by the author]

Bubble Map: Bubble maps show properties of the points being visualized through colour and size variations. In the example below, we vary Marker size by the Injuries per 100,000 population, while we colour provinces in blue and territories in red.

territories = ['Yukon Territory', 'Nunavut', 'Northwest Territories']

m = folium.Map(location=[58.4052172, -109.6062729], zoom_start=3, scrollWheelZoom=False, dragging=True)

for i, r in dataset.iterrows():
    folium.Circle(
        location=[r['geometry'].centroid.y, r['geometry'].centroid.x],
        radius=float(r['Injuries per 100,000 population']) * 500,
        fill=True,
        popup=f"{r['name']}: {r['Injuries per 100,000 population']}",
        color='blue' if r['name'] not in territories else 'crimson'
    ).add_to(m)
m
Statistics Canada’s dataset for Motor Vehicle Collisions in 2018 [Image by the author]

Hexagonal Binning: Hexagonal binning is a kind of aggregate visualization method where hexagons are drawn over the map, with each hexagon summarizing/abstracting the data that it contains. The example below uses traffic accident data for Toronto, retrieved from the Toronto Police Service’s Public Safety Data Portal.
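The aggregation idea behind hexagonal binning can be sketched in plain Python; for simplicity this toy version snaps points to square cells rather than hexagons, and the sample coordinates are made up for illustration:

```python
from collections import Counter

# Toy spatial binning: snap each (lat, lon) point to a grid cell and count the
# points per cell. Real hexbin maps use hexagonal cells, but the
# summarize-per-cell idea is the same. The points below are made up.
points = [(43.650, -79.380), (43.651, -79.381), (43.700, -79.400)]

def cell_id(lat, lon, size=0.01):
    """Map a coordinate to the index of the grid cell containing it."""
    return (round(lat / size), round(lon / size))

counts = Counter(cell_id(lat, lon) for lat, lon in points)
print(counts)  # two nearby points share a cell; the third sits alone
```

Libraries like plotly do exactly this aggregation under the hood, then colour each cell by its count.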

import plotly.figure_factory as ff

# Import polygons for Canadian provinces and concatenate to traffic statistics
# Define the raw URL
url = "https://raw.githubusercontent.com/Optimization-Algorithms-Book/Code-Listings/05766c64c5e83dcd6788cc4415b462e2f82e0ccf/Appendix%20B/data/Police/KSI.geojson"

# Read the GeoJSON into a GeoDataFrame
ksi = gpd.read_file(url)

fig = ff.create_hexbin_mapbox(
    data_frame=ksi,
    lat="LATITUDE",
    lon="LONGITUDE",
    nx_hexagon=15,
    opacity=0.5,
    min_count=1,
    labels={'color': 'Reported Accidents'},
    color_continuous_scale='turbo'
)

fig.update_layout(mapbox_style='open-street-map')
fig.show()
Traffic accident data for Toronto [Image by the author]

Heat Map: Heat maps use a colour gradient to represent increasing density in data. As the underlying data points increase in density, so does the intensity (or shade) of the heatmap colour. Below is an example using the same Toronto traffic data from before.

import plotly.express as px

fig = px.density_mapbox(
    data_frame=ksi,
    lat="LATITUDE",
    lon="LONGITUDE",
    mapbox_style='open-street-map',
    radius=1,
    zoom=9,
)
fig.update_layout(coloraxis_showscale=False, hovermode=False)
fig.show()
Traffic accident data for Toronto [Image by the author]

Cluster Map: Cluster maps aggregate data by merging neighbouring points into one larger “cluster”. This generally improves performance when rendering larger datasets and provides an easy-to-access indicator of density.

from folium.plugins import FastMarkerCluster
m = folium.Map(location=[43.6532,-79.3832], zoom_start=10, scrollWheelZoom=False, dragging=True)
FastMarkerCluster([[r.LATITUDE,r.LONGITUDE] for r in ksi.itertuples()]).add_to(m)

m
Traffic accident data for Toronto [Image by the author]

Geospatial Datasets

This section includes a non-exhaustive list of datasets that may be of interest to you. Some sources are open source, while others offer “freemium” or paid options.

1. Available open datasets

2. Commercially available datasets

3. Elevation Data

4. Traffic Datasets

5. Parking Tickets Datasets

6. Public Transport Networks Datasets

7. Planned events, road work, and other temporary changes to the road network and bridge, tunnel and ferry events

8. Traffic Crashes Datasets

9. Emission Datasets

10. Environmental Datasets

11. Mobility-aware urban design, active transportation modeling and access analysis for amenities and public transport

12. Accessibility

13. Crime map

14. Open Source Projects

Citation: Alaa Khamis. Optimization Algorithms: AI Techniques for Design, Planning, and Control Problems. Manning Publications, ISBN 9781633438835, 2023.

AI and Smart Mobility Technical Leader at General Motors Canada | Former Professor of AI and Robotics