OSM Overpass API: SQL for Geography

Sarnath Kannan
Engineered @ Publicis Sapient
17 min readNov 16, 2020
Photo by Greg Rosenke on Unsplash

Welcome to this very detailed article on Openstreetmap (OSM). I will show you each and every step of how to use OSM API to discover geographic features using the Overpass query language and how to integrate that with Python.

I had spent a lot of time trying to figure out how this works. Available documentation on this topic is sparse, and so I thought I would write one myself so that you can spend your time more effectively!

Time and again, modelers face the task of improving model accuracy and they invariably look at improving the richness of the features as the last resort. One such important rich feature is competition density for a business.

The OSM API can be used to obtain this information. However, the limitation of OSM API is that it cannot give you information on the sales happening in those competitor locations.

If you want to look for competition sales, you can try data syndicators, curators, card-payment gateways and other data banks that sell such curated data to businesses. A well-known example of this is the IRI, Nielsen data that provides retail sales stats for a business and includes competitor stats as well (or Rest of market ROM).

Alternatively, competition location information can also possibly be obtained through Map APIs from Google, Microsoft and others. The paid versions of these APIs possibly have more accurate geo information and you can always choose which way you want to go and each one has its strengths.

However, the advantage that OSM has over these APIs is that OSM exposes a query language (something like an SQL for Geography data). So once you learn this language, you can extract any information you need from the geography. And this can be very powerful.

In this article, I will show you how you can reap such raw location information from OSM (Open Street Map), how to pre-process it using ‘geopandas’ package and display a Choropleth geography map using the ‘folium’ package. OSM API is free and you can quickly build a DB for your business without any cost.

The use case

The use case I will use is to locate airports within Indian states and the hotels within 5 kms vicinity of those airports. In this journey, I will expose how the Overpass API works and what are the potential pitfalls, so that you can be deft while handling your queries. Once we extract data using the API, I will show how you can process that data in Python.

Part 1: Open Street Map Introduction

Indian states

Let us first see how we can build a state-level political map of India. To do that, please fire up this URL (https://overpass-turbo.eu/) in your browser. This website allows you to query geography and see results for yourself. The queries are on the left side and the geographical rendering of results and the raw data appear on the right-hand side as separate tabs. Think of this as a sandbox where you can refine your queries before you decide to script your query and download the geography.

The API has rolling rate-limits and is throttled! So watch your usage!

Enter the below query and then click on the “Run” button

[out:json];

area[“name:en”=”India”][“admin_level”=”2"]->.india;

rel(area.india)[‘admin_level’=’4'];

out geom;

NOTE: I have edited the image for political correctness for Indian audience.

To see raw JSON data, hit the “Data” tab on your right and go through the JSON. You will see relations, ways and nodes with latitude and longitude information.

NWR in OSM

Now is a good time to introduce the NWR concept in Open Street Map. NWR stands for Nodes, Ways & Relations.

Node: A node is a single geo-point (geo pin) that is described primarily by latitude and longitude. A node can have additional attributes as needed. For e.g., “name:en” is an attribute (OSM Tag) that holds English name for a location.

  • Click here to examine an OSM node
  • Take a look at the location and Tags information that describe that node

Ways: A way is a group of nodes that usually forms a closed boundary. When they form closed boundaries, they are called “closed ways”. When they do not form closed boundaries, they are called “open ways”.

  • Click this way to know more about what it is
  • Take a look at its Tags and also the constituent nodes
  • Is this a closed way or an open way?
  • Click on the constituent nodes of this way and understand its geographical relation to the parent way

Relations: A relation is a logical construct introduced in OSM that acts as a mapping between common-worldly-knowledge to OSM entities.

  • For e.g., District is an administrative unit used and is common-knowledge. To represent that, OSM uses a Relation with a tag that conveys the fact that it stands for a district. Such a Relation also contains the nodes that form the boundary of that district.
  • A Relation is not restricted to administrative units. It can represent arbitrary relationships, and so the data conveyed in a Relation depends on the type of relationship.

Areas: Area is again a logical construct. It always represents a closed geographical area. The big advantage is that Areas can be queried for geographical entities that are covered inside the geographical area.

  • Ways and Relations can be converted back to Area and vice-versa using map_to_area and pivot operations respectively
  • This allows searching geographical areas for specific entities of interest

Let us now dissect the query to understand what is going on. This query is an Overpass query. The overpass query language is the language used to query OSM database. The reference guide for the Overpass QL can be found here.

[out:json];

Output of this query will be in JSON format

area[“name:en”=”India”][“admin_level”=”2"]->.india;

Filter areas with name:en (English name tag) set to India

Filter again for areas with “admin_level” tag set to 2

Save the output to a named set called “india”

NOTE: This set “India” will have elements of type ‘area’ only since we filtered on areas.

NOTE: Admin level of 1 stands for trans-country boundaries (like European Union) and Admin level of 2 stands for national boundaries.

rel(area.india)[‘admin_level’=’4'];

Filter relations present inside the areas present in the set ‘India’ with admin_level of ‘4’

Since no named output set is specified, the output is written to default set.

NOTE: The rel() is an area filter which retrieves all types of relations inside the geographical area.

NOTE: Search the Overpass QL guide for “area filter”.

NOTE: Admin level ‘4’ corresponds to ‘states’ for India. This can change from country to country.

NOTE: Admin level semantics of various countries can be found in this wiki here.

out geom;

Output the default set along with geometry information.

NOTE: ‘geom’ returns the latitude and longitude information.

NOTE: When run without ‘geom’ argument, you will still get Data but nothing will be rendered in the Geography map.

Take-aways

  1. Output of OSM query statements are all “sets”
  2. Area filter operations enable us to query inside geographical areas
  3. Tag filters allow us to narrow down and search what is our interest

Part 2: Finding Airports in India

I will take you through several iterations of this query. In this process, you will understand what are the common pitfalls and how to avoid them.

Iteration-1

[out:json];

area[“name:en”=”India”][“admin_level”=”2"]->.india;

node(area.india)[‘name:en’~’airport’,i];

out;

Take-aways

  • We are using an area filter again. This time, we are looking for “nodes” inside the given area (India)
  • See how ~ is used to search for regular expression
  • See how the regular expression search is made “case-insensitive” by the use of “i” option

Assumptions

  • We are looking for airports which are available as “nodes” in OSM
  • We are looking for nodes that have “airport” in their names

Result

Here are our results with 23 nodes with airport in their names in India!

Iteration-1 : Airports in India

Verification

Google for ‘number of airports in India’. The number of nodes we get should be close to this number. But we are not anywhere close. Google reports 486 airports including civilian, military and other flying fields. 23 is too less. What are we missing?

Debugging

We all know Indira Gandhi airport in New Delhi is a very famous airport. Can we locate that in the result? Click on the yellow circle near New Delhi and the name says “IGI Airport Terminal 1 Metro Mag Line”. That really looks like some train terminal inside the Airport and possibly not a representative of the main airport.

Let us now look up Indira Gandhi airport in www.openstreetmap.org and see how it looks like and use that to refine our queries.

Here is the result:

Search for IGI in Openstreetmap.org

Let us click on the “Aerodrome” link and go further.

Aerodrome entry from the previous search result

It stands for a “Way” and not really a “node”. This is the reason why our query is not picking it up.

Spend some time to look at other attributes as well.

Iteration-2

[out:json];

area[“name:en”=”India”][“admin_level”=”2"]->.india;

way(area.india)[‘name:en’~’airport’,i];

out;

Result

We got 28 ways now. Still way less than 486.

But these are NOT displaying on the graph.

The geography displays only when “nodes” are present.

Click on the “Data” tab on the right (next to Map tab) to examine the data manually.

If you would like to display the “ways” on Geography, try this query:

[out:json];

area[“name:en”=”India”][“admin_level”=“2”]->.india;

way(area.india)[‘name:en’~’airport’,i];

(._;._>;);

out;

The bold highlighted ones are those changes.

No, it is not Arabic or Greek.

Let me explain.

What is the meaning of ( ) bracket?

In Overpass query language, the output of every statement is a “Set”. So that means we can apply set operators like Union, Intersection etc. on multiple statements. Here ‘(‘ refers to the UNION operator. It has 2 statements inside it. They are “._;” and “._>;”

What is “._;”

In Overpass query language, “_” represents the default set. The output of every statement is a “Set”. If an overpass statement does not contain an output set name (like -> .india), then the output of that statement goes to the default set and it overwrites whatever was there in the default set earlier. Since the name of the default set is “_”, the way to refer that in a statement is “._”

So this statement, ‘._;’ simply includes all the members of the default set as part of the enclosing UNION operator. In this context, the default set contains the output of the immediately preceding statement (i.e. the ‘way’ statement).

What is “._>;”

The “>” symbol is a “recurse down” operator. Refer the overpass guide for full details. Shortly, it expands the contents of the set by including child-level details of “ways” in the set, membership details of “relationships” in the set etc. “Child” represents next level details and what that means totally depends on whether the element is a way/relationship, etc. For a “way”, this includes the node level details of a way as a part of the output set. In this context, we are trying to get some display of the data on the output screen. And this statement helps us do that.

Result

We get 1903 nodes and 28 ways now. Now, the geography map displays the details. Click on the circles to see what they are. Zoom inside a way to see what it looks like. Examine the raw data in the “Data” tab. Compare it to what you saw earlier without the Recurse down operator. This will help you understand what the ‘Recurse down’ operator is doing.

Note: 28 is still our primary output. The 1903 nodes represent the points that belong to these 28 ways and so these nodes are not really airports.

Take-away

  • We now know how to use the “way” area filter to look into all the “way” inside the “area” (India)
  • We learnt that the geography map does not display ‘ways’. It only displays “nodes” and hence the “ways” are not displayed. We also found a way how to make it display using the ‘recurse down’ operator
  • We now know how to look in the “Data” tab to get to know the raw data in full detail if needed
  • We now know how to use the “recurse down” operator and its impact on the result
  • We now know how to use the “UNION” operator in an Overpass query

Deliberations

We still are not close to the count reported by Google. We are still missing airports. Why?

Are we missing out airports that were “node” in OSM and not “way”. Did we miss them? There could have been genuine airports too that were just nodes!

Iteration-3

Let us combine both ways and nodes. We know how to do that. UNION operator! We do not recurse down here so that we can get the correct count of Airports.

[out:json];

area[“name:en”=”India”][“admin_level”=”2"]->.india;

(

way(area.india)[‘name:en’~’airport’,i];

node(area.india)[‘name:en’~’airport’,i];

)

out;

Result

We still are short of the 486 figure — expectedly.

Deliberations

Look up some results and we see a property called Aeroway. Each one of these tags are documented. We had ignored this before. Now, let us take a look. https://wiki.openstreetmap.org/wiki/Aeroways

Example: Warangal airport does not have an airport in name, but has property of Aerodrome.

Take-away

  • Tags are important identifiers. Put them to good use. If you find an OSM tag that is related to what you are looking for, go for it!
  • Life lesson: What you ignore, will come back to bite you

Iteration-4

[out:json];

area[“name:en”=”India”][“admin_level”=”2"]->.india;

(

way(area.india)[‘name:en’~’airport’,i];

node(area.india)[‘name:en’~’airport’,i];

way(area.india)[‘aeroway’=’aerodrome’];

node(area.india)[‘aeroway’=’aerodrome’];

);

out;

Result

We have 67 nodes and 312 ways. Looks like, we are close now. But how accurate and noise-free is this result?

Deliberations

Look up ‘Puducherry airport’ in OSM.

  • “Puducherry airport” does not have “name:en” property. It has only “name” property
  • Is our reliance on “name:en” property misplaced?
  • There is considerable noise in this result. We see a lot of airport-related stuff like “Airport bus stop”, “Airport Metro stations”, “Offices of airport authority of India” etc. that have “airport” in their names. So we need refinement!

Take-away

· Name-based searches can lead to a lot of noise in the Result set

· Informative tags are the best leads. But if you can’t get it, you can go for Name-based search and then weed out using a post-processing step (custom code written in Python + manual verification)

Iteration-5

[out:json];

area[“name:en”=”India”][“admin_level”=”2"]->.india;

(

way(area.india)[‘aeroway’=’aerodrome’];

node(area.india)[‘aeroway’=’aerodrome’];

)->.indianairports;

(.indianairports; .indianairports>;);

out;

Result

We get 310 ways and 13663 nodes on it. 310 is our main result. 13663 contains node points inside these 310 ways (due to recurse down operator) and possibly also stand-alone airport nodes.

Deliberations

We also see nodes in the result that have no names whatsoever (by examining the raw data in the Data tab). It is useless to have these points in the result.

We also see “recurse-down” operator is adding a lot of nodes which are part of “ways” and these can be quite noisy too. We may not need those.

Iteration-6

area[“name:en”=”India”][“admin_level”=”2"]->.india;

(

way(area.india)[‘aeroway’=’aerodrome’][‘name’];

way(area.india)[‘aeroway’=’aerodrome’][‘name:en’];

node(area.india)[‘aeroway’=’aerodrome’][‘name’];

node(area.india)[‘aeroway’=’aerodrome’][‘name:en’];

)->.indianairports;

.indianairports out geom;

Result

We now get 294 ways and 43 nodes that are of type “aerodrome” and have a name associated with them.

Take-away

  • We can filter on merely presence of “tags” as well by just using the tag name and not specifying any filter on its value. This is how we filter aerodromes that at least have “name” (or) “name:en” tag associated with them.

Finding airports and nearby hotels

[out:json];

area[“name:en”=”India”][“admin_level”=”2"]->.india;

(

way(area.india)[‘aeroway’=’aerodrome’][‘name’];

way(area.india)[‘aeroway’=’aerodrome’][‘name:en’];

node(area.india)[‘aeroway’=’aerodrome’][‘name’];

node(area.india)[‘aeroway’=’aerodrome’][‘name:en’];

)->.indianairports;

(

node(around.indianairports:5000)[‘tourism’=’hotel’];

way(around.indianairports:5000)[‘tourism’=’hotel’];

) ->.nearbyhotels;

(.indianairports; .nearbyhotels;);

out geom;

TAKE-AWAY

  • Using ‘around’ filter to locate geographic objects based on Proximity

Part 3: OSM query output to Geojson format

Most Python packages require Geojson format for geographic displays. The OSM query output is in JSON but not really Geojson format. It is OSM’s own JSON format. So we need to convert OSM output to Geojson before we can integrate that with Python.

Step 1: Prepare query content for POST

Till now, we were using the overpass turbo website to refine our query. For practical purposes, we need to programmatically query and download the results. Now that we have our query ready, we can issue a “POST” command to the Overpass interpreter service and then download the results. To do this, we first need to prepare content for the POST query.

So, create a file called “indianairportsrestaurants.opass” with the following content that will be used as POST content.

data=area[“name:en”=”India”][“admin_level”=”2"]->.india;

(

way(area.india)[‘aeroway’=’aerodrome’][‘name’];

way(area.india)[‘aeroway’=’aerodrome’][‘name:en’];

node(area.india)[‘aeroway’=’aerodrome’][‘name’];

node(area.india)[‘aeroway’=’aerodrome’][‘name:en’];

)->.indianairports;

(

node(around.indianairports:5000)[‘tourism’=’hotel’];

way(around.indianairports:5000)[‘tourism’=’hotel’];

) ->.nearbyhotels;

(.indianairports; .nearbyhotels;);

out geom;

Note:

  1. I have removed the usual first line [out:json]. This will ensure the output is in OSM XML default format
  2. I have added “data=” prefix because this file will be used as POST input for a “wget” invocation. This is required to clearly say where the POST content begins

Step 2: Invoke the OSM service

wget -O indianairportsrestaurants.osm — post-file= indianairportsrestaurants.opass https://overpass-api.de/api/interpreter

Note:

  1. Take a look at the Overpass interpreter URL we have used and how we are using the contents of the file that we created as POST content
  2. There are multiple service providers for Overpass API. If one is not working, you can use one of these (Read Public Overpass API instances section)

This will create the output OSM file which is in XML format. We need to convert this to “GeoJSON” so that we can use in our Python programs.

Step 3: Install osmtogeojson utility

  1. brew install npm
  2. npm install -g omstogeojson

I use MAC and hence I brewed. Use the appropriate one for your OS. You can also check the osmtogeojson project here.

Step 4: Convert to GeoJSON

Now that we have the conversion utility, all we need to do is to convert.

osmtogeojson < indianairportsrestaurants.osm >indianairportsrestaurants.geojson

Note:

The “<” and “>” are the input, output redirection operators respectively (used in *nix environments).

Part 4: Python geography plots

Folium is a great package for geography-based visualizations. It accepts GeoJSONs to represent geographies, supports multiple types of tiling options, supports plotting heat maps (Choropleths), dropping Pins, mouse-hover and tool-tip support and so on. I will use these features to demonstrate how we can examine the airport and their neighboring hotel distributions in a geography map.

Here is a good website to get started on Folium package for Geo visualization: https://www.nagarajbhat.com/post/folium-visualization/

Before we use Folium, we need a way to read and present the GeoJSON data that we created in the last section. This can be done using “geopandas” package.

Here are the major steps that we will perform:

  1. Read the Airports and Restaurants GeoJSON file
  2. Read the Indian state boundary GeoJSON file
  3. Map the airports and restaurants to the Indian states they belong to (OSM won’t understand this mapping. We have to build this manually)
  4. Build a choropleth map that displays airports and hotels of the states that are selected
  5. Add tooltip to show names of these places

The entire code for this section can be found here as a Jupyter notebook.

Step 1

Read the Indian airport and restaurants Geojson package into a geopandas data frame. A geopandas data frame is very similar to a pandas data frame except that it has an extra “geography” column. The default name is ‘geography’. But you can build geopandas data frame with different column name that serves as the geography info.

So let us begin.

The “read_file” call returns a “geopandas” data frame. Explore the frame (rather very wide data frame) and see the contents of the ‘geography’ column. This column contains Polygons, Points and Multi-Polygons which are geography objects represented using ‘shapely’ package (that’s what geopandas uses internally). Geopandas data frames support all the usual pandas data frame APIs (like filter, join etc.) but also support extra geography functions which implicitly work on ‘geography’ column. For e.g. geo_raw_df.centroid() function will automatically compute the middle point of all polygons inside the geopandas data frame.

Take a look at the various columns. These correspond to various JSON fields inferred from the GeoJSON that had passed. Curious ones can connect the content in the geopandas data frame with GeoJSON content.

We further filter this frame to get 2 separate geo-frames, one for Airport and one for Hotels. The curious can explore these frames to understand more.

Take-away

  1. Geopandas data frames are very similar to pandas data frame except for the presence of ‘geography’ column. Supports regular indexing, slicing, filtering, joins etc. just like pandas frames.
  2. Geopandas provides several geography APIs which implicitly act on the ‘geography’ columns

Step 2

Read the Indian state GeoJSON that we created at the very first step.

We filter this data and we get only the states and union territories of India. Take a look at how we are filtering and the reason why.

Step 3

Map airports and hotels to state and union territory by invoking the point-in-polygon API. We have to do this because the airport and hotels JSON output does not provide state mappings for all points of interest for us.

Take a look at ‘addr:state” property in the airports and hotels geo frame. You will see almost all are ‘null’.

In order to point-in-polygon correctly, we need to move from the lat-long space to another coordinate reference system (CRS built for distances and map projections). Now, a given geopandas data frame can be switched back and forth between different CRS using simple API calls. Examine the contents of ‘geography’ column after each CRS transformation and you will understand what is going on.

Take-away

  1. Coordinate Reference Systems (CRS) are fundamental to any kind of distance, point-in-polygon operations performed on geopandas data frames.
  2. Always convert to Mercator projection (epsg:3857) before calculating distances or invoking any APIs that measure distances or make use of distances.
  3. You can always convert back from Mercator to lat-long using (epsg:4326)

Step 4

Before we build Choropleth maps, we should first build a “metric” for each state so that we can color states appropriately. We can either color by number of airports or number of hotels around airports in a state. We will build both these metrics and keep them ready.

The code below shows how you calculate the metric as well as demonstrates ‘merge’ operation between geopandas and pandas data frames.

Before we visualize things on Choropleth maps, we need to pause and think about how we want to layout and present.

  1. Do we want to show all Airports and hotels in a Geography map and color the states according to airport/hotel count?

Nope. This won’t work. There are nearly 2000 points and displaying them on a graph will lead to a messed-up display. But coloring states according to airport count looks sensible.

Plotting strategy

To avoid plotting all points together, we will introduce a state-filter in our graph so that only airports and hotels of selected states are displayed. This way, the user can choose a state of his/her interest based on Choropleth heat-map and then enable display of airports and hotels on that state of interest, zoom-in the map, then examine them in detail.

Folium strategy

Geography maps are made of layers. The base layer is the tile-mapping that you choose for your map. On top of this, you lay your Choropleth. On top of the Choropleth, you can add layers of pins and so on. To realize our objective, we will make each state as a separate layer. Then we will introduce a layer control which allows users to select the layer they want to see.

Base layer

Choropleth layer

Airport & Hotel state-wise layers

Adding a layer control and saving as HTML

NOTE: The HTML output produced is independent and can be visualized in any browser by anyone. All GeoJSON data are already baked in the HTML and so there is no additional data you need to ship along with the HTML.

Final output

Here is the final output (manual edit for political correctness, literally & figuratively!)

Choropleth heat map based on number of airports

Rajasthan and Uttar Pradesh are on top. If we zoom and check Rajasthan, we can see there are several airports out there. The blue pins are Airports and the green ones are Hotels. The name pops up, if you click on the pin (our tooltips).

Airports in Rajasthan

Take-away

1. Geographic maps in Folium are made of several layers — overlaid one on top of another

2. Users can be given control on which layers he/she wants to see. This helps to avoid unnecessary cluttering of information on the map

3. Folium maps can be saved to HTML and it becomes a standalone document that can be shipped to anyone. No software needed. Just a browser!

Hope this guide helped you understand how to deal with OSM API, geographic data and visualizations in Python! Happy geographing!

Clap, share & spread the knowledge! Thank you!

--

--