Querying OpenStreetMap Buildings with AWS Athena

Published in

door2door Engineering

6 min readAug 31, 2017

A few months ago, AWS not only made OpenStreetMap planet data available on S3, but it also made it query-able with Athena. Pretty cool, no?

Now, in theory, you can just construct an SQL query, send it to Athena, and then do whatever you want with the results. No more:

Updating OSM planet data yourself; it gets updated on AWS whenever OSM publishes it, once a week.
Transforming the data into a query-able format; Athena handles that for you.
Query/request frequency limits (it’s still AWS though, so other limitations might apply 💸 💸)

At door2door, we had a pretty straight-forward use-case for this: we needed to get buildings in specific regions based only on where they were, and transform those buildings into GeoJSON that we can attach our data to, and visualize on the front-end on top of our base map.

0. Pre-requisites

Have an AWS account (💁)
This tutorial uses boto3 (AWS SDK for Python) for the demo code, but the underlying principles should be the same with other SDKs as well.
Know that Athena uses PrestoDB, which has a few extra/specific functions that we’re going to use (no prior knowledge of those is required though).
Also know that Athena is currently only supported in us-east-1. Edit: this is no longer true; you can see where it’s supported here.
Some knowledge of the OSM data structure.

1. Set up the planet table

This is only done once — the first time you’re accessing OSM data on AWS Athena. Run this in the web Athena console (or through an AWS SDK):

What this does is create a regular SQL table. Please note that:

Every entry can have tags: a list of key=value pairs that describe what the entry represents.
Entry “type” can be one of those: node, way, or relation.
Only entries of type “node” have latitude and longitude values.
Only entries of type “way” have nds (nodes), which is a list of the IDs of the nodes “in” that way.
Only entries of type “relation” have members, which is also a list of IDs, but it can point to nodes, ways, or other relations. It also includes the roles for those members inside the relation.

2. The definition of a “building” in OSM

A way, or
A relation that has the tag set: type=multipolygon

…that has, in its tags, the key “building” paired with any non-null value.

Now, to get the buildings in a specific region, there’s one more condition to that entry: all of the nodes referenced inside it have to exist within the bounding box of the region we want.

So what we want looks a bit like this on a high level:

3. Construct the SQL Query

For the purpose of this tutorial, we will only consider the following cases:

Buildings that are ways.
Buildings that are relations whose members include at least 1 way with the role “outer” (so we only get the outline of the buildings)

More complex cases include:

Getting the inner polygon (just slightly more complex parsing).
Buildings that are relations whose members also include relations (also known as Super Relations) which can have an unlimited amount of levels (this is a limitation of this approach, explained in more details below).

To start writing the query, we approach this “from the ground up”:

Get all the nodes within the given bounding box.

Get all the ways. Ever.

Get all the multi-polygon relations that contain “outer” ways.

Select the buildings!

The reason we keep the node_position is because nodes are ordered within a way, which makes it easy to construct a polygon and visualize that.

4. The Query Cycle

In general, to query Athena from a project/script, you do four things:

1. Send the query to Athena:

Arguments:

The SQL itself.
The S3 path where you want the query results to be saved (as CSV).

Returns:

The query execution ID (note that the filename of the results in S3 is <query_execution_id>.csv)

2. Wait for the query to be processed:

Athena, at the moment, doesn’t have any “waiters” which are common with other AWS services, like S3.
Instead, write your own waiter. Or, alternatively, wait for the file to be available on S3 (using an S3 waiter to continue in the same script, or SNS to trigger a Lambda function, for example, that processes the results).
A pull request has been submitted to add a waiter to Athena, but there’s no guarantee this functionality will be available (at the time of writing this).

3. Download the query results from S3.

4. Process those results.

Nodes inside a “way” are ordered, so to construct a building, we could use a library that creates Polygons (like shapely, in Python) and just put in the nodes in their stored order.

So the entire script could look like this one here 🔗 (part of the demo repo linked below).

5. Results!

For the sake of this tutorial, I ran this code to get all the buildings in Iceland. And it looks like this:

Part of the buildings generated in Iceland (red)

You can see the entire region’s buildings (zoom in and out, play with them) here:

Iceland Buildings

Buildings in Iceland generated by AWS Athena.

mariam.dev

Also, the full code on github:

mariamrf/athena-buildings

A script that queries OSM using AWS Athena for buildings in a given bounding box. Demo: - mariamrf/athena-buildings

github.com

Limitations of using AWS Athena with OSM data

Super-Relations

Since there’s no limit on how nested relations can be, querying it from the ground up (from the node-level, because that’s where the coordinates are) is tricky; according to the Athena user guide (page 90), recursive WITH statements are not supported, so any recursion would have to be done from the client side (query, process results, then keep querying).

Ideas on how to solve this problem:

Brute-forcing it: get all the super-relations ever with the tags you want and keep going down (in levels) until you get their base nodes, then filter by their coordinates.
From the ground up (in the buildings example): get all the filtered nodes, then their parent ways, and if those ways have parent relations, check if those relations have parents, and so on until no relations have any parent relations, then filter the top-level elements by tags.

The problem with both ideas is having to wait an entire query cycle for every recursion (and in the first case, getting all buildings super-relations in the entire planet can be costly both in terms of data scanned, and therefore money, and in terms of memory on the device processing the results).

Conclusion

Querying OSM data with Athena can be pretty cool depending on your use case. If your queries are relatively small/simple or infrequent, for example, this is a huge improvement compared to handling the data yourself, both from the perspective of time and money spent.

What do you think of it? Would you give it a try?