OSM Water: How well are Minnesota’s water features mapped?

Matthew Manley · Published in Unearth · 7 min read · Oct 2, 2019

For the second straight year, Critigen’s Open Data & Development team was well represented at the SOTM US conference. In our OSM series, we’ll describe our presentations in depth, reveal our findings, and bring forward important questions to the OSM community.

Authors: Matthew Manley, Alex Sweeney

What’s going on with natural=water?

With so much focus on routing and road features, our team was looking to concentrate on an unexplored corner of OSM. We’ve spent the past few months evaluating the quality of water features in OSM, and wanted to dive deeper into the origin of water data in Minnesota, specifically comparing community-generated data against externally sourced bulk imports.

Bulk imports have been a contentious issue within the OSM community. In many ways bulk imports conflict with the core ethos of OSM: empowering local mappers to digitize their local communities. Despite this, automated imports are responsible for a substantial amount of data in OSM, and have come from datasets such as TIGER, Bing Maps’ building footprints, and the National Hydrography Dataset (NHD), among others. Figure 1 illustrates the effect of two key imports in Minnesota.

Figure 1. Effect of bulk imports on highway and water feature counts in MN.

Immediately apparent from this graph are large jumps in highway and water feature counts that correspond to the TIGER and NHD bulk imports. Together they represent massive feature creation events in Minnesota, contributing lots of data over very short periods. In this way, bulk imports significantly contribute to “map completeness” in OSM. But how active is the OSM community in editing these bulk-imported features? And to what degree has the data “moved beyond” the import? Examining how and if these features have changed since ingest can provide insight into how active the OSM community is in maintaining and updating features after bulk imports.

That is where we decided to focus our study: identifying how OSM water data evolves from the bulk import phase to its current state. To narrow our geographic scope, we used Minnesota as our case study given that SOTM US convened in Minneapolis this year (it also doesn’t hurt that it’s known as the “land of 10,000 lakes”!).

Methodology

Inland water definition

To define “inland water” we turned to the OSM basemap itself. A CartoCSS stylesheet controls how features are rendered on OSM. We selected the key-value pairs that are symbolized as water and used those to build our inland water definition. You can see the tags we ended up with in the snippet below.

landuse=basin,reservoir
natural=spring,water,wetland
waterway=canal,ditch,drain,stream,river,riverbank
wetland=tidalflat,reedbed

Time-Slices

Since we wanted to analyze inland water features over time, we captured three “time-slices” of OSM data derived from the full OSM history PBF file using Osmium Tool. For those who aren’t familiar, Osmium Tool is an awesome command-line tool for manipulating and extracting OSM data. The first step in creating these slices was soft-clipping the history file to the Minnesota boundary. We then filtered the PBFs to only include features that fit our definition of inland water and generated three time-slices for our analysis:

  • Pre-NHD: 2009/09/30
  • Post-NHD: 2011/09/30
  • Current state: 2019/04/07

Change Analysis

To find geometric changes in inland water features, we loaded our PBF time-slices into a PostgreSQL database using ogr2ogr and created custom queries to classify changes. Since we were primarily interested in what happened to features after the NHD import was completed, we compared the 2011 and 2019 time-slices. We used PostGIS and SQL functions to classify each feature into one of five categories. Figure 2 shows the breakdown of these classifications.

Figure 2. Change classifications.

No Change

Using unique OSM identifiers (IDs), we inner joined features from the 2011 and 2019 time-slices and compared their geometries using the PostGIS function ST_Equals. If a feature’s geometry didn’t change between time-slices, we categorized it as no change.

No polygonal change sql snippet.
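The original SQL was presented as an image; as a rough sketch of the idea, assuming each time-slice was loaded into its own table (the names water_2011, water_2019, osm_id, and geom below are illustrative, not the actual schema — see our GitHub repo for the real queries):

```sql
-- Features whose geometry is identical in both time-slices: "no change".
-- Table and column names are hypothetical stand-ins for the real schema.
SELECT a.osm_id
FROM water_2011 a
JOIN water_2019 b
  ON a.osm_id = b.osm_id          -- same feature in both slices
WHERE ST_Equals(a.geom, b.geom);  -- geometry unchanged
```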

Change

Conversely, to isolate features that did change, we joined the time-slices to find everything excluded from the no change category, creating a mutually exclusive set. From here we bucketed features into four different categories.

i. Change in Geometry

To find features with a change in geometry we isolated those with the same OSM ID in 2011 and 2019 but with differing geometries.

Polygonal geometry change sql snippet.
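Again assuming the same hypothetical tables as above, a minimal sketch of this check could be:

```sql
-- Same OSM ID present in both slices, but with differing geometries:
-- "change in geometry". Schema names are illustrative.
SELECT a.osm_id
FROM water_2011 a
JOIN water_2019 b
  ON a.osm_id = b.osm_id               -- feature survived the interval
WHERE NOT ST_Equals(a.geom, b.geom);   -- but its geometry was edited
```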

ii. Replaced

If an OSM ID wasn’t found in both time-slices, we checked whether the feature had a spatial representation in both 2011 and 2019 under differing OSM IDs. We accomplished this using a combination of the PostGIS functions ST_Intersects and ST_Touches to find coincident features that overlapped each other. For example, if an OSM ID was only present in 2011 and intersected, but didn’t merely touch, a feature in 2019, we determined that feature to be replaced.

Polygonal replacement sql snippet.
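A sketch of the replacement test, under the same assumed schema (the real query is in our repo):

```sql
-- A feature present only in 2011 whose footprint overlaps, and does not
-- merely touch, a 2019 feature is classified as "replaced".
SELECT DISTINCT a.osm_id
FROM water_2011 a
JOIN water_2019 b
  ON ST_Intersects(a.geom, b.geom)     -- footprints overlap...
 AND NOT ST_Touches(a.geom, b.geom)    -- ...by more than a shared boundary
WHERE a.osm_id NOT IN (SELECT osm_id FROM water_2019);  -- old ID is gone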

iii. Deleted

We identified deletions by isolating OSM IDs that existed only in 2011 and weren’t replaced by a feature in 2019.

Polygonal deletion sql snippet.
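Sketched against the same hypothetical tables, the deletion test combines both prior conditions:

```sql
-- IDs present only in 2011 with no overlapping 2019 feature: "deleted".
SELECT a.osm_id
FROM water_2011 a
LEFT JOIN water_2019 b ON a.osm_id = b.osm_id
WHERE b.osm_id IS NULL                      -- ID absent in 2019
  AND NOT EXISTS (                          -- and nothing replaced it
    SELECT 1 FROM water_2019 c
    WHERE ST_Intersects(a.geom, c.geom)
      AND NOT ST_Touches(a.geom, c.geom));
```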

iv. Added

Similarly, we found additions by isolating OSM IDs that existed only in 2019 and were not replacements for features in 2011.

Polygonal additions sql snippet.
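The addition test is the mirror image, again as a sketch over the assumed schema:

```sql
-- IDs present only in 2019 that replace no 2011 feature: "added".
SELECT b.osm_id
FROM water_2019 b
LEFT JOIN water_2011 a ON b.osm_id = a.osm_id
WHERE a.osm_id IS NULL                      -- ID absent in 2011
  AND NOT EXISTS (                          -- and not a replacement
    SELECT 1 FROM water_2011 c
    WHERE ST_Intersects(b.geom, c.geom)
      AND NOT ST_Touches(b.geom, c.geom));
```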

Findings

The State of Water in Minnesota

In terms of the composition of OSM water data in Minnesota, we found that the overwhelming majority of data was imported from NHD: 84% of polygons and 98% of lines. This speaks to the importance of bulk imports more generally. If NHD hadn’t been imported, it’s unlikely that similar amounts of data could be generated by the community in such a short amount of time.

Identifying Change In Bulk Imports

By leveraging inner joins and PostGIS functions, we found that only 15% of features imported from NHD had changed between 2011 and 2019. The remaining 85% of features were left unchanged. We successfully identified four different types of change: changes in geometry, replacements, deletions, and additions. Figure 4 illustrates this breakdown. In examining our results we found several examples of changes to NHD data that added to the overall accuracy of the map. We isolated six examples of this and presented them in a slider visualization here. These cases illustrate the OSM community’s contribution to validating and editing bulk imported data.

Figure 4. Changes in NHD water features between 2011 and 2019

Conclusions & Next Steps

Bulk imports can be a huge asset for the OSM community, as long as they're done in a careful and mindful manner. The community should be ready to address any inaccuracies that are introduced into OSM via the import of external data. The examples we presented in our slider visualization indicate that users are indeed interacting with and improving bulk-imported NHD data. Continuing to encourage this type of editing would be beneficial to the community as a whole, as these edits are in themselves a form of validation.

However, we were intrigued by the low rate of change for the bulk-imported features from 2011 to 2019. This may suggest that even though bulk imports can drastically improve map completeness, the rate at which the community interacts with these features may be quite low. What we weren't able to answer was the "why." Here are three possibilities:

  • Perhaps the imported data is accurate and doesn’t require editing. This may be the case for some features, but we identified some inaccuracies within our analysis.
  • Another possibility is that bulk imports truly do decrease editing rates due to a psychological component: if users see a relatively complete map they may be discouraged from editing.
  • Or a simpler explanation: the rates of editing water features may be low in general. This would need to be examined in contexts beyond Minnesota, but is nevertheless a possibility. One of the reasons we chose to analyze water was due to our perception that water features are often forgotten.

Along with answering the above question of why editing rates are low, there are many other interesting directions to pursue in the future. For one, we’d like to see how updated NHD data stacks up to OSM. Just how much has NHD changed since import and is there an opportunity for refreshing the data? We also concede that this analysis doesn’t take into account changes in tags, and future work should integrate a tagging time-series analysis.

Community Feedback

The community feedback we received was positive, and a lively Q&A session followed our presentation at SOTM US. A few themes emerged from this discussion, primarily centered on how replacements were classified, how NHD features were mapped to OSM tags, and the feedback/validation loop between external data sources and OSM. For more insight into our analyses, take a look at our GitHub repo for the SQL behind them. As for NHD features and how they were mapped to OSM tags, this resource describes which key-value pairs were used upon ingest.

We believe that the feedback loop between external data sources and the OSM community is a compelling topic for further investigation.

Can we establish a communication channel between third party resources and OSM, and if so, how? What’s the best way to combine rich datasets and an active community?

Up next, we’ll look to explore the differences between this year's NHD data refresh and what’s live in the OSM database.

Stay tuned and follow Critigen’s Unearth publication for the second part of our OSM series. We’re also live on Twitter at @osmquality!
