Address quality in OpenStreetMap: a comparison

Published in REACH NOW · 8 min read · Oct 8, 2020

The Place Search Team at moovel takes care of the place and commuter search experience in our products. Our goal is to improve users’ overall search experience. Part of that task is providing a usable and coherent place search feature with which users can find the places they are looking for with minimal effort.

While developing and maintaining a place search engine is something any ambitious team can take on, having full control over the underlying data is a nearly impossible challenge. Internally, we distinguish between three different data sources: public transport stops, points of interest (POI) and address data.

In the context of mobility and multimodal transit apps, data requirements are manifold. We cannot accept even minor differences in public transit stop information: stop names in place search should match stop names in reality (or as users perceive them) as closely as possible. Furthermore, locations must be exact so that we can calculate routes and estimated times of arrival.

Requirements for POIs and addresses are less strict than those for public transit stops. Slightly different names are acceptable as long as users can easily match their expectations to our data. Minor location differences (e.g. up to 10m), or the lack of a specific address on a street where we have all others, are also acceptable.
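A tolerance like "up to 10m" can be checked with a plain great-circle distance between the coordinates of the two sources. The following is only an illustrative sketch (the helper name and coordinates are made up, not part of our pipeline):

```python
import math

def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance between two points, in metres."""
    r = 6371000.0  # mean Earth radius in metres
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Two illustrative points roughly 10 m apart in latitude
print(round(distance_m(52.5200, 13.4050, 52.52009, 13.4050), 1))  # → 10.0
```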

For example, for Glockengießerwall in Hamburg our proprietary, comprehensive and costly data source has the numbers 5 and 5a, while OpenStreetMap (OSM) shows just Glockengießerwall 5. For addresses this is acceptable for us, but it could easily be improved.

We have already been using OSM and OpenAddresses for quite some time in other parts of our service and supported regions of operation. However, OpenAddresses’ worldwide coverage is patchy, especially in Germany, which is the focus of this research. So we decided to give open data from OSM a chance to compete in German address search, assuming it would not be as good as our reference address data set.

What is the quality of an open data set then?

In our address search we leverage a comprehensive and commercial address data set, which we can use as a reference to compare OSM against. We want to seek answers to two questions:

  • How good is OSM in terms of address quality?
  • Does it meet our requirements?

To do so, we decided to take some steps which will be described in the upcoming sections.

Define the way a comparison can be made

We selected areas common to both sources, delimited e.g. by zip code, and compared street and address data from the same area in different parts of the country. The map below shows the selected regions for a representative comparison.

[Image: selected regions for a representative comparison]

Retrieving data

OSM data was downloaded from Geofabrik as a .pbf file and extracted using the pelias-pbf2json tool. Below is an example of how addresses can be retrieved from a .pbf file.

export ZIP_CODE=10407

./pelias-pbf2json.darwin-x64 \
-leveldb=leveldb_folder \
-tags="addr:housenumber+addr:street+addr:postcode~$ZIP_CODE" \
-waynodes berlin-latest.osm.pbf \
| jq -s -r '.[] | [
.tags."addr:street",
.tags."addr:housenumber",
.tags."addr:postcode",
.tags."addr:suburb",
.tags."addr:city",
.type,
.lat,
.lon,
.id] | @csv' > berlin-$ZIP_CODE.csv

In the example above we use jq, a command-line JSON processor, to filter the output down to the fields that are of interest for address comparison. As a result, we get a .csv file for addresses within the given zip code (10407 in this case), containing all relevant properties such as street, house number, coordinates, city and suburb. A fragment of the resulting .csv can be found in the following table.

[Image: addresses within a given zip-code]

Our proprietary data source already has data in the CSV format.

OSM is a widely known open-data project edited by an open community, which inevitably leads to ambiguity and data quality problems: mistakes, typos, different address naming schemes, several kinds of duplicates, among others. In order to neutralize these errors and inaccuracies we took the following steps.

Expand house numbers for OSM addresses

All further manipulations were made in JupyterLab. We imported the .csv files into pandas DataFrames (DFs); pandas is an open-source library for data analysis in the Python programming language.
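As a sketch of that import step, the column order follows the jq array from the extraction command above (the column names themselves are our own labels). In the real pipeline we would read `berlin-10407.csv`; a tiny inline sample keeps the sketch self-contained:

```python
import io
import pandas as pd

# Column order matches the jq array in the extraction step above;
# the names are assumptions for this sketch.
COLUMNS = ['street', 'housenumber', 'postcode', 'suburb', 'city',
           'type', 'lat', 'lon', 'id']

# Stand-in for open('berlin-10407.csv'):
sample = io.StringIO(
    '"Am Friedrichshain","29","10407","Prenzlauer Berg","Berlin","node",52.528,13.43,123\n'
)
osm_df = pd.read_csv(sample, names=COLUMNS, header=None)
print(osm_df.loc[0, 'street'])  # → Am Friedrichshain
```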

Node extraction

Nodes, ways and relations are the basic components of OSM’s conceptual data model. Though all of them may carry address components, we are interested in nodes only. As a next step, we exclude ways, as well as the coordinate columns, from the DF, since they play no part in comparing address components.
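That filtering step might look as follows (a minimal sketch on a toy DataFrame, not our actual notebook code):

```python
import pandas as pd

# Toy data: one node and one way with otherwise identical attributes
df = pd.DataFrame({
    'street': ['Am Friedrichshain', 'Am Friedrichshain'],
    'housenumber': ['29', '30'],
    'type': ['node', 'way'],
    'lat': [52.528, 52.529],
    'lon': [13.430, 13.431],
})

# Keep node rows only and drop the coordinates, which are not part
# of the attribute comparison.
nodes = (df[df['type'] == 'node']
         .drop(columns=['lat', 'lon'])
         .reset_index(drop=True))
print(len(nodes))  # → 1
```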

In the following picture, we can see the fragment of a retrieved DF for one zip-code.

[Image: fragment of a retrieved DF]

The first problem we encounter is how house numbers are stored, for example Am Friedrichshain 29–32. In our proprietary data source, that would usually mean several rows: Am Friedrichshain 29, Am Friedrichshain 30, and so on. Other variants include:

  • StreetName 3a-c
  • StreetName 3,4
  • StreetName 3;4
  • StreetName 3/4

We want to be able to unfold the examples above into different addresses. We can do so by searching for special characters in these strings, as shown in this small script. The data frame will look different after unfolding these house numbers.

[Image: data frame after unfolding house numbers]
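One possible way to unfold such compound house numbers is sketched below. This is a hypothetical helper, not the original script; letter ranges like 3a-c would need additional handling:

```python
import re
import pandas as pd

def expand_housenumbers(value):
    """Split compound house numbers like '3,4', '3;4', '3/4' or '29-32'
    into individual numbers. Letter ranges ('3a-c') are left untouched."""
    parts = re.split(r'[,;/]', str(value))
    result = []
    for part in parts:
        part = part.strip()
        # Numeric ranges, written with a hyphen or an en dash: "29-32", "29–32"
        m = re.fullmatch(r'(\d+)\s*[-\u2013]\s*(\d+)', part)
        if m:
            lo, hi = int(m.group(1)), int(m.group(2))
            result.extend(str(n) for n in range(lo, hi + 1))
        else:
            result.append(part)
    return result

df = pd.DataFrame({'street': ['Am Friedrichshain'], 'housenumber': ['29-32']})
df['housenumber'] = df['housenumber'].map(expand_housenumbers)
df = df.explode('housenumber').reset_index(drop=True)
print(len(df))  # → 4
```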

Normalization

Similarly to house numbers, German street names are written in different ways. For example, we found a street that is named Berliner Straße in one source and Berliner Str. in the other. In this case, we can assume it is actually the same street and neither version is better than the other. Let’s normalize the spelling of street names:

def normalize_osm_data(df):
    # Normalize abbreviated spellings of "Straße" so both sources match.
    # Caution: "St." can also mean "Sankt" (e.g. St. Pauli), so this
    # simple replacement is a heuristic, not a complete solution.
    for i in range(0, len(df)):
        if 'St. ' in df.loc[i, 'street']:
            df.loc[i, 'street'] = df.loc[i, 'street'].replace('St. ', 'Straße ')
        if df.loc[i, 'street'].endswith('Str.'):
            df.loc[i, 'street'] = df.loc[i, 'street'].replace('Str.', 'Straße')
        if df.loc[i, 'street'].startswith('St.'):
            df.loc[i, 'street'] = df.loc[i, 'street'].replace('St.', 'Straße')
    return df

Comparison

For each area analyzed we consider the following criteria:

  • Number of unique addresses per source and area — we expect different results for the two sources, and this needs to be checked
  • Number of unique street names — either data source might miss not only certain addresses but also a whole street
  • Differences in address data — a combination of the previous two metrics, which should give us the numbers to draw a conclusion on potentially switching data sources.

Our assessment then consists of three steps:

  • Checking unique addresses in both data sets
  • Checking numbers of unique street names
  • Overlapping information in both sources

Unique street names can be found through:

def get_unique_streets(df):
    # Sort by street name, keep only the street column and drop duplicates.
    df = df.sort_values(by='street', ascending=True).reset_index()
    df = df[['street']]
    return df.drop_duplicates()

Unique addresses can be determined, and the results saved to files for later analysis, as follows:

def unique_address_names(df1, df2):
    # Outer merge with an indicator column telling us which source
    # each row came from: left only, right only, or both.
    merge = df1.merge(df2, how='outer', indicator='source')
    result_l = merge[merge.source.eq('left_only')].drop('source', axis=1)
    result_r = merge[merge.source.eq('right_only')].drop('source', axis=1)
    result_b = merge[merge.source.eq('both')].drop('source', axis=1)
    result_l.to_csv('export_csv/STREET-NAMES_only.csv', index=False)
    result_r.to_csv('export_csv/STREET-NAMES_only_OSM.csv', index=False)
    return result_l, result_r, result_b

The previous examples show that some data processing is required before we can compare the two sources: normalization, data expansion and deduplication.

At the same time, open data has advantages over our proprietary data source. For instance, OSM can be edited and corrected right away, while our proprietary data source is only updated at regular intervals and we have practically no influence on it.

Having results for each zip code, we can simply calculate averages across all considered zones, create the corresponding charts and analyse them.

Results

In the graph below we can see the absolute numbers of unique street names and of addresses in the involved regions.

Zones 1 to 9 are inner-city areas of big cities like Berlin, Hamburg, Stuttgart, Cologne or Dresden, whereas zones 10 and 11 are parts of smaller towns in the south of Germany. This discrepancy can likely be explained by the fact that OSM coverage is generally better in urban areas: it is a community-driven source, so content varies with population and activity.

Similar to the numbers for addresses, we can see from the graph below that both sources follow a common pattern and show quite similar numbers, despite some differences between zones. Changes in urban areas are possibly adopted faster than in rural areas, which causes differences in street counts. The numbers for Germany are actually higher than one might think: in 2017 alone, around 11,000 streets were edited, that is, newly built, partitioned or renamed.

Let’s now take a deeper look at both sources: we compare each individual address and check whether an exact match exists in the other data set. For the sake of simplicity, we aggregate the numbers over the selected zones and calculate their averages. From the next chart, we can see that 10% of all addresses in our proprietary data source could be found only there, while the remaining 90% are also available in OpenStreetMap.

Aside from addresses that are missing entirely, for whatever reason, quite a large share of the mismatches in OpenStreetMap are cases like Waldowstraße in Berlin: our proprietary data set has the numbers 6A/6B/6C, whereas OSM has just the number 6, yet it is actually the same building! Another example are small (sometimes private) streets that are part of a park area, like Burg 2 in Esslingen.

OpenStreetMap, in turn, also contains addresses that can be found only there, at a rate of 7%, which is similar to the unique share of our proprietary data source. The following chart shows the average address-found ratio for all involved areas.

Looking at the combined set, 87% of all available addresses could be found in both sources, while 9% are unique to our proprietary data source and 4% to OpenStreetMap, as the following diagram shows.

Conclusion

The regions involved in this analysis cover our main areas of interest, big and smaller cities. Based on the results and the plausible reasons for the differences, we conclude that OSM is a valid and promising alternative to our proprietary address data source in many contexts and for many applications.

Using OSM as an address data source will of course require some normalization, deduplication and other processing work, but if we are looking for an open-data alternative for an address data source, OpenStreetMap is definitely worth considering.

Written by Sergey Chorba
