Impressive work.
Stefan Keller

Libpostal is more forgiving on street names and POI names, but for city names and other toponyms we rely heavily on gazetteers, which are compiled dynamically from the training data (OSM, OpenAddresses, GeoPlanet, etc.). If we were training on only one language or country, it might be possible to simply rely on structural features and let the CRF find the best parse, but because we handle international addresses with many different structures, we have to rely more on attributes of the words and phrases themselves. It seemed (and still does seem) to me a reasonable assumption that we can list every possible toponym in the world in our training data, even if we don’t have proper addresses everywhere.

The cases you mention sound like city prefixes that are valid but less commonly used, so they probably do not appear in OSM. However, in libpostal we can add structural elements like city prefixes to the data for an entire language or country some random proportion of the time, according to any discrete distribution we specify. This is already done for Russian and could easily be added for German or other languages if needed. Just create an issue on GitHub for it and I can add those prefixes to the next release. Optionally, prefixes can also be tied to specific OSM tags (in Russia there’s ru:place_type, which can determine whether a place should be a “gorod” or a “derevnya”, for example).
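To make the idea of adding prefixes "some random proportion of the time" concrete, here is a minimal Python sketch. It is not libpostal's actual training code or config format: the function name `add_city_prefix`, the `CITY_PREFIXES` table, the German prefixes, and the probabilities are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical discrete distribution over city prefixes for one language.
# None means "leave the city name unchanged". The values are illustrative,
# not taken from libpostal.
CITY_PREFIXES = {
    None: 0.90,
    "Stadt": 0.06,
    "Gemeinde": 0.04,
}


def add_city_prefix(city, rng, prefixes=CITY_PREFIXES):
    """Return the city name, optionally prefixed, sampled from `prefixes`."""
    choices = list(prefixes.keys())
    weights = list(prefixes.values())
    prefix = rng.choices(choices, weights=weights, k=1)[0]
    return city if prefix is None else f"{prefix} {city}"


# Augment a training toponym many times; most copies stay unprefixed.
rng = random.Random(0)
samples = [add_city_prefix("Neustadt", rng) for _ in range(1000)]
```

Because the prefix is sampled per training example, the model would still see the bare city name most of the time while also learning that the rarer prefixed forms belong to the city component.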

Possibly related: the parser does not currently handle spelling mistakes in place names. Different transliterations or names in different languages are fine as long as they’re listed in OSM, etc., but it won’t handle something like “New Yrok” correctly. Users are welcome to add their own spelling correction; it’s simple to use our training data to compute the dictionaries.
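As a rough illustration of user-side spelling correction, the sketch below builds a dictionary from a list of toponyms and snaps a misspelled name to its closest entry before the text is handed to the parser. It uses Python's standard `difflib` module rather than anything in libpostal; `build_dictionary`, `correct_toponym`, the tiny toponym list, and the 0.8 similarity cutoff are assumptions made for this example.

```python
import difflib


def build_dictionary(training_toponyms):
    """Map lowercased toponyms to their canonical spelling."""
    return {name.lower(): name for name in training_toponyms}


def correct_toponym(text, dictionary, cutoff=0.8):
    """Return the closest known toponym, or the input unchanged if none is close."""
    key = text.lower()
    if key in dictionary:
        return dictionary[key]
    matches = difflib.get_close_matches(key, list(dictionary), n=1, cutoff=cutoff)
    return dictionary[matches[0]] if matches else text


# Stand-in for toponyms extracted from the training data.
dictionary = build_dictionary(["New York", "Newark", "York"])
corrected = correct_toponym("New Yrok", dictionary)
```

A real dictionary would hold millions of names, so a linear scan like this one would likely need to be replaced by an indexed approach, but the overall flow (look up, fall back to the nearest match, otherwise pass through) would stay the same.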
