Statistical NLP on OpenStreetMap: Part 2
training Conditional Random Fields on 1 billion street addresses
A while back I wrote a piece publicly introducing libpostal, an open-source, open-data-trained C library and companion NLP model for parsing and normalizing international street addresses. Since then, libpostal’s user base has grown to include governments, startups, large companies, researchers, and data journalists from over a dozen countries around the world.
Today I’m very excited to announce the release of libpostal 1.0, featuring a new address parser trained on over 1 billion examples in every inhabited country on Earth from great open data sets like OpenStreetMap and OpenAddresses. There are several new tags for complex sub-building structures (unit, floor, staircase, etc.), P.O. boxes, and categorical queries, each handling around 30 different languages. Best of all, the address parser now features a much more powerful machine learning model (a Conditional Random Field or CRF), which can infer a globally optimal tag sequence instead of making local decisions at each word.
The new model achieves 99.45% full-parse accuracy on held-out addresses (i.e. addresses from the training set that were purposefully removed so we could evaluate the parser on addresses it hasn’t seen before). This is more significant than just a half a percent improvement over the original parser’s 98.9% result because now it has to deal with more labels and more languages/countries.
The new parser in GIF form
Trained in every inhabited country on Earth
In the initial release there were several countries that could not be included either because we did not have address formats for them or because, in the case of East Asian system addresses e.g. in China, Japan, and South Korea, the format depends on which language/script is used.
The 1.0 release substantially improves both the breadth of libpostal’s coverage by extracting more records from more places in OSM (there are still conspicuous gaps, but it’s much improved), and the depth of coverage by adding sources like OpenAddresses. Let’s start with the OSM changes.
East Asian system addresses
Obtaining suitable training data for e.g. China, Japan, and South Korea presented the most difficulty because the address format depends on which language/script is being used. In Chinese, Japanese, or Korean, addresses are written largest-to-smallest i.e. starting with country, whereas when written in Latin script they’re written smallest-to-largest according to Western conventions. As such, the initial release of the libpostal parser was not trained on any of these countries (we had the addresses before but they would have been in the wrong order).
After a few upstream pull requests to the international address formatting templates used in libpostal it’s now possible to use different formats for a country depending on the language. These changes opened up training data in China, Taiwan, Hong Kong, Macau, Japan, and South Korea, and libpostal 1.0 can handle them all. For toponyms, at least in larger cities, OSM usually contains at minimum a local-language name (name=“大阪市”), a transliterated/Romanized name (name:ja_rm=“Ōsaka-shi”), and an English name (name:en=“Osaka”), so as long as the data’s available it should be trained on each of these variations.
Japan presented an additional challenge as the requirements for inclusion in the OSM address data set used to be that an address had to have both addr:housenumber and addr:street, but in Japan very few addresses have street names. Instead, Japanese addresses are usually composed of neighborhood (chōme), and block + building number (bangō), e.g. “一丁目5番3号” (1-chome, 5-ban, 3-go). This is currently handled with a special data set where only addr:housenumber or venue name is required and the rest of the components are obtained through reverse geocoding.
Randomizing address formats
Even outside of the East Asian system there’s often more than one way of writing an address. For instance, an English-speaker living in Berlin might write “1 Herbert-von-Karajan-Straße” whereas the correct format would be “Herbert-von-Karajan-Straße, 1”. Previously, the parser might’ve gotten the first address wrong because it had only ever seen the correct ordering in the training data. Similarly, geocoding APIs like Google Places sometimes return results in an order that differs from the official country formats used in the training data.
Libpostal now contains formatting configs which can, with small random probabilities, switch the ordering of certain address components so the parser has to learn multiple formats. There are probably still a few missing conventions here and there but this release covers most of the issues that have been reported.
Admin boundary mappings
Most of the time an OSM address simply contains, say, a house number and a street name sans city, state, country, etc. so we obtain that information by reverse-geocoding to boundary polygons. Libpostal uses a standard nomenclature for admin levels: “state” for first level administrative divisions of a country, “state_district” for the second level, etc. OSM has no such nomenclature, only a notion of 11 or 12 numbered admin levels which can mean something different in each country.
Libpostal maintains a config file per country which maps the values of OSM’s admin_level (and sometimes various other properties like “designation” in the UK) to our taxonomy. These configs have been expanded to include every country in the world and the existing configs have been audited to make sure place names are getting classified as we’d expect. They’re easy to edit, and allow for overrides/exceptions. Exceptions can be one-offs e.g. Moscow the region should be considered a city because there’s no separate boundary in OSM for the city, or based on a containing polygon e.g. admin_level=5 means something different within Moscow than it does in the rest of Russia.
Libpostal 1.0 introduces two new admin-level tags that came up as part of the place mapping expansion/audit:
- country_region: this represents an informal subdivision of a country without any official/political status.
- world_region: currently only used for appending “West Indies” after the country name, a pattern frequently used in the English-speaking Caribbean e.g. “Jamaica, West Indies” or “Grenada, W.I” (thanks to Javid al Karuzi from Jamaica for pointing this out). This tag could also be used for continents at some point.
No umlaut left behind
In the new release, libpostal makes only the slightest of modifications to the user input, namely lowercasing, NFC Unicode normalization, and replacement of HTML entities e.g. “&” becomes “&”. We no longer convert “ä” to “ae”, which is not even a correct normalization in languages like Finnish. Instead of using the same form of transliteration at training time and runtime the parser now trains on multiple forms of the string (with accents, accents stripped, and Latin-ASCII transliterated with language-specific transliterators for German, Estonian, and the Scandinavian languages). This way, the parser can accept input in any of those forms.
Reverse geocoding to buildings
Many times a point of interest or venue in OSM is listed only with its name and venue type (e.g. amenity=clinic), whereas the house number and street will be listed on the containing building.
In the newer training sets we can reverse-geocode to over 35 million building geometries around the world. This might sound like a memory-hungry index, but in fact we only need to store an R-tree in memory (still ~2 GB, but not even close to the size of the admin polygons). This is because most of the building polygons are tiny, the bounding-box match obtained from the R-tree is highly selective, and because each polygon will only be needed a few times. The actual geometries are kept on disk in a LevelDB with a small LRU cache on top.
Point-based places and neighborhoods
Not everywhere in OSM is well-mapped, and in a number of places the best representations we have for cities are points rather than polygons.
Previously, we would only add cities and other place names to addresses using point-in-polygon tests, meaning we’d end up ignoring point-based cities altogether. Now we have a geohash-based index that
can find nearby points within a certain bounding box when no city can be found through reverse-geoding.
OSM also has almost no neighborhood geometries, so for countries not covered by Quattroshapes or ClickThatHood (our sources for neighborhood reverse-geocoding), nearby neighborhood points from OSM can also be used to augment the address.
Places-only data set
Another frequent reason for missing city names in libpostal’s data was places that either contained no OSM addresses or had invalid/wonky polygons. So there’s now a training set that makes sure every toponym gets some minimum level of representation in the training data. If the record contains population statistics the number of training examples is increased proportional to the population. This way, if we have no addresses for two places with the same name and two different place types (a neighborhood and city perhaps), the larger of the two will be preferred.
For all these training examples we add parent admins (which have higher probability of being added for smaller places — e.g. Oakland, Iowa needs context but Oakland, CA can just be “Oakland”), and any postal codes associated with the toponym if available.
Streets-only data set
In many countries there’s better road network data than addresses, probably because OSM is built from satellite data and labeling streets is one of the first steps in making a map.
Similar to the places-only data set, the idea here is for the parser to get a sense of what streets look like in languages that are not otherwise well-represented at the address level like Arabic, Farsi, and Chinese. We’d also expect the street names in OSM to change more quickly than its addresses in places like eThekwini/Durban, South Africa where street names are being rapidly decolonized i.e. being renamed to honor those who’ve fought against oppression rather than their oppressors. Decolonization is important in the US context as well — street names reflect who and what we canonize. There are more than a few Jefferson Davis Highways where Angela Davis Highways ought to be.
In addition to the OpenStreetMap data, there are two other notable sources that make up the complete training set of over 1 billion examples.
The OpenAddresses project is a free, openly-licensed (mostly CC-BY) data set of address points all over the world. The data come directly from government sources, typically a cadastral (land ownership) registry used for tax assessment, etc. As such the addresses are both messier and more complete than OSM. If a city/county/state/country is in OpenAddresses that means that virtually every single address on every single street will be part of the data set.
It’s this depth of coverage that’s most impressive, although OpenAddresses has been steadily expanding the breadth of its footprint over time. In addition to covering almost every population center in the United States and most of Europe, it now has near-countrywide coverage in Brazil, Japan, Australia, New Zealand, Kuwait, Mexico, Belarus, and Kazakhstan, along with major cities in Colombia, Russia, South Korea, Ukraine, and South Africa, to name a few.
Libpostal is now importing most of OpenAddresses as parser training data, which helps in 3 important ways:
- OpenAddresses is built on complete county GIS data sets, so it tends to contain streets not in the OSM road network. This allows libpostal to learn a reasonably complete model of what street names look like, such that words not known to the model are more likely to be part of venue names (venues may never be well-covered in OSM/OpenAddresses, tend to use many rare words, and change frequently).
- To deal with abbreviations in the OSM addresses, libpostal chooses abbreviations uniformly at random from its dictionaries (“St” and “Str” are equally likely abbreviations for “Street”), whereas OpenAddresses contains a more reasonable distribution of how address abbreviations are actually used by government sources.
- Most of the addresses in OA have postal codes and city names, which help us to build an index of valid postal code contexts (10001 could be a house number but this is less likely if it’s preceded by “NY”/“New York”).
Given that the corpus is so large there are plenty of “redundant” addresses as far as libpostal training is concerned i.e. we don’t learn much from having every single address on a given street (10–15 would probably suffice). However, in practice the additional data doesn’t hurt training and shores up some sparsely-mapped areas of OSM.
GeoPlanet postal codes
Though no longer maintained, Yahoo’s GeoPlanet is a ubiquitous data set throughout the geo world. Despite a few quirks (when was there ever a neighborhood called “Adelphi” in Brooklyn?), one of the most compelling things it contains is a database of postal codes and their associated admins for a number of countries including virtually every existing postcode in the UK, Canada, Portugal, Japan, and India among others. So we constructed a training set with simply postal codes and their associated city/state/country. This comes in handy for distinguishing postcodes from house numbers (more on that later).
Libpostal coverage by country and data set
Note: for those curious why there are only 241 countries in the OSM data set, 2 of those (Gibraltar and Jersey) were not labeled with an ISO code in OSM as of this writing and so were lumped in with the UK. Western Sahara is disputed and also lacks an ISO code in OSM, and the other 5 include largely unpopulated arctic territories, as well as Antarctica itself, which is really not a country but sometimes needs to be listed as such in geographic systems.
The first version of libpostal didn’t really have a way to handle apartment numbers, and this has been one of our most requested features. For most geocoding-related applications unit information is not directly useful as users would rarely type an apartment number into a map-based search box. However, for those without the luxury of user input e.g. batch de-duping CSVs, dealing with units is commonplace. We want to be able to handle incoming addresses with apartment/flat numbers, etc. even if only to discard them for a subsequent geocoding step.
This presented a challenge in terms of acquiring training data. Unit numbers are quite rare in OSM. There is an “addr:flats” key, but it often contains simple numbers/ranges whereas ideally we want the parser to handle variants like “Apt 123”, “#123”, “Apt. № 123A”, etc. across many languages. We also want to be able to recognize expressions like “3rd Floor”, etc. Sub-building expressions follow only a few distinct patterns in almost every language. We don’t know much about the interiors of buildings from OSM, but if the primary goal is simply for libpostal to understand these expressions we can “hallucinate” them i.e. add apartment numbers, etc. to addresses at random so when libpostal encounters real sub-building information in the wild it knows what to do.
In the 1.0 release, the parser generates these expressions in the 35 most popular languages in OSM, and includes several new labels to handle them: unit, level, staircase, entrance, and PO box.
This can be an apartment, suite, unit, lot, office, or any other secondary unit type. In Australia, one can find an even broader range of unit types on an envelope including “Coolroom 8”, “Antenna 237”, and “Marine Berth 12”.
Sometimes unit numbers are appended to the building number with a hyphen or a slash. In this case, libpostal does not try to parse out which side is the house number and which side is the unit, as the formats are highly country-specific (and the parser does not know/predict which country it’s in). Users can split the tokens further as a post-processing step if needed.
Floor indicators are quite common in company addresses and large apartment buildings.
Usually floors are simply numbered e.g. “2nd Floor”, “Fl 2”, or “2/F” in English, “Piso 8” in Spanish, etc. In most of Europe and Latin America, there are also special designators for the ground floor:
- bajos in most of the Spanish-speaking world. In Spain it’s planta baja (Castilian Spanish), pis baix (Catalan), or beheko solairua (Basque).
- rez-de-chaussée in French, parterre in Swiss French.
- andar terréo in Brazilian Portuguese (rés-do-chão in Portugal).
- begane grond in Dutch.
There are often also special words for half-floors (mezzanine), basements and sub-basements, the top floor (e.g. penthouse or attic), and sometimes even a special name for the first floor above ground level (piano nobile in Italian, entresuelo in Spanish). In Spain there’s even a special name for the 2nd floor above ground level (3rd floor for Americans): planta principal after which the floors are numbered.
The address configs allow us to select a random floor number (constrained by the building’s height if known from OSM) and build a phrase which can be a simple numeric phrase like “Floor 3”, an ordinal phrase like “3rd Floor”, one of the special floor types above if applicable, a Roman numeral in some languages/countries, or even a numeric expression using libpostal’s numeric spellout e.g. “Third Floor” in English or “Dritte Etage” in German.
Most English-speakers have probably never seen an address that lists which staircase to use, but in large apartment blocks in places like Romania, Finland, and some parts of Vienna, it’s not entirely uncommon.
Again this is less common, but for instance in Tel Aviv most residential addresses have a Hebrew letter associated with the entrance. In Bulgaria, it might be common to see “Vhod B” in Latin script or “Вход Б” in Cyrillic.
Libpostal only generates/handles the simplest style of numbered/lettered entrances currently i.e. we don’t generate directions like “Entrance on 34th St” as there are too many variations and that would require accessing the road network.
While not at all useful for the purposes of geocoding, PO boxes are quite prevalent in mailing addresses in rural communities as well as in addresses for companies, government offices, and non-profits/NGOs.
These are also very easy to generate. Some fraction of the time, after creating an address training example we can strip it down to the admin components only (city, state, etc.), pick a phrase and a random number from 1–6 digits (and/or possibly a letter), and create examples like “P.O. Box #1234, New York, NY 10013” in many languages.
Simple place name queries
Initially libpostal focused on parsing full street addresses, venue addresses, etc. However, in the geocoding use case, many queries from users are simple place names such as “Oakland, CA” and we’d like the parser to be handle these simpler queries as well.
Parsing place names relies heavily on memorization via dynamically-built dictionaries (sometimes called gazettes in the NLP literature) constructed at training time. These dictionaries allow us to look up whether the current word is part of a known phrase, and the phrase’s possible types e.g.“New York” can be a state, city, neighborhood, etc. So in the 1.0 release we’ve tried to make sure virtually every place name in the world makes it in to libpostal’s training data, and is labeled consistently.
The first version of libpostal treated tags from OSM as givens and only reverse geocoded to polygons for cities, etc. as needed. This method was error-prone for a few reasons:
- addr:city=“New York NY”. If the phrase “New York NY” is added to the place dictionaries as a potential city name, the parser will have a difficult time disambiguating it from “New York” as the city, “NY” as the state.
- addr:city=“Brooklyn” when the polygon index classifies Brooklyn as a city_district. Brooklyn certainly can refer to a city (Brooklyn, Connecticut) or a neighborhood (Brooklyn, Jacksonville), and libpostal can handle that kind of disambiguation from context words, but what it shouldn’t have to do is learn to distinguish between multiple interchangeable ways of referring to the same Brooklyn.
There are now a number of checks that are done across the training sets to make sure that user-specified names do not contain multiple places, and that place names are used consistently throughout the training data.
Cities and other place names may contain abbreviations when written by humans, whereas they are discouraged in OSM. In larger cities like Fort Lauderdale there’s probably an address or two where a human wrote addr:city=“Ft Lauderdale”, but in a smaller city like Fort Walton Beach, this is rarely the case. The initial release of libpostal abbreviated street names at random in the training data, and in 1.0 we’ve started doing the same thing for toponyms from OSM, so the parser should now be able to handle most common variants like Fort=>Ft, Saint=>St, etc. as well as some scarcely-used abbreviations like “S Francisco” for San Francisco (scarcely-used for the one in California at least — in fact San Francisco is one of the most common city names in the world, and in Spanish, S. Francisco may be more likely).
Sometimes the input to a geocoder is not a place or address at all, but a category of place e.g. “Restaurants in Brooklyn”. Libpostal 1.0 adds rudimentary support for parsing these types of queries. Namely there are two new tags for category for generic category words like “restaurants” in several languages, and a tag called near for expressions like “in”, “near”, “nearby”, “near me”, etc. Putting it all together, we can build tagged training examples like:
Restaurants/category in/near Brooklyn/city_district
The specific phrases in each language are bootstrapped from Nominatim’s Special Phrases list, which maps OSM tags (like amenity=restaurant) to phrases in various languages. So for addresses in OSM that have the tag amenity=restaurant, we can add one example for the address itself and another for the category query. The category queries are supported in ~30 languages, with varying coverage. We also haven’t built training data for cuisine names or nuanced categories of places yet, but anyone can edit the per-language config files, and we’d love contributions.
A subset of the category query is the chain store query e.g. “Walmarts in NYC” (don’t search for that, there are none). We obtained an international list of chain stores/brands by taking the most frequent venue names in OSM, combing through the top few thousand, and adding them to the new “chains.txt” dictionary in libpostal. Whenever we see a known chain as a venue name, we can create a generic query for it such as:
Walmarts/house in/near NYC/city
In this case the venue name would have been “Walmart” but we also add plural forms to the English dictionaries.
Features in machine learning are simply attributes of the input that help us predict the output. In address parsing, our input starts out as simply a sequence of words — that’s it. Contrary to popular belief, the libpostal parser has no notion of country or language baked into it, no complex rule sets or giant if/then blocks. We need to be able to transform our text into a vector of numbers that can be plugged into equations along the lines of y = wx + b. Here the y’s are our output labels (house_number, road, etc.), b is an intercept term, the w’s are the weights that our model will learn by trying to minimize its own errors, and the x’s are features we will define.
Initially, the features used were fairly similar to what one might see in a part-of-speech tagging model: things like previous/current/next word with some normalizations and grouping of known phrases, bigrams, previous labels, etc. I described the more generic NLP features in some detail in this Github question. However, in 1.0 it was advantageous to extract several new types of features to help with the parser’s most common confusion classes.
N-grams for rare/unknown words
Even with a billion addresses, we don’t have the luxury of assuming that the data is complete. There will inevitably be words the parser encounters in user input that it has never seen before, often venue names, sometimes words in morphologically rich languages like Finnish or Turkish.
We wanted to simulate these unknown words in the training data so the parser could have an idea of what to do with them. This was accomplished by simply treating words that occurred fewer than, say, 5 times as unknown or out-of-vocabulary.
However, we still wanted to capture some information about these unknown words, so we used 3–6 character n-gram sub-sequences, distinguishing the “edge” n-grams at the beginning and end of the word from those in the middle. This was done for each word that occurs ≤ 50 times in the training data. What’s nice about this approach is the simulated unknown words (occurring ≤ 5 times) are not simply aliased to a common unknown word token. Instead, they can share parameters and thus statistical strength with known, but rare, vocabulary words (i.e. those in the 5–50 occurrences range) that have overlapping character sequences. That way when the parser encounters a true unknown word in the wild, say a misspelling, it is still able to use n-gram information if the n-grams have been seen before.
Note that we only extract n-grams for non-numeric words that are not part of a known place name i.e. it’s mostly for words that are part of rare street names and venue names.
For hyphenated words (common in e.g. German street addresses) there’s a new feature for each distinct word in the hyphenated phrase if that word is in the model’s vocabulary. This is similar to the n-gram features but may yield longer/more complete words whose de-hyphenated versions can also be found in the training data.
As mentioned, by far the most common error the original parser would make was confusing street names with venue names and vice versa. Consider the following example:
Barboncino/house 781/house_number Franklin/road Ave/road
Barboncino is a venue name, and like many venue names, it only occurs once in the training data (and thus is treated as an unknown word). It’s also the first word so we don’t have the model’s previous tag prediction to rely on, as we would if this happened later in the string, so essentially there’s not much for the parser to work with. If we only had to handle English, it might be easy for the parser to learn a strong positive weight for an unknown first word followed by a number being a venue name. However, when considering the rest of the world, where street usually comes before house number, there are enough counterexamples that predicting “road” might be the likelier bet without more information.
So, to help disambiguate, there’s a new feature which, in the case of an unknown first word, finds the relative position of a) the first number in the righthand context and b) the first thoroughfare type (from libpostal’s dictionaries) in the righthand context. This basically tells the model whether it appears that we’re in a house_number+road setting or a road+house_number setting.
The second most common error class the original parser would make was house_number vs. postcode. In the previous release, we would mask the digits of any number (“12345–6789”=>“DDDDD-DDDD”) unless it was a known postcode, in which case we’d keep the unique word. This can be problematic in cases like Kingston, Jamaica where the postcodes are single and double-digit numbers e.g. “Kingston 5”, “Kingston 12”, etc. If we separated every occurrence of the numbers 1–20 into a distinct feature per number simply because they might be postcodes in Kingston, those numbers would not be able to share statistical strength.
So we now precompute an index of postcodes and their admin contexts. The intuition is that something like “10001” is more likely to be a postcode if words/phrases like “New York”, “NY”, “United States”, etc. are immediately to its left or right. This means we don’t in fact have to keep a string feature around for every postcode. We simply map postcodes and toponyms to two different integer ID spaces and keep a bipartite graph of their co-occurrences. When a candidate postcode occurs, we look up whether there’s any supporting context for it in fact being a postcode. It’s also useful to have a negative feature which denotes a possible postcode with no support (more likely to be a house number or something else).
Conditional Random Fields FTW
Perhaps the biggest contributor to libpostal 1.0’s accuracy gains is its powerful machine learning model, a Conditional Random Field (CRF).
CRFs sometimes gets a bad rap in industry as overly-complex. Matthew Honnibal, author of the popular SpaCy package, espouses this view in an otherwise great post on part-of-speech tagging in Python. There he characterizes CRFs as “slow and complicated” while producing only trivial gains in accuracy. CRFs are a bit complicated to implement, but they’re by no means slow (particularly when trained with the averaged perceptron) and the accuracy gains on some tasks are anything but trivial. CRFs achieve near state-of-the-art results in a variety of sequence tagging tasks common in natural language processing such as part-of-speech tagging, chunking, named-entity recognition, and now address parsing!
The main advantage a CRF has over a greedy sequence model is the ability to revise its previous decisions.
A greedy model (computer science greedy, not money greedy) makes one decision per token and commits to it. It’s a linear model that takes the dot product of the observed features and the learned weights for those features and produces a score for each output label in the model (house_number, road, etc.). It then simply picks the top-scoring label and moves on to the next word with no regrets. Let’s call this the YOLO method of parsing.
The CRF, instead of scoring each token, finds the best scoring sequence of decisions, and can revise a previous decision if doing so would help the next word in the sequence score higher. We’ll call this the Spock method of parsing (the good of the many outweighs the good of the few, or, indeed, the one). Let’s illustrate the differences with some pseudocode:
Greedy parsing (YOLO method)
Take the “Barboncino 781 Franklin Ave” from before as an example. This is a simplification of what the greedy parser’s state looks like:
tokens = [“Barboncino”, “DDD”, “Franklin”, “Ave”]
current_index = 0
previous_prediction = NULL
The first word, “Barboncino”, is unknown to the parser (as mentioned, we create artificial unknown words by treating anything that occurred fewer than 5 times in the training data as out-of-vocabulary), so it essentially just sees an unknown word followed by a three-digit number.
If the training data only contained English/American addresses, that parse would be pretty easy for the model to learn. An unknown word followed by a number is almost always a venue name. However, when you add street addresses from most of the rest of the world, where street comes before house number, the model starts to be more confident that an unknown word followed by a number is an obscure street name. Using n-grams of the unknown words helps to a degree, but in this case many of the character sequences in Barboncino also overlap with Italian street names.
So it scores each possible class that “Barboncino” could be, and, probably a narrow margin, predicts “road”, at which point the state looks like:
tokens = [“Barboncino”, “DDD”, “Franklin”, “Ave”]
current_index = 1
previous_prediction = “road”
and it moves on to the next prediction without looking back. YOLO! “Barboncino” is now considered a road, and that (alternative) fact will be used in classifying the next token “781”, but the first prediction is already wrong. This can lead to bad/confusing sequences of predictions.
CRF parsing (Spock method)
For the first token in the string, the CRF essentially works the same way as the greedy model. The state here is slightly different:
tokens = [“Barboncino”, “DDD”, “Franklin”, “Ave”]
current_index = 0
# Indexed as: "house", "house_number", "road"
scores = [4.0, 1.0, 5.0]
So initially the most likely class for this unknown word followed by a number would be “road”, just as before. However, when the CRF gets to the second token, things get interesting. The model scores an LxL matrix where L is the number of labels the parser can output. The score at i, j in said matrix represents the score for the current word being labeled class i and the previous word being labeled class j. So every label we could transition to by every label we could have transitioned from.
tokens = [“Barboncino”, “DDD”, “Franklin”, “Ave”]
current_index = 1
# Rows are previous tag, columns are current tag
# both are indexed as: "house", "house_number", "road"
scores = [[1.0, 9.0, 5.0],
[0.5, 0.1, 0.2],
[0.6, 0.1, 0.4]]
Now when the parser gets to “781”, the highest score in the matrix indicates that the current label should be “house_number” and the previous label should actually be “house” instead of “road”. As such, the CRF can revise its previous decisions if that would help the current token score better.
When making predictions for a sequence, a CRF store back-pointers which represent the highest scoring paths through the string. So if we make a decision at time t that changes time t-1, we’d still have access to “what was the best path if the token at t-1 had label k instead of j?” and that can potentially change the decisions at t-2, t-3, etc.
This is known as Viterbi inference, and it can be particularly effective in the case of address parsing in multiple languages. 🖖
A fast, scalable CRF implementation
Libpostal’s CRF is, to the best of my knowledge, the largest ever trained publicly. As such, existing CRF implementations like Wapiti and CRFsuite, both of which require the training set to fit in memory, were not an option on a billion examples. So libpostal implements its own CRF which combines some of the best ideas from previous implementations with our focus on scalability, memory efficiency, and multilingualism.
There are several ways to learn/optimize the weights for a CRF, the most popular being stochastic gradient descent and LBFGS. However, if we’re willing to sacrifice the ability to assign probabilities to our parses, perhaps the fastest way to train a CRF is using our old friend the averaged perceptron. The only major difference between this implementation and the greedy model is in the more powerful Viterbi inference described above, but all the other great things about the averaged perceptron remain intact. As before, the perceptron’s error-driven update procedure means we don’t need to update the weights except when we make a mistake, so it also enjoys the same sparsity properties as the greedy model.
CRFsuite’s implementation (the faster of the two mentioned above) formed much of the basis for libpostal’s, though we made a critical performance improvement. In particular, CRFsuite makes N² heap accesses worst case when storing the back-pointers during Viterbi inference, whereas the argmax transition for each label could be stored in a stack variable instead, requiring only N trips to the heap. Viterbi inference is the main event in averaged perceptron training of CRFs. After a few million examples the parser starts to make good predictions and it rarely needs to update the weights. It mostly just runs inference, verifies its results were correct, and moves on. But that still means performing inference several billion times so it can find and improve on the cases it gets wrong.
In practice, moving the argmax to the stack results in a ~2X performance improvement in Viterbi inference with the 20 labels currently used in libpostal (should be even more dramatic with larger tag sets as CRF inference is quadratic in the number of labels). This puts the CRF on par with the greedy averaged perceptron in terms of address throughput both at training time and at runtime.
In many CRF implementations, including CRFsuite, there are two types of features: state features (local attributes of the word or sequence of words — most of libpostal’s features fall under this category), and transition features (a matrix of weights for transitions between labels).
However, for libpostal we actually have many different languages and address formats, so raw transition weights aren’t as helpful without knowing something about the words and/or their surrounding context. So we also have the notion of state-transition features, which are the combination of the two e.g. “word=Franklin and previous label=house_number”. These existed in the greedy model as well, but in the CRF context the previous label is not a given so we’ll score the transitions out of every possible label. This being the case, the weights for the state-transition features can actually be condensed to sparse LxL matrices instead of creating L distinct features as strings and storing sparse vectors for each.
Some might notice that the new parser model is larger than the previous version. It’s trained on over 10x more data so we’d expect it be somewhat larger, but the model that ended up being published is about half the size it could have been without some sparsity tricks.
In a likelihood-based optimization setting like stochastic gradient descent, it’s possible to encourage sparse solutions with something like L1 regularization or the FTRL-Proximal method, an online regularization scheme used by Google. However, there’s not a clear application of the L1 penalty to perceptron learning.
Standard feature pruning i.e. removing features that occurred fewer than N times in the training data is also not ideal because sometimes rare words or features like n-grams can still be highly predictive. We also already do some pruning anyway when creating artificial unknown words, and the model was still going to end up as large as half of an average laptop’s main memory.
We instead opted for a method proposed by Goldberg et al. for learning sparser perceptrons by essentially keeping track of how many weight updates each feature participates in and ignoring features until they have been part of at least N updates. Features below the threshold are also ignored when making predictions during training, so the model can’t rely on information that it wouldn’t have if those features were eliminated. Note that since the perceptron only updates its weights when it makes a mistake, the features that get added to the model are those that occur in examples that the current model is getting wrong. This has the benefit of allowing rare features through as long as they prove their usefulness.
100GB of public training data
It has always been possible to generate libpostal’s training data from scratch but it’s a very resource-intensive process, so for 1.0 we’ve published the parser’s training data on S3 (see the libpostal repo for more details on how to download it).
With the more powerful CRF model, parser mistakes are more often related to errors in the training data. If the parser doesn’t perform as well as expected on a particular address, libpostal users can inspect the training data using simple Unix tools like grep and awk to try to determine if there’s some pattern/style that’s not being captured.
Supporting the libpostal project
Mapzen has been the primary sponsor of libpostal, although I’m an independent developer. As such, libpostal doesn’t have the same kind of continuous support as open-source projects developed entirely in-house by full-time employees at a company. Thus far I’ve been able to implement libpostal through a series of contracts for each milestone, but going forward I’d like to figure out a more sustainable development budget to maintain the project and continue to push the frontiers in geo NLP research.
If your company is using libpostal, especially as part of a paid offering, consider asking your organization to support the project financially. Libpostal is on OpenCollective, where you can sponsor the project by making a recurring monthly donation. As a sponsor, your company logo will appear on the repo page just below the introduction. Please consider joining Mapzen in supporting libpostal.
Ok, so we’re up to 99.45%. Is there still room for improvement? Always, although it’s probably less about inching toward 100% accuracy and more about expanding what that number means. This post has thus far been largely-technical and about solving various data issues with the map, but there are some important human topics to address too.
Within the mapping community, and the tech community in general, we have to face some uncomfortable facts. Despite laudable efforts and some outstanding exceptions, the core contributor base of OSM is still largely made up of white men who live in North America and Europe. This has a substantial impact on where and what in the world gets mapped. As Humanitarian OpenStreetMap Team founder Kate Chapman has said: “maps are an abstraction of reality, but who’s reality are we talking about?”
In well-mapped parts of the world, many aspects of diversity and inclusion come down to venues/POIs. Why are there still more distinct tags for types of sex clubs than child care centers? Why are bathrooms not tagged as being specifically usable by women? And why did someone feel that it was remotely acceptable to mansplain away the validity of this issue, as happened during a well-researched gender inclusion talk at State of the Map 2016? Everyone who edits maps should read the excellent Tagging in Support of Women and Girls resource on the OSM wiki. We also need to acknowledge that racism, like sexism and a whole lot of other isms, often lives in the NULL values. While each type of exclusion has its own nuances, they all rhyme, and I believe the kind of thinking that’s currently going in to rewriting sexist maps can be applied to proactively supporting other under-represented groups as well.
Much of libpostal was written at my favorite local coffee shop: Breukelen Coffee House. It’s a proudly Black-owned business, makes a great cold brew, has free wifi, and plays a lot of Erykah Badu (in very meta moments I’ve even heard a track that Ms. Badu did with D.R.A.M. called “Wifi”, which is everything). I searched for the cafe’s address in OSM, and didn’t find it. Practically every other business on its same avenue was listed, even a place that had only recently opened, so it wasn’t just a fluke or an old import. There are a number of great Black-owned eateries and bars in and around my neighborhood so I decided to look beyond Breukelen. I found there were at least a dozen places that were conspicuously missing from OSM despite other nearby businesses being represented, or were only in OSM because of an NYC buildings import and lacked basic details. So I added a number of places and created an experimental “black_owned=yes” tag. Still, it is disconcerting to see a view of the world that looks like a post from the blog “Stuff White People Like”.
There are thriving Black-owned businesses that were not on my local map. The same is true of venues owned by or serving other underrepresented groups: Muslims, the Latina/o community in the US, Indigenous peoples, LGBTIQ folks, immigrants, refugees, people with disabilities, and every intersection thereof. Representation matters and we have an obligation to ensure that people can see themselves and their reality in maps, that their voices are included, recognized, and valued.
That’s the direction my contributions to open data will be headed, and I hope yours will too.
Thanks for reading!