Parsing postal addresses with a transformer

Antoine Sinton
Published in continuity-tech
Aug 27, 2021 · 11 min read

Context

At Continuity we are developing a solution that combines AI and Open Data to support SME insurers and help them guarantee that each company is properly covered at each step of its existence.

In our everyday processing pipeline we process large lists of addresses received from our clients. These addresses will more often than not have been entered manually and will regularly include extraneous parts and/or various typos and abbreviations.

To provide accurate insights to our clients, we need to be able to rely on our pipelines to properly identify and geocode addresses with high confidence. Here we give some insights as to how we do this.

Geocoding

At the time of writing, we are mainly focused on French addresses. This means that for the geocoding part we can leverage the official address geocoding API, available here. It is mostly reliable, fast and precise, even though it suffers from a few “mishaps”:

https://twitter.com/vincevlo/status/1369593923765739523

The technology (addok) and database (Base Adresse Nationale) used by the API are open source; you can read about them here (in French). Addok is a rule-based engine highly tuned to French address parsing.

Let’s look at a typical answer from the API:

Answer from https://api-adresse.data.gouv.fr/search/?q=1%20rue%20de%20la%20victoire%20paris

As you can see, it returns a GeoJSON FeatureCollection with a list of features ordered by descending relevancy score. The first answer, with a score of 0.97, is easy to trust.
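For reference, here is a minimal way to query the API with the requests library (a sketch; only the label and score fields discussed above are used):

import requests

# Query the French national geocoding API for a single address.
resp = requests.get(
    "https://api-adresse.data.gouv.fr/search/",
    params={"q": "1 rue de la victoire paris", "limit": 1},
)
resp.raise_for_status()

# The response is a GeoJSON FeatureCollection whose features are sorted
# by descending relevancy score.
best = resp.json()["features"][0]["properties"]
print(best["label"], best["score"])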

But what happens if queries are less “clean”? For example, what do we get when we query an address that is still valid but contains “extra” information? Let’s try with “mr toto 1 rue de la victoire bp 795 paris”, which is the same valid address but with 1/ the name of a person and 2/ a PO box (“bp”). The name and PO box are both fake, but that is not relevant to the demonstration, as this information is not parsed or even queried by the API.

The response is still the good address, but the score has dropped to 0.58:

response for “mr toto 1 rue de la victoire bp 795 paris”

Now, let’s query with an invalid address, say “1 avenue victoire paris” which references a street that doesn’t actually exist (notice “avenue” instead of “rue”).

In this case, the API’s response is again the same, but the score this time is 0.53:

response for “1 avenue victoire paris”

Agreed, in this example 0.58 is greater than 0.53, but not by much, and an invalid address can often end up with a higher score than a valid one. To automate our processing as much as possible, we’d like clear-cut answers: high scores for valid addresses and lower scores for invalid ones.

This exemplifies one of our main issues with the addresses we receive. A common way to solve this is by pre-parsing input addresses to extract meaningful parts so as to pre-format and pass only these meaningful parts to the API.

Furthermore, we have other steps in our pipeline that benefit from proper parsing.

Finally, some stats on a test dataset of about 300k entries: the overall coverage is 91.6% (without filtering on score), which is already really good. All entries with a score above 0.7 are correct; we chose this value, empirically, as our cutoff for trusting the output as correct.

By coverage, we mean “percentage of properly geocoded addresses within our dataset”. That is, the number of entries for which we find the correct location.

As you can see below, though, the histogram of the scores returned by the API presents a large blob around 0.5. As mentioned above, we’d like to be able to more easily separate invalid addresses from valid ones.

Histogram of scores returned by the geocoding API for 300k example addresses

Parsing: first try

To help the API, one can imagine preprocessing the addresses to remove the parts that negatively impact scoring (or even make the API fail completely).

For our first try, we tested with an open source address parsing tool called libpostal.

It is highly versatile and works for all countries off the shelf. You can read all about it from the author here and here.

We wrote a minimalistic Python wrapper around it using the pypostal library.

Here is an example of how it parses addresses:

response from libpostal parsing
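At its simplest, the pypostal call looks like this (parse_address returns a list of (value, label) pairs):

from postal.parser import parse_address

# Each element is a (value, label) tuple, with labels such as
# 'house_number', 'road', 'postcode' or 'city'.
parsed = parse_address("mr toto 1 rue de la victoire bp 795 paris")
print(parsed)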

We then extract the fields of interest (house_number, road, etc.), concatenate them, and query the geocoding API as before with this new “cleaned” address.
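Putting the two steps together, the cleaning pass can be sketched as follows (the exact set of retained labels here is illustrative):

import requests
from postal.parser import parse_address

KEEP = {"house_number", "road", "postcode", "city"}  # fields of interest

def clean_and_geocode(raw_address: str) -> dict:
    # Keep only the parts labeled with a field of interest,
    # then rebuild a minimal query string for the geocoding API.
    parts = [value for value, label in parse_address(raw_address) if label in KEEP]
    cleaned = " ".join(parts)
    resp = requests.get(
        "https://api-adresse.data.gouv.fr/search/", params={"q": cleaned}
    )
    resp.raise_for_status()
    return resp.json()["features"][0]["properties"]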

The new score histogram on our test set after libpostal parsing is much better:

Histogram of scores returned by the geocoding API for 300k example addresses after preprocessing via libpostal

As you can see, the blob in the middle has shrunk significantly and there is a nice peak above 0.9. About half of the entries that previously scored below 0.7 now receive a score above 0.7 after libpostal processing, and a little over 25% of those that used to score below 0.7 now score above 0.9. This seems like a significant improvement when it comes to separating good addresses from bad ones.

But if we look more closely, it only increases the overall coverage by 0.2%, which is not as much as we’d hoped for.

So, what’s going on? Well, libpostal is wonderful, but not perfect, and sometimes has issues extracting the correct fields. Here are some examples where libpostal fails and thus negatively impacts the final outcome:

issues with libpostal parsing

Also, in a couple of these examples, not using libpostal is actually better, but we have no way to determine this a priori in a large-scale setting.

Parsing: new approach

Next, we tried building our own address parser.

At its core, address parsing is very similar to a named entity recognition (NER) problem, that is, the problem of assigning a label to various parts of an unstructured text. So, naturally, with current SOTA being what it is, one would immediately think “Transformer”. But using a full-fledged 300-million-plus-parameter behemoth seems like overkill, so we’ll build our own mini-transformer.

What is a transformer?

A transformer is a deep neural-network architecture that has taken over NLP in the last few years and is well on its way to taking over other fields as well. It’s a type of model that works on a sequence of inputs and applies what is called “attention” to provide context to all parts of the sequence. Transformers first came to prominence at the end of 2018, when BERT was released (see https://arxiv.org/abs/1810.04805).

If you are curious, a quick Google search will surface a lot of relevant information. Good starting points are the Wikipedia page and the seminal paper “Attention is all you need” (https://arxiv.org/abs/1706.03762). There are also a large number of blog posts on the subject, for example What is a Transformer? or The Illustrated Transformer, and many more.

from “Attention is all you need” https://arxiv.org/abs/1706.03762

The go-to library that pretty much everyone uses to work with transformer models is Hugging Face’s library called … transformers. It is the Swiss army knife for working with these models and, I’m pretty sure, will soon be able to make your coffee too. At the very least, I’d like to buy the team at Hugging Face a coffee one of these days…

I’m not going to go into all the details of what you can do with it; I’m going to focus on how we’ve used it to build our own parsing model.

Our model

Most transformer models you can work with are rather large. That means they are rather slow and memory consuming. Also, addresses don’t show the complexity and richness of variations that a typical language model will need to deal with. So we decided to build our very own “tiny” transformer and train it from scratch.

To do so, the main steps are:

  • define a dataset
  • train a custom tokenizer
  • build a model architecture
  • train the model
  • profit!

Dataset

We will focus on French addresses for the time being. All official French addresses are listed in the “Base Adresse Nationale” (BAN), which is open data and can be downloaded in CSV format from here. The list of fields is available here.

The BAN database contains more than 25 million different addresses. Since it is supposed to contain all valid addresses in France, and all entries are labeled, it’s a good training set for our problem.

There is another open dataset of French addresses called BANO, sourced from OpenStreetMap (https://prev.openstreetmap.fr/bano). It is smaller than BAN (15M addresses vs 25M) but has extra contributions not available in BAN. We expect the overall performance of our model to be roughly the same whichever dataset it is trained on, so we went with the bigger one, BAN.

We also have about 30k really hard, annotated entries sourced from various examples, which we sprinkle in during training.

As stated earlier, we are dealing with an NER-type problem, thus we need to label the different parts of each address.

Officially, an address is to be decomposed into 5 parts, which are the 5 fields extracted from the BAN database (numero, rep, nom_voie, code_postal, nom_commune):

parts of a French address

Some are optional, like the rep field, some are not, like code_postal.

The BAN is an idealized representation of the way addresses are written. In real-life data, there are a few extra parts which we need to add to the list of available tags:

  • sans_label: for parts that are not actually formatting for the address (translates to “no label”);
  • cedex: a common address modifier which indicates the postcode designates a specific company and gives mail a higher priority;
  • arrondissement: sometimes added to addresses for the three main cities (Paris, Lyon and Marseilles);
  • bp: for post boxes.

Also, quite often, random parts will be missing from inputs. There can also be typos and/or shorthand spellings.

And finally, for the NER task itself, we add BOS and EOS tags at the beginning and end, respectively, of each entry, as placeholders to identify the beginning and end parts of the address.
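Written out as code for clarity, the full label set looks like this (the exact constants are illustrative):

# BAN fields plus the extra tags described above, bracketed by BOS/EOS.
LABELS = [
    "BOS",
    "numero", "rep", "nom_voie", "code_postal", "nom_commune",
    "sans_label", "cedex", "arrondissement", "bp",
    "EOS",
]
LABEL2ID = {label: i for i, label in enumerate(LABELS)}
ID2LABEL = {i: label for label, i in LABEL2ID.items()}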

During training, we iterate over all entries in the dataset and transform each BAN entry on the fly by randomly applying a set of rules. Examples of these rules are:

  • Add cedex with small probability;
  • Add bp with small probability;
  • remove certain parts according to certain rules;
  • rename road type with probability 0.5 by random sampling from the full list of available road types;
  • add random text drawn from a random list of words with certain rules;
  • sometimes remove dashes and replace them with spaces;
  • add arrondissement when the town is one of the three that have arrondissements, with a random choice of the syntaxes in use;
  • shuffle parts under certain conditions.

Under these transformations, the address above can be randomly transformed into:

Examples of changes applied to input address “43 bis Rue de l’Ecole 17590 Saint Clément des Baleines” before training
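A simplified sketch of such an on-the-fly transformation (helper names, sample road types and probabilities are illustrative, not our actual augmentation code):

import random

ROAD_TYPES = ["rue", "avenue", "boulevard", "chemin", "impasse"]  # sample list

def augment(entry: dict) -> dict:
    """Randomly perturb one BAN entry on the fly (illustrative only)."""
    entry = dict(entry)
    # Rename the road type with probability 0.5.
    if random.random() < 0.5:
        words = entry["nom_voie"].split()
        words[0] = random.choice(ROAD_TYPES)
        entry["nom_voie"] = " ".join(words)
    # Add a fake post box or cedex mention with small probability.
    if random.random() < 0.05:
        entry["bp"] = f"bp {random.randint(1, 999)}"
    if random.random() < 0.05:
        entry["cedex"] = "cedex"
    # Sometimes replace dashes in the town name with spaces.
    if random.random() < 0.3:
        entry["nom_commune"] = entry["nom_commune"].replace("-", " ")
    return entry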

Tokenizer

The tokenizer is trained using a ByteLevelBPETokenizer from Hugging Face's tokenizers library on all lowercased addresses from BAN, with an output vocabulary size of 10k.

A great guide as to how to do this from scratch is available here.
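Following that guide, the training call itself is short. A sketch (the input file path is a placeholder, with one lowercased address per line dumped from the BAN CSV beforehand):

from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["ban_addresses_lowercased.txt"],   # placeholder path
    vocab_size=10_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("address_tokenizer")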

We limit the size of the vocabulary to 10k as inputs will be small (addresses tend to be less than 10/15 words long) and it helps reduce the overall size of the model.

We also experimented with a vocabulary size of 1k, but it impacted performance negatively.

Model architecture

As stated above, we want something small. For several reasons:

  • training time;
  • memory consumption;
  • inference time.

Indeed, we want something that can compete with libpostal performance-wise and that doesn’t use too much memory (FYI, libpostal uses approximately 2 GiB of RAM).

We start out with the basic Roberta architecture for a 10k token vocabulary and reduce it by setting:

  • number of layers to 6;
  • number of attention heads to 8;
  • hidden size to 128;

for everything else defaults are kept.

We add on top a simple one layer fully connected classification head.

If you are curious as to how to load an empty transformer from scratch using the transformers library, here is how we do it:

>>> from transformers import AutoConfig, AutoModel
>>>
>>> config = AutoConfig.from_pretrained('roberta-base')
>>> config.num_hidden_layers = 6
>>> config.num_attention_heads = 8
>>> config.hidden_size = 128
>>> config.vocab_size = 10000
>>>
>>> transformer = AutoModel.from_config(config)
>>>
>>> transformer.config
RobertaConfig {
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "eos_token_id": 2,
  "gradient_checkpointing": false,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 128,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 8,
  "num_hidden_layers": 6,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.8.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 10000
}

The transformer is thus loaded and trainable.

Config for our Roberta architecture
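For illustration, here is a minimal sketch of what the one-layer classification head on top could look like (the class name and the number of labels are placeholders, not our production code):

import torch.nn as nn

class AddressTagger(nn.Module):
    """Tiny Roberta encoder plus a single linear layer per token (sketch)."""

    def __init__(self, transformer, num_labels: int):
        super().__init__()
        self.transformer = transformer
        self.dropout = nn.Dropout(0.1)
        self.classifier = nn.Linear(transformer.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        hidden = self.transformer(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state
        return self.classifier(self.dropout(hidden))  # (batch, seq_len, num_labels)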

Train Model

Our model was trained using typical parameters:

  • CrossEntropy loss;
  • AdamW optimizer with learning rate of 1e-4.

Nothing special to report here. Training was performed on a rig with 2x GeForce RTX 2080 SUPER GPUs and took about 100 hours. We trained using a typical training/validation split with several validations per epoch. Training was halted using a typical early-stopping criterion, i.e. when the model started overfitting the training set relative to the validation set. Also, generating all examples on the fly can induce a slowdown, as each batch step waits for data to become available.

Why was it so slow for such a small model, you might wonder? The main reason is how the training and validation datasets were generated. Even though the source addresses were always the same in each dataset, the on-the-fly random process described above means the model rarely saw the exact same example twice.
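For completeness, a stripped-down sketch of the optimization setup (the model and data loader are placeholders; the loader is assumed to yield padded input_ids, attention_mask and label tensors):

import torch
import torch.nn as nn

def train_one_epoch(model, optimizer, train_loader, device="cpu"):
    """One pass over the training data with the setup described above (sketch)."""
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for input_ids, attention_mask, labels in train_loader:
        optimizer.zero_grad()
        logits = model(input_ids.to(device), attention_mask.to(device))
        # Flatten (batch, seq_len, num_labels) -> (batch * seq_len, num_labels).
        loss = loss_fn(logits.view(-1, logits.size(-1)), labels.to(device).view(-1))
        loss.backward()
        optimizer.step()

# e.g. optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)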

Profit! (or results)

To use our model in production, we wrap it in a simple Python API. Decoding the output sequence from the model is straightforward; it works without the need to rely on beam search, for instance.
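Concretely, the decode boils down to an argmax over the label dimension at each position, followed by mapping ids back to tag names. A sketch (assuming the custom tokenizer has been wrapped as a Hugging Face fast tokenizer and reusing an id-to-label mapping like the one sketched earlier):

import torch

def parse(model, tokenizer, address: str, id2label: dict) -> list:
    """Greedy per-token decoding: argmax over the label dimension (sketch)."""
    enc = tokenizer(address.lower(), return_tensors="pt")
    with torch.no_grad():
        logits = model(enc["input_ids"], enc["attention_mask"])
    pred_ids = logits.argmax(dim=-1).squeeze(0).tolist()
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"].squeeze(0).tolist())
    return list(zip(tokens, [id2label[i] for i in pred_ids]))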

Memory-wise, libpostal consumes a little less than 2 GiB of RAM. Once loaded, our model uses around 250 MiB. That is a great gain in situations where memory might be an issue.

On the same (slow) CPU hardware, querying 500 single addresses takes 35s with our libpostal API vs 75s with the new transformer API, meaning we are roughly twice as slow as libpostal. But we can easily batch queries to the model internally, reducing query time to 27s, which more than makes up for the slowdown.

Finally, let’s test the examples from earlier:

output from our custom address parser

It’s quite clear that all our examples are now properly parsed. This generalizes quite well.

The histogram is very similar to the one resulting from libpostal preprocessing:

Histogram of scores returned by the geocoding API for 300k example addresses after preprocessing via our model

But overall, the total coverage has reached 97.6%! That’s a gain of about 6 percentage points, which is quite impressive at this level.

Conclusion

Having the highest possible coverage for each step in our processing pipeline is important, because losing 10% or more of usable data per step means we might end up with very little meaningful data after only a few steps.

We’ve demonstrated here that we can increase the coverage of one of our primary steps by applying a simple but novel model, trained on open data, that outperforms common tools like libpostal.

A major drawback at the moment is that we are limited to a single country, but applying the same logic to other countries should not be a huge strain.
