Building an address parser with spaCy

Applying Named Entity Recognition to identify addresses.

Swapnil Saxena
Globant
12 min read · Nov 8, 2021


Address data for an organization is often vital in gathering customer analytics or supporting business operations such as marketing, logistics, delivery, and business correspondence. Cleansed, well-parsed, standardized, and validated addresses form the base for data consolidation and analytics engines.

The bulk of enterprise address data can be found in the form of raw address strings manually keyed into a database or flat files. But is it consumable? Here’s how one and the same US address can be written in different ways:

111 8th Ave Ste 1509 Tulsa OK 74136 US

C/o John Doe LLC, 111, 8th Ave Ste 1509, Oklahoma, 74136-1922, USA

111, 8th Ave Ste 1509, Tulsa, OK, , USA

Pretty messy. Right? 😳

And the code you wrote with Python regex just can’t handle these human absurdities.

And so, it becomes imperative to pre-process the data by parsing, de-duping, standardizing (mapping to standard names and filling in missing pieces), geo-tagging, etc. before it can be consumed for further analytics. Address parsing is one of these pre-processing steps: it identifies and segments an address string into components such as Recipient, Building, Street, State, County, Postal code, and whatever other components apply for a particular country.

It’s obvious that this class of problems simply can’t be addressed (pun intended😉) by writing traditional rule-based (often regular expression-driven) algorithms. We need more sophistication and this is where Natural Language Processing (NLP) algorithms come to the rescue.

What’s spaCy and why should I care?

Yeah, it’s spaCy (that’s how it’s written!). As introduced on its Wikipedia page,

spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython.

NLP does what humans can but traditional algorithms can’t: learn and improve. We build and train our language processing model to identify patterns, with the implied context, in a sentence, passage, or even a novel. Among the plethora of machine learning libraries available, spaCy is one such “knight in shining armor” that does the job with minimal effort and compute resources spent on the otherwise overwhelming model building and training process.

At its core, spaCy is built on Thinc, a deep learning library optimized for CPU usage (often an adoption constraint), and it tackles specialized NLP tasks such as tokenization, lemmatization, part-of-speech (POS) tagging, text classification, named-entity recognition, and many others. Without expanding on each of these techniques, we’ll limit our discussion to named-entity recognition, which is relevant to our address parsing use case.

Named-entity recognition (NER) & spaCy

Named-entity recognition (NER) is described as

a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.

Simply explained, we attempt to look for “real-world objects” relevant to specific domain categories while scanning over a text passage or sentence and highlight them when found. All this is done mainly through intuition and contextual understanding (acquired during the model training process) of the text, without writing any programmatic rules. Here’s what an NER model execution over a text passage can reveal:

NER output as generated by displaCy visualizer. Courtesy: spaCy NER usage guide

We get a neat representation of the different entities identified: organizations, geo-locations, dates, person names, etc. Quite powerful, huh?

spaCy provides an out-of-the-box NER feature as part of its pre-trained pipelines, so you don’t necessarily have to go through the steps of building a model from scratch (although custom model design is always an option): identifying a neural network architecture, adding layers, initializing/adjusting weights, etc.

The build-and-train process to create a statistical NER model in spaCy is fairly simple and follows a configuration-driven approach: we start with a pre-trained or empty language model, add an entity recognizer, optionally define custom entities, start the iterative training loop over our training set, and with a few adjustments to the training set and config we obtain an optimal model. Let’s go over these steps in the coming sections and build our address parser.

Applying NER to address parsing

As you may have gathered by now, our task of segmenting an address string into different components falls into the ambit of named-entity recognition. We’d like to see our NER model be capable of parsing any address string with reasonable accuracy, something of this sort:

Tagged entities in an address string

So, let’s get started. We’ll follow the training process, detailed here, to create our model for parsing US addresses.

spaCy installation: The spaCy package can be installed using pip as below:
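The installation is a single pip command (spaCy 3.x is assumed throughout this post):

pip install -U spacy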

Custom entity labels: Specific to US addresses, we identify the below custom entity labels for our model:

['STREET', 'RECIPIENT', 'BUILDING_NUMBER', 'BUILDING_NAME', 'ZIP_CODE', 'CITY', 'STATE', 'COUNTRY']

Training dataset preparation: We prepare our training dataset in a raw CSV format, limiting it to a good representative sample of the address data in our source systems. A random 80:20 split into training and validation datasets is generally recommended (for our case study, we chose around 100 training and 20 validation data points).

Sample training dataset for US addresses
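One plausible layout for such a CSV pairs the raw address with the text of each labelled component. The column names below are an illustrative assumption rather than the project’s actual schema:

ADDRESS,RECIPIENT,BUILDING_NUMBER,STREET,CITY,STATE,ZIP_CODE,COUNTRY
111 8th Ave Ste 1509 Tulsa OK 74136 US,,111,8th Ave Ste 1509,Tulsa,OK,74136,US
C/o John Doe LLC 111 8th Ave Ste 1509 Tulsa OK 74136-1922 USA,John Doe LLC,111,8th Ave Ste 1509,Tulsa,OK,74136-1922,USA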

This raw dataset, however, needs to be converted into spaCy’s DocBin format before it can be consumed for training. Here’s a quick walk-through of how this is done in our code; a condensed sketch follows the steps below.

i) We start with pre-processing our address strings to get rid of extra spaces and newline characters. Depending on the source data, a few extra data massaging steps might need to be added.

ii) Next, we derive entity spans (start and end positions for an entity) for each of the address strings from our training/validation dataset.

For each data point, a span records the entity’s start and end character positions in the address string along with its label.

iii) Finally, we initialize a DocBin object with this data. This would be persisted in the form of a .spacy corpus file - one each for the training and validation dataset.
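Here is a condensed sketch of these three steps, assuming the illustrative CSV layout shown above (column names, the span-derivation helper, and file names are assumptions, not the project’s exact code):

import pandas as pd
import spacy
from spacy.tokens import DocBin
from spacy.util import filter_spans

LABELS = ["RECIPIENT", "BUILDING_NUMBER", "BUILDING_NAME", "STREET", "CITY", "STATE", "ZIP_CODE", "COUNTRY"]

def clean(text):
    # i) get rid of extra spaces and newline characters
    return " ".join(str(text).split())

def get_spans(address, row):
    # ii) derive (start, end, label) character spans for each entity present in the row
    spans = []
    for label in LABELS:
        value = row.get(label)
        if isinstance(value, str) and value.strip():
            value = clean(value)
            start = address.find(value)
            if start != -1:
                spans.append((start, start + len(value), label))
    return spans

def to_docbin(csv_path, out_path):
    # iii) pack the annotated examples into a DocBin and persist a .spacy corpus file
    nlp = spacy.blank("en")
    db = DocBin()
    for _, row in pd.read_csv(csv_path).iterrows():
        address = clean(row["ADDRESS"])
        doc = nlp.make_doc(address)
        ents = []
        for start, end, label in get_spans(address, row):
            span = doc.char_span(start, end, label=label, alignment_mode="contract")
            if span is not None:
                ents.append(span)
        doc.ents = filter_spans(ents)  # drop any overlapping spans
        db.add(doc)
    db.to_disk(out_path)

to_docbin("train.csv", "train.spacy")
to_docbin("test.csv", "dev.spacy")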

Training configuration: Before we can kick off the training process, we need to prepare a training configuration with all the essential parameters. Let’s create a minimal training skeleton config file as below.
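A skeleton along these lines, patterned on spaCy’s quickstart output for a CPU-only, NER-only pipeline (the paths and values below are illustrative assumptions):

[paths]
train = "train.spacy"
dev = "dev.spacy"

[nlp]
lang = "en"
pipeline = ["ner"]
batch_size = 1000

[components]

[components.ner]
factory = "ner"

[corpora]

[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}

[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}

[training]
train_corpus = "corpora.train"
dev_corpus = "corpora.dev"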

With the above configuration, we define a training pipeline using a blank English language model. Our pipeline contains a single module, i.e., NER, which will be trained. We also initialize the training batch size and other relevant parameters. Read more about the training configuration setup here.

Next, we run the below console command to create a final elaborated config file.
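Assuming the skeleton above was saved as base_config.cfg, the command is:

python -m spacy init fill-config base_config.cfg config.cfg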

Let’s take a quick glance at the generated config file which has the full blueprint of our model and training process.
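Here is an excerpt of the kind of detail the filled config contains (the values shown are spaCy’s defaults, not the author’s tuned settings):

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"

[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
learn_rate = 0.001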

We can find the parser and tokenizer architectures defined, drawing on spaCy’s powerful built-in model architectures. As for training parameters, we notice the optimizer configuration (the Adam optimizer), learning rate, evaluation frequency, and number of epochs as well. Some of these configurations can be overridden during the training run, as we’ll see in the next section.

Training process: Alright, enough of the code and setup stuff! Let’s kick off our training pipeline. With minimal resources at our disposal (no GPU, but a modest quad-core Intel i7 CPU and 16 GB of RAM!), we fire the train command, which triggers the training process and generates our model in just about a minute!
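The command, with the corpus paths passed as overrides (the output directory name is an assumption):

python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy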

Let’s walk through the console output of this run. The train command sets off spaCy’s training loop, which generates the pipeline, initializes model weights, and iteratively goes through a cycle of adjusting weights, checking losses, and evaluating model accuracy against the validation dataset. Notice how the performance metrics (precision, recall, and F-score) move towards a perfect score of 100 with each passing cycle. With a larger and more diverse training/validation set, these metrics would still generally converge towards the perfect score, but would not necessarily get there.

After reaching the configured threshold of about 300 steps, the training process stops, and two model snapshots are saved to disk: best (with the maximum score against the validation set) and last (obtained in the final epoch cycle).

Predictions: It’s the moment of truth! Let’s see how our model performs over a few unseen address strings.
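A minimal check along these lines (the sample address and output directory are illustrative):

import spacy

nlp = spacy.load("output/model-best")
doc = nlp("C/o John Doe LLC, 111 8th Ave Ste 1509, Tulsa, OK 74136, USA")
for ent in doc.ents:
    print(ent.text, "->", ent.label_)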

Not bad at all! 😀 Except for a few address patterns, our model works reasonably well to identify most of the entities accurately in the address string. As we add more training examples to update our model and start extracting other address entities (say Apartment number, PO Box, etc.), we’d observe better prediction results.

Bonus Goodie: Coupling NER with Pattern matching

Although a well-trained model would certainly boost our address data parsing capability, we may still see some whimsical predictions here and there. While there may not always be an easy way out of these, spaCy does come with a provision to reinforce models with a set of pattern-based rules via its Entity Ruler.

Let’s run through another address string for our parser.

Quite erratically, we see Oklahoma being identified as a City rather than a State! We’d deduce this to be an outcome of training on address data that only had two-letter state codes and not the expanded state names. One way to get around this would be to add more such patterns to the training data.

Alternatively, we can create pattern-based regex rules to handle these. Let’s create a pattern file with the list of all US states.
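A few lines of such a file in the JSONL format the Entity Ruler expects (abridged; a full version would cover all states, their abbreviations, and any regex-style token patterns needed):

{"label": "STATE", "pattern": [{"LOWER": "oklahoma"}]}
{"label": "STATE", "pattern": [{"LOWER": "texas"}]}
{"label": "STATE", "pattern": [{"LOWER": "new"}, {"LOWER": "york"}]}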

To make use of these pattern rules in our model, we’d modify our training configuration to add entity-ruler as an additional module along with pattern file reader settings.
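Roughly, the pipeline gains an entity_ruler component plus an initialization block pointing at the patterns file (file paths and component ordering here are assumptions; per spaCy’s docs, placing the ruler before ner makes the recognizer respect the ruler’s spans):

[nlp]
pipeline = ["entity_ruler", "ner"]

[components.entity_ruler]
factory = "entity_ruler"

[initialize.components.entity_ruler]

[initialize.components.entity_ruler.patterns]
@readers = "srsly.read_jsonl.v1"
path = "patterns.jsonl"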

We follow the same process again to generate the full config file and train with it, producing a rule-augmented prediction model.

Let’s see the results.

Perfect! With the careful addition of more such rules (county and country naming patterns, for instance), we can further improve the accuracy of our model.

Final Thoughts

Using spaCy’s powerful NLP-NER capabilities, augmented with its unique rule engine offering, we have demonstrated how easily (almost zero code, with a largely configuration-driven training process!) and economically (no GPU requirements) a minimal address parsing implementation can be built.

While this implementation in itself is not sufficient to handle varied address data semantics, a good data pre-processing strategy coupled with larger, more diverse training sets is expected to yield better results than a pure rule-based parsing approach.

In this case study, we limited our scope exclusively to US addresses. However, another interesting extension to the address parsing problem would be to resolve the country for an address (before running it through the country-specific parser) from a mixed dataset of different geographies. Perhaps a suitable use case to explore with spaCy’s tokenizer and sklearn’s multiclass algorithms. But let’s save that for another blog post!
