Building an address parser with spaCy
Applying Named Entity Recognition to identify addresses.
Address data for an organization is often vital in gathering customer analytics or supporting business operations such as marketing, logistics, delivery, and business correspondence. Cleansed, well parsed, standardized, and validated addresses make the base for data consolidation & analytics engines.
The bulk of enterprise address data can be found in the form of raw address strings manually keyed into a database or flat files. But is it consumable? Here’s how a single US address can be written in different ways:
111 8th Ave Ste 1509 Tulsa OK 74136 US
C/o John Doe LLC, 111, 8th Ave Ste 1509, Oklahoma, 74136-1922, USA
111, 8th Ave Ste 1509, Tulsa, OK, , USA
Pretty messy. Right? 😳
And so, it becomes imperative to pre-process the data by parsing, de-duping, standardizing (mapping to standard names & filling in missing pieces), geo-tagging, etc. before it can be consumed for further analytics. Address parsing is one of these several pre-processing steps which helps to identify & segment an address string into different components such as Recipient, Building, Street, State, County, Postal code, and other such applicable components for that particular country.
It’s obvious that this class of problems simply can’t be addressed (pun intended😉) by writing traditional rule-based (often regular expression-driven) algorithms. We need more sophistication and this is where Natural Language Processing (NLP) algorithms come to the rescue.
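To make the limits of hand-written rules concrete, here is a small sketch in which a regular expression tuned to the first address variant shown earlier fails on the other two. The pattern and the names `ADDRESS_RULE` and `variants` are illustrative assumptions, not part of any library.

```python
import re

# Hypothetical rule written for the layout "<number> <street> <city> <ST> <zip> US".
ADDRESS_RULE = re.compile(
    r"^(?P<number>\d+) (?P<street>.+) (?P<city>\w+) "
    r"(?P<state>[A-Z]{2}) (?P<zip>\d{5}) (?P<country>US)$"
)

# The three spellings of the same address quoted earlier in the article.
variants = [
    "111 8th Ave Ste 1509 Tulsa OK 74136 US",
    "C/o John Doe LLC, 111, 8th Ave Ste 1509, Oklahoma, 74136-1922, USA",
    "111, 8th Ave Ste 1509, Tulsa, OK, , USA",
]

results = [ADDRESS_RULE.match(text) for text in variants]
for text, match in zip(variants, results):
    # A failed match yields None: the rule simply does not apply.
    parsed = match.groupdict() if match else None
    print(text, "->", parsed)
```

Only the first variant is parsed. Covering the other two means adding more patterns, and every new source system tends to bring layouts of its own.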
What’s spaCy and why should I care?
Yeah, it’s spaCy (That’s how it’s written!). As introduced on its Wiki page,
spaCy is an open-source software library for advanced natural language processing, written in the programming languages Python and Cython.
NLP does what humans can but traditional algorithms can’t: learn and improve. We build and train our language processing model to identify patterns, along with their implied context, in a sentence, a passage, or even a novel. Among the plethora of machine learning libraries available, spaCy is one such “knight in shining armor” that does the job with minimal effort and compute resources in the otherwise overwhelming model building and training process.
At its core, spaCy uses Thinc, a deep learning library, which is optimized for CPU usage (often an adoption constraint) and tackles specialized NLP tasks such as tokenization, lemmatization, part-of-speech (POS) tagging, text classification, named-entity recognition, and many others. Without expanding into each one of these techniques, we’d limit our discussion to Named-entity recognition which is relevant to our address parsing use case.
Named-entity recognition (NER) & spaCy
Named-entity recognition (NER) is described as
a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
Simply explained, we attempt to look for “real-world objects” relevant to specific domain categories while scanning over a text passage or sentence and highlight them when found. All this is done mainly through intuition and contextual understanding (acquired during the model training process) of text without writing any programmatic rules. Here’s what an NER model execution over a text passage can reveal:
We get a neat representation of different entities such as Organizations, Geo locations, dates, person names, etc. identified. Quite powerful, huh?
spaCy provides an out-of-box NER feature as part of its pre-trained pipelines so that you don’t necessarily have to go through the steps of building a model (although custom model designing is always an option) from scratch: identify a neural network architecture, add layers, initialize/adjust weights, etc.
The build-and-train process to create a statistical NER model in spaCy is pretty simplified and follows a configuration driven approach: we start with a pre-trained or empty language model, add an entity recognizer, optionally define custom entities, start the iterative training loop over our training set, and with few adjustments in the training set and config we obtain an optimal model. Let’s go over these steps in the coming sections and build our address parser.
Applying NER to address parsing
As you may have gathered by now, our task of segmenting an address string into different components falls into the ambit of named-entity recognition. We’d like to see our NER model be capable of parsing any address string with reasonable accuracy, something of this sort:
So, let’s get started. We’ll follow along the training process, detailed here, to create our model for parsing US addresses.
spaCy installation: spaCy package can be installed using pip as below:
pip install -U spacy
>>> import spacy
>>> spacy.__version__
'3.1.2'
Custom entity labels: Specific to US addresses, we identify below custom entity labels for our model:
['STREET_NAME', 'RECIPIENT', 'BUILDING_NO', 'BUILDING_NAME', 'ZIP_CODE', 'CITY', 'STATE', 'COUNTRY']
Training dataset preparation: We prepare our training dataset in a raw CSV format, limiting it to a good representative sample of the address data in our source systems. A random 80:20 split of the data into training and validation datasets is generally recommended (for our case study, we chose around 100 training and 20 validation data points).
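A random split of this kind can be sketched with the standard library alone. The function name `split_dataset`, the fixed seed, and the placeholder rows are assumptions made for illustration; the article's own data preparation code may differ.

```python
import random

def split_dataset(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of rows and split it into (train, validation) lists."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Placeholder rows standing in for raw address strings.
rows = [f"address {i}" for i in range(120)]
train_rows, validation_rows = split_dataset(rows)
print(len(train_rows), len(validation_rows))
```

Fixing the seed makes it easier to compare training runs, since the validation set stays the same while the configuration changes.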
This raw dataset, however, needs to be converted into spaCy’s DocBin format before consumption for training. Here’s a quick walk-through of how this is done in our code.
i) We start with pre-processing our address strings to get rid of extra spaces and newline characters. Depending on the source data, a few extra data massaging steps might need to be added.
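import re

def massage_data(address):
    '''Pre process address string to remove new line characters, add comma punctuations etc.'''
    # Make sure every comma is followed by a space
    cleansed_address1=re.sub(r'(,)(?!\s)',', ',address)
    # Replace literal "\n" sequences (a backslash followed by n) with a comma separator
    cleansed_address2=re.sub(r'(\\n)',', ',cleansed_address1)
    # Surround hyphens that are not followed by whitespace with spaces
    cleansed_address3=re.sub(r'(?!\s)(-)(?!\s)',' - ',cleansed_address2)
    # Drop periods
    cleansed_address=re.sub(r'\.','',cleansed_address3)
    return cleansed_address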
ii) Next, we derive entity spans (start and end positions for an entity) for each of the address strings from our training/validation dataset.
import pandas as pd

def get_address_span(address=None,address_component=None,label=None):
    '''Search for specified address component and get the span.
    Eg: get_address_span(address="221 B, Baker Street, London",address_component="221",label="BUILDING_NO") would return (0,3,"BUILDING_NO")'''
    if pd.isna(address_component) or str(address_component)=='nan':
        return None
    # Apply the same clean-up that massage_data() applies to the full address
    address_component1=re.sub(r'\.','',address_component)
    address_component2=re.sub(r'(?!\s)(-)(?!\s)',' - ',address_component1)
    # Escape the component so that characters such as "(" or "+" are matched literally
    span=re.search(r'\b(?:'+re.escape(address_component2)+r')\b',address)
    if span is None:
        return None
    return (span.start(),span.end(),label)
Here’s what these look like for a few data points:
(19 ST ANDREW ST, BULRINGTON, VT, 05401, , United States, [(0, 2, BUILDING_NO), (3, 15, STREET_NAME), (33, 38, ZIP_CODE), (17, 27, CITY), (29, 31, STATE), (42, 55, COUNTRY)])
(2574 EAST 23RD STREE, CHATTANOOGA, TN 37404, United States, [(0, 4, BUILDING_NO), (5, 20, STREET_NAME), (38, 43, ZIP_CODE), (22, 33, CITY), (35, 37, STATE), (45, 58, COUNTRY)])
(5931 W ANGELA RD, MEMPHIS, TN 38120, United States, [(0, 4, BUILDING_NO), (5, 16, STREET_NAME), (30, 35, ZIP_CODE), (18, 25, CITY), (27, 29, STATE), (37, 50, COUNTRY)])
(3812 MYERS STREET, GREENEVILLE, TN 37743, United States, [(0, 4, BUILDING_NO), (5, 17, STREET_NAME), (35, 40, ZIP_CODE), (19, 30, CITY), (32, 34, STATE), (42, 55, COUNTRY)])
iii) Finally, we initialize a DocBin object with this data. This would be persisted in the form of a .spacy corpus file - one each for the training and validation dataset.
from spacy.tokens import DocBin

def get_doc_bin(training_data,nlp):
    '''Create DocBin object for building training/test corpus'''
    # the DocBin will store the example documents
    db = DocBin()
    for text, annotations in training_data:
        doc = nlp(text) #Construct a Doc object
        ents = []
        for start, end, label in annotations:
            span = doc.char_span(start, end, label=label)
            # char_span returns None when the offsets do not align with token boundaries
            if span is not None:
                ents.append(span)
        doc.ents = ents
        db.add(doc)
    return db
.
.
.
###### Training dataset prep ###########
# Read the training dataset into pandas
df_train=pd.read_csv(filepath_or_buffer="./corpus/dataset/us-train-dataset.csv",sep=",",dtype=str)
# Get entity spans
df_entity_spans= create_entity_spans(df_train.astype(str),tag_list)
training_data= df_entity_spans.values.tolist()
.
.
.
# Get & Persist DocBin to disk
doc_bin_train= get_doc_bin(training_data,nlp)
doc_bin_train.to_disk("./corpus/spacy-docbins/train.spacy")
######################################
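The entity spans from step ii) are plain character offsets, so they can be sanity-checked without spaCy. The sketch below recomputes the spans for one of the sample addresses listed earlier and checks that no two spans overlap, since spaCy does not accept overlapping spans in doc.ents. The helpers `find_span` and `has_overlap` are hypothetical names introduced here for illustration.

```python
import re

def find_span(address, component, label):
    """Return (start, end, label) for the first whole-word match, or None."""
    match = re.search(r"\b" + re.escape(component) + r"\b", address)
    return (match.start(), match.end(), label) if match else None

def has_overlap(spans):
    """True if any two (start, end, label) spans share a character."""
    ordered = sorted(spans)
    return any(prev[1] > cur[0] for prev, cur in zip(ordered, ordered[1:]))

address = "5931 W ANGELA RD, MEMPHIS, TN 38120, United States"
components = [
    ("5931", "BUILDING_NO"),
    ("W ANGELA RD", "STREET_NAME"),
    ("38120", "ZIP_CODE"),
    ("MEMPHIS", "CITY"),
    ("TN", "STATE"),
    ("United States", "COUNTRY"),
]

spans = [find_span(address, text, label) for text, label in components]
print(spans)
print(has_overlap(spans))
```

The offsets reproduce the values shown for this address in the sample data, and the overlap check reports False. Running such a check over the whole dataset before building the DocBin helps surface annotation problems early.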
Training configuration: Before we can kick off the training process, we need to prepare a training configuration with all the essential parameters. Let’s create a minimal training skeleton config file as below.
[components]

[components.ner]
factory="ner"

[nlp]
lang = "en"
pipeline = ["ner"]

[training]

[training.batch_size]
@schedules = "compounding.v1"
start = 4
stop = 32
compound = 1.001
With the above configuration, we define a training pipeline using a blank English language model. Our pipeline contains a single module i.e., NER which would be trained. We also initialize the training batch size and other relevant parameters. Read more about the training configuration setup here.
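To get a feel for the batch size schedule configured above, here is a plain Python sketch of a compounding series: each value is the previous one multiplied by the compound rate, capped at the stop value. It is a re-implementation of the idea for illustration and assumes start is below stop; it is not the library's own code.

```python
from itertools import islice

def compounding(start, stop, compound):
    """Yield start, start*compound, start*compound**2, ... capped at stop."""
    value = float(start)
    while True:
        yield min(value, stop)
        value *= compound

# Values for steps 0..300, using the parameters from the config above.
sizes = list(islice(compounding(4, 32, 1.001), 301))
print(sizes[0], round(sizes[300], 2))
```

With a compound rate of 1.001 the growth is slow: after 300 steps the value has only risen from 4 to roughly 5.4, so a short training run never gets near the stop value of 32.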
Next, we run the below console command to create a final elaborated config file.
python -m spacy init fill-config config\base_config.cfg config\config.cfg
Let’s take a quick glance at the generated config file which has the full blueprint of our model and training process.
.
.
.
.
[components]

[components.ner]
factory = "ner"
incorrect_spans_key = null
moves = null
update_with_oracle_cut_size = 100

[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null

[components.ner.model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v2"
pretrained_vectors = null
width = 96
depth = 4
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true
.
.
.
.
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001

[training.score_weights]
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
ents_per_type = null
.
.
.
.
.
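The generated config spells out the model architectures: a transition-based parser (spacy.TransitionBasedParser.v2) for the NER component, sitting on top of a token-to-vector embedding layer (spacy.HashEmbedCNN.v2). Since we started from a blank English pipeline, these are untrained architectures rather than pre-trained models. As for the training parameters, we see the optimizer configuration (Adam with a learning rate of 0.001) and the score weights; the full file also holds settings such as the evaluation frequency and the maximum number of epochs and steps. Some of these configurations can be overridden during the training run, as we’ll see in the next section.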
Training process: Alright, enough of the code and setup stuff! Let’s kick off our training pipeline. With minimal resources at our disposal (no GPU, but a modest Quad-core Intel i7 CPU & 16gigs of RAM!), we fire the train command which triggers the training process to generate our model in just about a minute!
python -m spacy train config\config.cfg --paths.train corpus\spacy-docbins\train.spacy --paths.dev corpus\spacy-docbins\test.spacy --output output\models --training.eval_frequency 10 --training.max_steps 300
ℹ Saving to output directory: output\models
ℹ Using CPU
=========================== Initializing pipeline ===========================
[2021-09-11 18:30:04,925] [INFO] Set up nlp object from config
[2021-09-11 18:30:04,925] [INFO] Pipeline: ['ner']
[2021-09-11 18:30:04,941] [INFO] Created vocabulary
[2021-09-11 18:30:04,941] [INFO] Finished initializing nlp object
[2021-09-11 18:30:05,141] [INFO] Initialized pipeline components: ['ner']
✔ Initialized pipeline
============================= Training pipeline =============================
ℹ Pipeline: [ 'ner']
ℹ Initial learn rate: 0.001
E # LOSS NER ENTS_F ENTS_P ENTS_R SCORE
--- ------ -------- ------ ------ ------ ------
0 0 62.71 6.18 4.13 12.21 0.06
0 10 808.79 0.00 0.00 0.00 0.00
0 20 468.03 23.00 33.33 17.56 0.23
1 30 286.96 44.55 58.75 35.88 0.45
1 40 348.01 75.10 75.38 74.81 0.75
2 50 254.44 76.56 78.40 74.81 0.77
2 60 244.69 82.11 87.83 77.10 0.82
3 70 115.05 91.12 92.19 90.08 0.91
3 80 61.84 94.57 96.06 93.13 0.95
4 90 99.36 98.47 98.47 98.47 0.98
4 100 29.16 98.47 98.47 98.47 0.98
5 110 34.48 98.08 98.46 97.71 0.98
5 120 27.44 98.08 98.46 97.71 0.98
6 130 16.41 98.08 98.46 97.71 0.98
6 140 15.01 98.85 99.23 98.47 0.99
7 150 5.20 100.00 100.00 100.00 1.00
7 160 9.05 100.00 100.00 100.00 1.00
8 170 2.39 100.00 100.00 100.00 1.00
9 180 0.77 100.00 100.00 100.00 1.00
9 190 3.11 100.00 100.00 100.00 1.00
10 200 3.14 100.00 100.00 100.00 1.00
10 210 1.31 100.00 100.00 100.00 1.00
11 220 3.40 100.00 100.00 100.00 1.00
11 230 0.05 100.00 100.00 100.00 1.00
12 240 1.55 100.00 100.00 100.00 1.00
13 250 1.40 100.00 100.00 100.00 1.00
13 260 0.02 100.00 100.00 100.00 1.00
14 270 0.55 100.00 100.00 100.00 1.00
14 280 0.82 100.00 100.00 100.00 1.00
15 290 3.63 100.00 100.00 100.00 1.00
16 300 0.00 100.00 100.00 100.00 1.00
✔ Saved pipeline to output directory
output\models\model-last
Let’s walk through the above console output. The train command sets off spaCy’s training loop, which generates the pipeline, initializes the model weights, and iteratively goes through a cycle of adjusting weights, checking losses, and evaluating model accuracy against the validation dataset. Notice how the performance metrics (precision, recall, and F-score) move towards a perfect score of 100 as training progresses. With a larger and much more diverse training/validation set, though, these metrics would generally converge towards the perfect score but may not necessarily get there.
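The ENTS_F column is the harmonic mean of ENTS_P (precision) and ENTS_R (recall). As a quick check, the sketch below recomputes it for two rows of the log; the function name `f_score` is an assumption made for illustration.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (ENTS_P, ENTS_R) pairs copied from steps 20 and 30 of the log above.
for precision, recall in [(33.33, 17.56), (58.75, 35.88)]:
    print(round(f_score(precision, recall), 2))
```

This prints 23.0 and 44.55, matching the ENTS_F values logged at those steps. Because the score weights in the config give ents_f a weight of 1.0 and the other metrics 0.0, the SCORE column is the same number on a 0 to 1 scale.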
After reaching the configured limit of 300 steps (--training.max_steps 300), the training process stops, and two models are saved to disk: model-best (the one with the highest score against the validation set) and model-last (the state after the final training step).
Predictions: It’s the moment of truth! Let’s see how our model performs over a few unseen address strings.
import spacy
# A raw string keeps the backslashes in the Windows path from being read as escape sequences
nlp=spacy.load(r"output\models\model-best")
address_list=["130 W BOSE ST STE 100, PARK RIDGE, IL, 60068, USA",
              "8311 MCDONALD RD, HOUSTON, TX, 77053-4821, USA",
              "PO Box 317, 4100 Hwy 20 E Ste 403, NICEVILLE, FL, 32578-5037, USA",
              "C/O Elon Musk Innovations Inc, 1548 E Florida Avenue, Suite 209, TAMPA, FL, 33613, USA",
              "Seven Edgeway Plaza, C/O Mac Dermott Inc, OAKBROOK TERRACE, IL, 60181, USA"]
for address in address_list:
    doc=nlp(address)
    ent_list=[(ent.text, ent.label_) for ent in doc.ents]
    print("Address string -> "+address)
    print("Parsed address -> "+str(ent_list))
    print("******")

### Prediction output ###
Address string -> 130 W BOSE ST STE 100, PARK RIDGE, IL, 60068, USA
Parsed address -> [('130', 'BUILDING_NO'), ('W BOSE ST', 'STREET_NAME'), ('PARK RIDGE', 'CITY'), ('IL', 'STATE'), ('60068', 'ZIP_CODE'), ('USA', 'COUNTRY')]
******
Address string -> 8311 MCDONALD RD, HOUSTON, TX, 77053-4821, USA
Parsed address -> [('8311', 'BUILDING_NO'), ('MCDONALD RD', 'STREET_NAME'), ('HOUSTON', 'CITY'), ('TX', 'STATE'), ('77053-4821', 'ZIP_CODE'), ('USA', 'COUNTRY')]
******
Address string -> PO Box 317, 4100 Hwy 20 E Ste 403, NICEVILLE, FL, 32578-5037, USA, US
Parsed address -> [('4100', 'BUILDING_NO'), ('Hwy 20 E', 'STREET_NAME'), ('NICEVILLE', 'CITY'), ('FL', 'STATE'), ('32578-5037', 'ZIP_CODE'), ('US', 'COUNTRY')]
******
Address string -> C/O Elon Musk Innovations Inc, 1548 E Florida Avenue, Suite 209, TAMPA, FL, 33613, USA
Parsed address -> [('C/O Elon Musk Innovations Inc', 'RECIPIENT'), ('1548', 'BUILDING_NO'), ('E Florida Avenue', 'STREET_NAME'), ('TAMPA', 'CITY'), ('FL', 'STATE'), ('33613', 'ZIP_CODE'), ('USA', 'COUNTRY')]
******
Address string -> Seven Edgeway Plaza, C/O Mac Dermott Inc, OAKBROOK TERRACE, IL, 60181, USA
Parsed address -> [('Seven Edgeway Plaza', 'STREET_NAME'), ('C/O Mac Dermott Inc', 'RECIPIENT'), ('OAKBROOK TERRACE', 'CITY'), ('IL', 'STATE'), ('60181', 'ZIP_CODE'), ('USA', 'COUNTRY')]
******
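For downstream use, the list of (text, label) pairs is usually more convenient as a record keyed by label. The sketch below shows one way to do that; `to_record` is a hypothetical helper, and it keeps a list per label so that repeated labels are not silently dropped.

```python
from collections import defaultdict

def to_record(entities):
    """Group (text, label) pairs by label, keeping every value for a label."""
    record = defaultdict(list)
    for text, label in entities:
        record[label].append(text)
    return dict(record)

# Parsed output copied from the fourth prediction above.
parsed = [
    ("C/O Elon Musk Innovations Inc", "RECIPIENT"),
    ("1548", "BUILDING_NO"),
    ("E Florida Avenue", "STREET_NAME"),
    ("TAMPA", "CITY"),
    ("FL", "STATE"),
    ("33613", "ZIP_CODE"),
    ("USA", "COUNTRY"),
]

record = to_record(parsed)
print(record["CITY"], record["ZIP_CODE"])
```

A label that ends up with more than one value, or a label that is missing altogether, is a cheap signal that a prediction deserves a second look.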
Not bad at all! 😀 Except for a few address patterns, our model works reasonably well to identify most of the entities accurately in the address string. As we add more training examples to update our model and start extracting other address entities (say Apartment number, PO Box, etc.), we’d observe better prediction results.
Bonus Goodie: Coupling NER with Pattern matching
Although a well-trained model would certainly give a boost to our address data parsing capability, we may still see some whimsical predictions here and there. While there may not always be an easy way out of these, spaCy does come with a provision to reinforce models through a set of pattern-based rules, covered via its Entity Ruler.
Let’s run through another address string for our parser.
address="C/o John Doe LLC, 111 8th Avenue Ste 1301, Tulsa, Oklahoma, 74136–1922, USA"
doc=nlp(address)
ent_list=[(ent.text, ent.label_) for ent in doc.ents]
print("Address string -> "+address)
print("Parsed address -> "+str(ent_list))
#######################
Address string -> C/o John Doe LLC, 111 8th Avenue Ste 1301, Tulsa, Oklahoma, 74136–1922, USA
Parsed address -> [('C/o John Doe LLC', 'RECIPIENT'), ('111', 'BUILDING_NO'), ('8th Avenue', 'STREET_NAME'), ('Tulsa', 'CITY'), ('Oklahoma', 'CITY'), ('74136–1922', 'ZIP_CODE'), ('USA', 'COUNTRY')]
Quite erratically, we see Oklahoma being identified as a City rather than a State! We’d deduce this as an outcome of using training address data which only had two-lettered state codes but not the expanded state names. One way to get around this would be through adding more such patterns in training data.
Alternatively, we can create token-based pattern rules to handle these. Let’s create a pattern file with the list of all US states.
{"label":"STATE","pattern":[{"LOWER":"alabama"}]}
{"label":"STATE","pattern":[{"LOWER":"alaska"}]}
{"label":"STATE","pattern":[{"LOWER":"arizona"}]}
{"label":"STATE","pattern":[{"LOWER":"arkansas"}]}
{"label":"STATE","pattern":[{"LOWER":"california"}]}
{"label":"STATE","pattern":[{"LOWER":"colorado"}]}
{"label":"STATE","pattern":[{"LOWER":"connecticut"}]}
.
.
.
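Writing such a pattern file by hand gets tedious, so it can be generated. The sketch below builds the same JSONL lines from a list of state names. It assumes that multi-word names such as "New York" should become one token pattern per word, which matches how whitespace-separated words are tokenized; the function name `state_patterns` and the three sample states are illustrative only.

```python
import json

def state_patterns(state_names):
    """Build one entity-ruler style pattern dict per state name."""
    patterns = []
    for name in state_names:
        # One {"LOWER": ...} token pattern per word, matched case-insensitively.
        tokens = [{"LOWER": word} for word in name.lower().split()]
        patterns.append({"label": "STATE", "pattern": tokens})
    return patterns

sample_states = ["Alabama", "New York", "Oklahoma"]
lines = [
    json.dumps(pattern, separators=(",", ":"))
    for pattern in state_patterns(sample_states)
]
print("\n".join(lines))
```

Each printed line is one JSON object, which is the layout a JSONL pattern file expects. Writing the lines to disk is then a matter of joining them with newlines.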
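To make use of these pattern rules in our model, we’d modify our training configuration to add the entity ruler as an additional pipeline component, along with the pattern file reader settings. Keep in mind that an entity ruler placed after ner only adds entities that do not overlap with the model’s predictions, unless its overwrite_ents setting is enabled.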
.
.
[components.ner]
factory="ner"

[components.entity_ruler]
factory="entity_ruler"

[initialize]

[initialize.components]

[initialize.components.entity_ruler]

[initialize.components.entity_ruler.patterns]
@readers = "srsly.read_jsonl.v1"
path = "corpus\rules\entity_ruler_patterns.jsonl"

[nlp]
lang = "en"
pipeline = ["ner","entity_ruler"]
.
.
We follow the same process, yet again, to generate the full config file and train through it to generate a rule augmented prediction model.
Let’s see the results.
nlp=spacy.load(r"output\models_er\model-best")
doc=nlp(address)
ent_list=[(ent.text, ent.label_) for ent in doc.ents]
print("Address string -> "+address)
print("Parsed address -> "+str(ent_list))
###################
Address string -> C/o John Doe LLC, 111 8th Avenue Ste 1301, Tulsa, Oklahoma, 74136–1922, USA
Parsed address -> [('C/o John Doe LLC', 'RECIPIENT'), ('111', 'BUILDING_NO'), ('8th Avenue', 'STREET_NAME'), ('Tulsa', 'CITY'), ('Oklahoma', 'STATE'), ('74136–1922', 'ZIP_CODE'), ('USA', 'COUNTRY')]
Perfect! With the careful addition of more such rules (counties, country naming patterns), we can further improve the accuracy of our model.
Final Thoughts
Using spaCy’s powerful NLP-NER capabilities, augmented with its rule engine offering, we have demonstrated how easily (an almost zero-code, largely configuration-driven training process!) and economically (no GPU requirements) a minimal address parsing implementation can be built.
While this implementation in itself is not sufficient to handle varied address data semantics, a good data pre-processing strategy coupled with larger, more diverse training sets is expected to yield better results than a pure rule-based parsing approach.
In this case study, we limited our scope exclusively to US addresses. However, another interesting extension to the address parsing problem would be to resolve the country for an address (before running through the country-specific parser) from a mixed dataset of different geographies. Perhaps, a suitable use case to explore with spaCy’s tokenizer and sklearn’s multiclass algorithm. But, let’s save that for another blog post!