Handling (messy) address data

Afonso Delgado · Incognia Tech Blog · Jun 3, 2021

Thanks to Vinicius Cousseau for contributing to this blog post.

Matching addresses is an important problem in the fraud detection industry: we need to make sure that an address submitted by a user matches one from a reliable source (a government-related data source, for instance) in order to evaluate whether the submitted address is consistent. Also, being able to consistently group generic address strings by their real meaning — the physical location they represent — is important to avoid duplicated information and to improve the quality of the products built upon them.

String matching by itself is already a difficult enough challenge, with many available approaches ranging from machine learning to purely algorithmic techniques. On top of that, in our specific case, we also need the match to be as fast as possible, even if that means sacrificing some correctness.

However, addresses bring yet another level of complexity. Addresses in Incognia’s API can come as a single string or as structured objects with hierarchical fields representing the various components of an address, such as the city, street, and number. Besides that, an address carries a lot of context-related information, which in turn causes variance among addresses that represent the same physical location. For example, a street field may end with an abbreviated road type, as in “West 54th St.”, but could also come without it, as in “West 54th Street”, and the number may come as a range, like “First Street, 2–5”, in contrast with “First Street, 2”. So we have to take these and many other kinds of problem-specific information into account and use them to normalize the address strings, improving the quality of our match.

With that said, let us describe our approach to normalizing and matching address strings so that they can be used in our anti-fraud products. There is no better place to start than with our data.

Our Data

The main source of our address data is end-users. We receive address data from multiple companies: fintechs, digital banks, delivery apps, and many others. This range of industries makes our address data very irregular in terms of content, since the way addresses are collected from users can change from company to company. The data is, however, regular in terms of format, since all companies adhere to Incognia’s API format for addresses: either an address line, normally separated by commas, or a structured object that looks like this:

"structured_address": {
"locale": "en-US",
"country_name": "United States",
"country_code": "US",
"state": "NY",
"city": "New York",
"borough": "Manhattan",
"neighborhood": "Midtown",
"street": "5th Av.",
"number": "123",
"complements" : "2nd Floor",
"postal_code": "10110-0001"
}
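
For illustration only, the structured form maps naturally onto a small typed object. The sketch below is a hypothetical Python representation, not our actual internal model; the field names simply mirror the API payload above.

from dataclasses import dataclass
from typing import Optional

@dataclass
class StructuredAddress:
    # Hypothetical in-memory mirror of the structured API payload above
    locale: Optional[str] = None
    country_name: Optional[str] = None
    country_code: Optional[str] = None
    state: Optional[str] = None
    city: Optional[str] = None
    borough: Optional[str] = None
    neighborhood: Optional[str] = None
    street: Optional[str] = None
    number: Optional[str] = None
    complements: Optional[str] = None
    postal_code: Optional[str] = None

address = StructuredAddress(
    country_name="United States", state="NY", city="New York",
    borough="Manhattan", neighborhood="Midtown", street="5th Av.", number="123",
)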

We store the addresses in an encrypted manner, only to be decrypted by the services that will use them to derive the anti-fraud evidence based on an address match.

Address Normalization

To make the address strings comparable, we first must normalize them. This is not a foolproof process but rather a collection of steps, each targeting a direct and common modification that end-users make to address strings and trying to revert it to a single standard.

It is important to say that there are already some very useful libraries like libpostal or usaddress that offer address normalization.

However, these off-the-shelf tools may not solve every normalization problem in real-world scenarios. As we have noted, our data comes from a great variety of sources, and each of them can treat addresses in its own customized way, which can lead to address content that these libraries are not prepared to handle. Besides that, there is also the performance side of the problem: libpostal, for example, requires a memory footprint of almost 7GB and a non-trivial deployment process.

That is why, at Incognia, we went with a more tailored solution: the only guide to how well our normalization is performing must be our data.

Let us get to it, then.

Field separation

The first step in the normalization stage is to separate the fields of an address. One way of doing that is to break the address apart at some separator and use the XAL format to represent it. Here, we will keep it simpler and just use the fields as they are submitted by the user. We assume that the user inputs an address in the following format:

{
    "country": "United States",
    "state": "NY",
    "city": "New York",
    "borough": "Manhattan",
    "neighborhood": "Midtown",
    "street": "5th Av.",
    "number": "123"
}

Basic steps

We first begin with some basic, well-known steps for normalizing the address. They are:

  • Lowercase everything: this ensures that, when comparing, we only pay attention to which character we are looking at, not to its case.
  • Strip accents: accents are very common in languages other than English, like Portuguese, Spanish, French, and German. The point here is to replace accented characters, since they compare differently from their stripped counterparts. An example of that is “são paulo”, which gets transformed into “sao paulo”.
  • Remove punctuation, symbols, and meta-information: the next step is removing punctuation, symbols, and meta-information, mainly because some users tend to use them and some do not. So, for example, we remove “.”, “,”, “-”, and “state of” from the fields (the last one only from the state field). A minimal sketch of these three steps appears after the example below.

Then, we get:

{
    "country": "united states",
    "state": "ny",
    "city": "new york",
    "borough": "manhattan",
    "neighborhood": "midtown",
    "street": "5th av",
    "number": "123"
}
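
As an illustration, a minimal Python sketch of these three basic steps could look like the one below, using only the standard library. The punctuation pattern and the per-field meta-information table are simplified examples, not our full rule set.

import re
import unicodedata

# Simplified example of per-field meta-information to drop (not the full table)
META_TOKENS = {"state": ["state of"]}

def normalize_basic(value: str, field: str = "") -> str:
    # 1. Lowercase everything
    value = value.lower()
    # 2. Strip accents: decompose characters and drop the combining marks
    value = unicodedata.normalize("NFKD", value)
    value = "".join(ch for ch in value if not unicodedata.combining(ch))
    # 3. Remove meta-information tied to this specific field (e.g. "state of")
    for token in META_TOKENS.get(field, []):
        value = value.replace(token, " ")
    #    ... and remove punctuation and symbols
    value = re.sub(r"[.,\-/]", " ", value)
    # Collapse any whitespace introduced by the removals
    return re.sub(r"\s+", " ", value).strip()

print(normalize_basic("São Paulo"))              # -> "sao paulo"
print(normalize_basic("5th Av."))                # -> "5th av"
print(normalize_basic("State of NY", "state"))   # -> "ny"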

Replace numbers and ordinals

After the basic steps, we can proceed to replace numbers, Roman numerals, and ordinals in the fields by writing them out. For example, we want to transform “5th avenue” into “fifth avenue”.

The idea behind doing that is to achieve a better comparison or distance measure between those fields: “5th avenue” can be turned into “6th avenue” by changing a single character, whereas transforming “fifth avenue” into “sixth avenue” requires more changes, making the two easier to tell apart.

We can also choose which fields to apply this transformation to, since for the number field it is better to keep it as it is and parse it as an actual number, and, if needed, detect number ranges, like “Main St., 5–20”. A small sketch of this replacement follows the example below.

{
    "country": "united states",
    "state": "ny",
    "city": "new york",
    "borough": "manhattan",
    "neighborhood": "midtown",
    "street": "fifth av",
    "number": "123"
}
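
Here is a minimal sketch of the ordinal replacement, assuming a small hand-written lookup table. A real implementation would need a full number-to-words conversion (and per-language tables), but the idea is the same.

# Hypothetical, heavily abbreviated tables; real ones cover far more values and locales
ORDINALS = {"1st": "first", "2nd": "second", "3rd": "third", "4th": "fourth", "5th": "fifth"}
ROMAN_NUMERALS = {"ii": "second", "iii": "third", "iv": "fourth"}

def spell_out_ordinals(value: str) -> str:
    # Replace each token that is a known ordinal or Roman numeral with its written-out form
    tokens = value.split()
    return " ".join(ORDINALS.get(tok, ROMAN_NUMERALS.get(tok, tok)) for tok in tokens)

print(spell_out_ordinals("5th av"))  # -> "fifth av"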

Strip bracketed information

Another useful transformation is to remove any text enclosed in brackets, parentheses, and the like, transforming, for example, “fifth avenue (next to Adidas shop)” into simply “fifth avenue”. The reason behind this is that some users tend to add this extra information to fields as a way of giving reference points for their addresses.

In our implementation, this is done by a simple regex match.
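
For reference, a regex along these lines does the trick. This is a sketch of the idea; the exact pattern we use may differ.

import re

# Drop anything enclosed in (), [] or {}, including the brackets themselves
BRACKETED = re.compile(r"[(\[{][^)\]}]*[)\]}]")

def strip_bracketed(value: str) -> str:
    return re.sub(r"\s+", " ", BRACKETED.sub(" ", value)).strip()

print(strip_bracketed("fifth avenue (next to adidas shop)"))  # -> "fifth avenue"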

Replace and remove common tokens

This step is responsible for tackling common tokens in the address fields. For example, in the street field we may see tokens such as “st”, “av”, or “rd”, abbreviations that can be replaced by their expansions “street”, “avenue”, and “road”. We can then even delete those expansions so that they do not interfere with the comparison part of the match.

This also works for state abbreviations: we replace the abbreviation with the full state name in order to allow a fuzzy match on this field. For example, when two state fields are “ny” and “neew york”, it is better to transform the first one into “new york” and then compare, instead of trying to convert the second one into the abbreviation.

We also remove prepositions and connectives like “of”, “and”, and “to” from the fields to perform better tokenization in the matching step.

The way this replacement is made is quite simple and standard (see the abbreviation expansion section of the post on libpostal’s inner workings): we use a dictionary that maps each abbreviation to its corresponding expansion, even allowing several abbreviations for a single expansion. A short sketch of this mapping follows the example below.

{
    "country": "united states",
    "state": "new york",
    "city": "new york",
    "borough": "manhattan",
    "neighborhood": "midtown",
    "street": "fifth",
    "number": "123"
}
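
Below is a sketch of this dictionary-based replacement, with just a handful of example entries. The real tables are per-field and much larger, and the decision of which expanded tokens to drop is also field-specific.

# Hypothetical, abbreviated tables for illustration
STREET_EXPANSIONS = {"st": "street", "av": "avenue", "ave": "avenue", "rd": "road"}
STATE_EXPANSIONS = {"ny": "new york", "ca": "california"}
STOPWORDS = {"of", "and", "to"}
GENERIC_STREET_WORDS = {"street", "avenue", "road"}  # dropped after expansion

def expand_street(value: str) -> str:
    # Expand abbreviations token by token, then drop generic words and connectives
    tokens = [STREET_EXPANSIONS.get(tok, tok) for tok in value.split()]
    tokens = [tok for tok in tokens if tok not in GENERIC_STREET_WORDS | STOPWORDS]
    return " ".join(tokens)

def expand_state(value: str) -> str:
    # State abbreviations map to the full name so that fuzzy matching works on this field
    return STATE_EXPANSIONS.get(value, value)

print(expand_street("fifth av"))  # -> "fifth"
print(expand_state("ny"))         # -> "new york"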

String matching

Once addresses are normalized, we need to be able to match them. We do this field by field in a hierarchical manner — we first match countries, then states, and so on — and to do that we follow two steps.

Hierarchical match

The first step is to break each field into several tokens, splitting it on the space character. For example, “santa monica” becomes [“santa”, “monica”]. This is done so we can compare each token individually instead of the entire string.

After that, we calculate a measure of similarity between each pair of tokens. In our implementation, this is done using the Levenshtein distance, which can be seen as the minimum number of single-character edits necessary to turn one string into the other. We chose the Levenshtein distance because it proved good enough to perform the match on our normalized data and has great performance. However, other string metrics could be used, as well as ML-related approaches.

Once each token is compared with its counterpart, we accumulate the results and expose a final similarity score for that specific field, which can be used to decide whether it is a match or not. A minimal sketch of this comparison is shown below.

[Figure: an example of a match and an example of a no match]
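
Here is a minimal sketch of this token-by-token comparison: a plain Levenshtein implementation, a similarity derived from it, and a simple average as the accumulated field score. The threshold and the exact aggregation used in production may differ.

def levenshtein(a: str, b: str) -> int:
    # Minimum number of single-character edits (insert, delete, substitute) turning a into b
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                 # deletion
                               current[j - 1] + 1,              # insertion
                               previous[j - 1] + (ca != cb)))   # substitution
        previous = current
    return previous[-1]

def token_similarity(a: str, b: str) -> float:
    # Normalize the edit distance into a similarity in [0, 1]
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / longest

def field_score(field_a: str, field_b: str) -> float:
    # Compare each token with its counterpart and average the similarities
    scores = [token_similarity(x, y) for x, y in zip(field_a.split(), field_b.split())]
    return sum(scores) / len(scores) if scores else 0.0

print(field_score("new york", "neew york"))    # -> 0.875, likely a match
print(field_score("new york", "los angeles"))  # low score, not a match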

Iteration

A final and continuous step in our address matching is iteration.

Many factors can change the distribution, correctness, or performance of our matching pipeline, including the arrival of new companies in our client base, changes to current ones, and new product requirements, so we need to keep up with them as well.

That is why we are constantly analyzing new data, looking at performance metrics, and cherry-picking failing cases in order to find weaknesses in our current strategy.

Conclusion

So, that is it! This was a brief overview of how we normalize and match addresses in our services.

We did not provide a full and thorough specification of our implementation here, mainly because, as we said, address normalization and matching depend very much on the format of the address data, not only in terms of how it was created but also of who created it — the end-users — since this data can vary across countries, regions, and even among different people.

We hope, however, to have contributed to a more general understanding of what we look for in an address to improve the quality of the final match, of the basic forms of normalization, which are simple and fast, and of what our address data pipeline looks like.

Are you interested in building frictionless anti-fraud products with us? Check out our developer portal!
