Very fast Data cleaning of product names, company names & street names (2015)


The correction of product names, company names, street names & addresses is a frequent task in data cleaning and deduplication. Often those names are misspelled, due either to OCR errors or to mistakes made by the human data collectors.

The difference from ordinary spelling correction of single words is that those names often consist of multiple words, white space and punctuation. For large data sets or even Big Data applications, speed is also very important.

The SymSpell algorithm supports both requirements and is up to 1 million times faster than conventional approaches (see benchmark). The C# source code is available as Open Source on GitHub. A simple modification of the original source code adds support for names with multiple words, white space and punctuation:

You can simply use CreateDictionaryEntry("company/street/product name", "") to add multi-word company, street & product names to the dictionary. Spaces within the names are allowed.

Then, with Correct("misspelled street",""); you will get the correct street name from the dictionary. With the verbosity parameter you may specify whether you want only the best match or all matches within a certain edit distance (the number of single-character operations by which two strings differ).
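To make the idea behind those two calls more concrete, here is a minimal Python sketch of the symmetric-delete principle that SymSpell is built on, applied to whole names. It is not the C# code from GitHub: the function names delete_variants, build_index and candidates, the sample street names and the default max_distance of 2 are all assumptions chosen for illustration. The point it shows is that spaces and punctuation are treated like any other character, so a multi-word name is indexed exactly like a single word.

```python
from collections import defaultdict
from itertools import combinations


def delete_variants(term, max_distance):
    """Return the term plus every string made by deleting up to max_distance characters."""
    variants = {term}
    for k in range(1, min(max_distance, len(term)) + 1):
        # Choose which character positions to keep; the rest are deleted.
        for kept in combinations(range(len(term)), len(term) - k):
            variants.add("".join(term[i] for i in kept))
    return variants


def build_index(names, max_distance=2):
    """Map every delete variant to the dictionary names that produce it."""
    index = defaultdict(set)
    for name in names:
        for variant in delete_variants(name, max_distance):
            index[variant].add(name)
    return index


def candidates(index, query, max_distance=2):
    """Collect dictionary names that share at least one delete variant with the query."""
    found = set()
    for variant in delete_variants(query, max_distance):
        found |= index.get(variant, set())
    return found


# Hypothetical sample data: spaces and the trailing dot are ordinary characters.
index = build_index(["Main Street", "Maine Avenue", "Market St."])
print(candidates(index, "Mian Stret"))  # {'Main Street'}
```

Sharing a delete variant only makes a name a candidate. A real lookup still has to compute the actual edit distance for each candidate and discard those that exceed the allowed maximum.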

For every similar term (or phrase) found in the dictionary the algorithm gives you the Damerau-Levenshtein edit distance to your input term (look for suggestion.distance in the source code). The edit distance describes how many characters have been added, deleted, altered or transposed between the input term and the dictionary term. This is a measure of similarity between the input term (or phrase) and similar terms (or phrases) found in the dictionary.
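As an illustration of what that distance measures, the following Python sketch computes the optimal string alignment variant of the Damerau-Levenshtein distance and uses it in a brute-force lookup. Everything here is an assumption made for the example: the names osa_distance and lookup, the best_only flag (standing in for the verbosity setting) and the sample streets are hypothetical, and the linear scan over all names replaces SymSpell's fast index purely to keep the sketch short.

```python
def osa_distance(a, b):
    """Count the insertions, deletions, alterations and adjacent swaps that turn a into b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # delete a character
                d[i][j - 1] + 1,         # add a character
                d[i - 1][j - 1] + cost,  # keep or alter a character
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transpose two neighbours
    return d[-1][-1]


def lookup(names, query, max_distance=2, best_only=True):
    """Return (name, distance) pairs within max_distance, closest first."""
    scored = sorted((osa_distance(query, name), name) for name in names)
    matches = [(name, dist) for dist, name in scored if dist <= max_distance]
    if best_only and matches:
        best = matches[0][1]
        matches = [m for m in matches if m[1] == best]
    return matches


# Hypothetical sample data.
streets = ["Main Street", "Maine Street", "Market St."]
print(lookup(streets, "Mian Street"))
# [('Main Street', 1)]
print(lookup(streets, "Mian Street", best_only=False))
# [('Main Street', 1), ('Maine Street', 2)]
```

"Mian Street" is one transposition away from "Main Street" and needs one transposition plus one added character to become "Maine Street", which is why the second call returns both names with distances 1 and 2.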


Originally published at blog.faroo.com on September 29, 2015.