Company Name Matching

Michiel Nijhuis
DNB — Data Science Hub

--

We have all been there: you have found two interesting datasets that could really supplement each other, but… you have no way of joining them together. When analyzing the data, you start searching for a way to join both datasets. The only field you find to join the datasets, is a name field and you discover that the spelling of these names is inconsistent to say the least. At the Dutch Central Bank, we frequently encounter this problem. We get company names from different sources, but sometimes a consistent identifier for these companies is lacking. In order to deal with this problem, we have created a Python package for fuzzy company name matching. In this blog, I will go over the steps that we take to be able to match company names and leverage the various datasets.

How We Match Names

When matching names, you often have large databases that need to be joined. Most name matching algorithms are computationally expensive if you take into account that each of the names should be analyzed pairwise, so for two datasets of 10.000 names, that will be already 100.000.000 pairwise comparisons. That is why we start with preprocessing to get the most out of perfect name matches. Next, we apply cosine similarity to choose candidates for the fuzzy name matching and we perform the fuzzy matching algorithms only on these data. Lastly, we do some postprocessing to determine how well two names actually match.

Preprocessing

Before trying to match company names, it is useful to do some preprocessing of the data, making the data easier to match. We will do this in several steps. Say we have the following company name:

We start by removing all capital letters.

Next, we replace non-ASCII characters.

Then, we remove punctuation, i.e. remove any character that is not a word or space character with nothing.

We remove common legal business suffixes, using a package called cleanco, which is able to process company names and remove terms referring to organization type.

Finally, we remove the most common words using regular expressions.

The idea behind this is to bring the name back to it essence and compare that, as most of the name similarity scores are normalized based on the length of the string. Obviously, depending on the data you have, not all of these steps are necessary. With the preprocessing done, we proceed with approximate string matching.

Cosine Similarity

Using cosine similarity is necessary as the more advanced string matching algorithms are computationally more complex. In this way, the potential number of matches can be reduced from a few million down to about fifty. This is done via the conversion of a string to an n-gram and applying a tf-idf transform.

This results in a sparse matrix with the size of all of the unique n-grams that occur in the dataset. In this matrix, only the elements which link to the n-grams present in the company name are filled. By calculating the dot product between the matrix for the entire dataset and the matrix for the names we want to match, we can get the cosine similarity between the two. From this cosine similarity, we can then run an partition function to select the top fifty best matches. For these matches, we can apply the fuzzy string matching.

Fuzzy String Matching

For the fuzzy matching of company names, there are many different algorithms available out there. To match company names well, a combination of these algorithms is needed to find most matches. Depending on the differences between two company names, different algorithms should be used. In this case, we will be using three algorithms which I will now discuss in turn.

Discounted Levenshtein

The first way in which we judge how well two strings match, is the discounted Levenshtein distance, using the abydos package. The Levenshtein distance can be obtained by changing one string to another by substitution, insertion and deletion. The discounted version is a variation on the Levenshtein distance where differences at the end of the string are penalized less than those at the beginning. This is handy for company names as suffixes to names are far more common than prefixes.

A score of 1 implies a perfect match. Here you can see that even though pepsi cola company has more letters in common and requires fewer edits then coca-cola group, it is still ranked lower, because of the discounting of the edits further down the name string.

String Subsequence Kernel Similarity

A different way of trying to match strings is by looking at possible substrings between the two strings. By dividing the name into (non)-continuous substrings a difference between the two sets of substrings can be determined. An SVM can subsequently be applied to generate a difference score between the two strings. For longer names, this gives a better idea of the matching.

You can see that even writing out the full legal suffix of the company has less of an effect then making a few typo’s in the company name. This allows us to also match long names and names with large differences in length well.

Token Sort

The last metric we now take into account is the token sort distance, which first tokenizes the data, then sorts the tokens (using the thefuzz package). Based on these sorted tokens a Levenshtein distance can be determined. This is especially useful when the words from the company name get scrambled.

Here, you can see that the switching around of words no longer affects the score that a match will get.

Post Processing

After applying the fuzzy matching, we have a score indicating how well two company names match for each of the algorithms. These scores can be combined to get a score for how well the two company names match. Depending on the goal of the name matching, some post processing might be necessary. During the post processing you can flag potential false positives. When matching fund names, for instance, it often occurs that you have different rounds of a fund, e.g. Sustainable Equity Fund I and Sustainable Equity Fund II. These give a high matching score, but should be differentiated in some cases. Specifically scanning for these kinds of differences and flagging these results can avoid making these false positive matches.

NameMatcher

In order to simplify our name matching process, we developed a name matching Python package. In this package, we can initialize a NameMatcher class object with the required preprocessing steps and the top n matches that should be returned from the cosine similarity step.

Next up we can set the algorithms we want to use for the fuzzy name matching.

We can then load in our two datasets and indicate which column should be used for the name matching.

The package will perform the name matching and provide us with the best matched options from the dataset including the score.

Conclusion

Using our name_matchingPython package, we can easily match the names of companies with many different algorithms depending on out data. With a scores between the 0 and 100 for each of the matches, we can also choose how many false positives we can accept. So in cases where we really need to be sure, a score of 95 or higher is used as threshold, while in other cases it will be lower. Checking the matches near this threshold gives us an idea about the number of false positives /negatives in our matched data.

At DNB, we are receiving more and more data where both a company name and an identifier is used. Based on these kind of datasets, we can use different company names with the same identifier to build a list of alternatives for a company name. The resulting dataset can be a training dataset for a neural net based name matching approach once we have enough data, taking the process of name matching one step further in the future.

TL;DR

In order to match company names from different datasets not sharing any identifiers, we developed a Python package called name_matching, to help us with that problem. It is available on the DNB Github.

--

--