Matching company names at scale!

Max Baak
inganalytics.com/inganalytics
May 5, 2024

ING Bank’s Entity Matching Model (emm) has been open-sourced. Go check it out!

TL;DR

At ING Wholesale Banking Advanced Analytics we have open-sourced the Entity Matching Model (EMM) Python package, which solves the problem of matching company names between two possibly very large datasets. Our solution is designed to handle at speed large datasets with millions of names. EMM has both a Pandas and Spark implementation, giving identical name-matching results. We invite you to try it out!

The problem we solve

The problem at hand is to match company names between two datasets, both possibly very large. For real-world datasets identifiers such as LEI (Legal Entity Identifier) codes may be unavailable, or only partially available. By solving the name matching problem, we can still use datasets where identifiers are missing. See the figure for an example.

There is the ground truth (GT) list of names (on the left), often a carefully curated set, to which other names are matched. Names from an external data set (on the right), possibly of very low quality, are matched to the GT. For each name to match, we calculate the similarity to a relevant subset of all names from the GT (grey arrows), and then select the best matches (orange arrows).

Name matching is a quadratic problem, one that easily becomes computationally intensive for large datasets. The longer the GT, the more good-looking false-positive candidates are found per name to match. For example, take a GT set with 10M names and an external dataset with 30M unique names. Even with an algorithm that compares 10k name-pairs per second, matching all names to the full GT would take almost 1000 years!
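That back-of-the-envelope estimate is easy to verify:

```python
# Rough cost of brute-force matching: every external name against every GT name.
gt_size = 10_000_000          # ground-truth names
external_size = 30_000_000    # unique names to match
pairs_per_second = 10_000     # throughput of the comparison algorithm

total_pairs = gt_size * external_size
seconds = total_pairs / pairs_per_second
years = seconds / (60 * 60 * 24 * 365)
print(f"{total_pairs:.1e} pairs -> {years:.0f} years")  # ~951 years
```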

At ING Wholesale Banking Advanced Analytics we have used the EMM package for multiple years to do company name matching at scale. On our cluster (~1000 nodes), the example name-matching problem above can be performed in about an hour.

Example run

As a quick example of the emm library, first install it:

pip install -U emm

Then, in Python, one can do:
from emm import PandasEntityMatching
from emm.data.create_data import create_example_noised_names

# generate example ground-truth names and matching noised names, with typos and missing words.
ground_truth, noised_names = create_example_noised_names(random_seed=42)
train_names, test_names = noised_names[:5000], noised_names[5000:]

# two example name-pair candidate generators: character-based cosine similarity and sorted neighbouring indexing
indexers = [
{
'type': 'cosine_similarity',
'tokenizer': 'characters', # character-based cosine similarity. alternative: 'words'
'ngram': 2, # 2-character tokens only
'num_candidates': 5, # max 5 candidates per name-to-match
'cos_sim_lower_bound': 0.2, # lower bound on cosine similarity
},
{'type': 'sni', 'window_length': 3} # sorted neighbouring indexing window of size 3.
]
em_params = {
'name_only': True, # only consider name information for matching
'entity_id_col': 'Index', # important to set both the id and name columns
'name_col': 'Name',
'indexers': indexers,
'supervised_on': False, # no supervised model (yet) to select best candidates
'with_legal_entity_forms_match': True, # add feature that indicates match of legal entity forms (e.g. ltd != co)
}

# 1. initialize the entity matcher
p = PandasEntityMatching(em_params)

# 2. fitting: prepare the indexers based on the ground truth names, e.g. fit the tf-idf matrix of the first indexer.
p.fit(ground_truth)

# 3. create and fit a supervised model for the PandasEntityMatching object, to pick the best match (this takes a while)
# input is a "positive" names column 'Name', all of which are supposed to match the ground truth,
# and an id column 'Index' to check which candidate name-pairs are matching and which are not.
# A fraction of these names may be turned into negative names (= no match to the ground truth).
# (internally, candidate name-pairs are automatically generated, these are the input to the classification)
p.fit_classifier(train_names, create_negative_sample_fraction=0.5)

# 4. scoring: generate pandas dataframe of all name-pair candidates.
# The classifier-based probability of match is provided in the column 'nm_score'.
# Note: can also call p.transform() without training the classifier first.
candidates_scored_pd = p.transform(test_names)

# 5. scoring: for each name-to-match, select the best ground-truth candidate.
best_candidates = candidates_scored_pd[candidates_scored_pd.best_match]
best_candidates.head()

The example illustrates all relevant steps in the name-matching process: (1) initialising the PandasEntityMatching class, (2) feeding it the ground truth dataset, (3) fitting the name-matching classifier, (4) matching a set of names, and (5) picking the best matches.

This example is based on a sample of Dutch Chamber of Commerce (KvK) data. The best candidates look as follows:

Every name-to-match (column ‘name’) has a ‘preprocessed’ equivalent, and is matched to ‘gt_preprocessed’, the processed equivalent of each GT name available (‘gt_name’). ‘score_0’ and ‘score_1’ are the (cosine) similarity scores of the two indexers used. ‘nm_score’ is the name-matching score of the classifier.

For Spark you can use the SparkEntityMatching class instead, with the same API as the Pandas version. For this, install emm as:

pip install -U emm[spark]

For more examples see our tutorial notebooks.

About the EMM package

The EMM package solves two problems in order to perform efficient company-name matching at scale, namely:

  1. Selecting all relevant name-pair candidates quickly enough, and
  2. From those pairs accurately selecting the correct matches using tailored features.

For both steps we have developed fast, intelligent, and tailored solutions.

The selection of all relevant name-pairs is called the “indexing” step, consisting of a number of unsupervised indexing methods that select all promising name-pair candidates. To solve the speed problem, internally EMM uses the sparse_dot_topn package for making name-pairs, which is also built, maintained and open-sourced by ING. sparse_dot_topn calculates the best matches using efficient multiplication of sparse matrices and selecting the top results. See here for more details.
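sparse_dot_topn's own API is best taken from its documentation; the core trick it implements (multiplying two sparse TF-IDF matrices and keeping only the top entries per row) can be sketched with plain scipy and scikit-learn. The tiny name lists below are illustrative only:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

gt = ["ing bank", "rabobank", "abn amro bank"]
to_match = ["ing bnak", "abn amro"]

# Character-bigram TF-IDF, fitted on the ground truth (as the indexing step does).
vec = TfidfVectorizer(analyzer="char", ngram_range=(2, 2))
gt_mat = vec.fit_transform(gt)      # sparse |GT| x |vocab|
q_mat = vec.transform(to_match)     # sparse |queries| x |vocab|

# One sparse matrix product yields all pairwise cosine similarities
# (rows are L2-normalised by TfidfVectorizer, so dot product == cosine).
sims = q_mat @ gt_mat.T             # sparse |queries| x |GT|

# Keep only the single best GT candidate per query (top-1 here; EMM keeps top-n).
best = np.asarray(sims.todense()).argmax(axis=1)
print([gt[i] for i in best])        # -> ['ing bank', 'abn amro bank']
```

Even for this toy input the similarity matrix is sparse: "ing bnak" shares no bigrams with "rabobank", so that entry is never stored, which is exactly what makes the approach scale.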

The second stage is called the supervised layer, and is done using a classification model that is trained to select the matching name-pairs. This is particularly relevant when there are many good-looking matches to choose from.

EMM can perform company name matching with or without the supervised layer present. A name-pair classifier can be trained to give a string similarity score or a probability of match (more on these below). For this, a training dataset of so-called positive names needs to be provided by the user. Positive names are alternative company names (e.g. with missing words, misspellings, etc.) known to match the GT dataset.

If no positive names are available, these can be created artificially with EMM by adding variations to the list of ground truth names. (These variations are only semi-realistic so this is a suboptimal solution.) Alternatively, when a list of names to match is available a user can manually label a subset of name-pairs that come out of the indexing step as correct and incorrect matches, and then simply train the supervised model on those. (EMM does not provide a labelling tool, but there are many around.)
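The idea behind artificially generated positive names can be sketched as follows. This toy function is our own illustration, not EMM's create_example_noised_names:

```python
import random

def noise_name(name: str, rng: random.Random) -> str:
    """Derive a semi-realistic variant of a ground-truth name:
    randomly drop a word or swap two adjacent characters (a typo)."""
    words = name.split()
    if len(words) > 1 and rng.random() < 0.5:
        words.pop(rng.randrange(len(words)))  # drop one word
        return " ".join(words)
    chars = list(name)
    if len(chars) > 1:
        i = rng.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]  # swap two characters
    return "".join(chars)

rng = random.Random(42)
print(noise_name("dutch tulip trading bv", rng))
```

Each noised name keeps its GT identifier, yielding labelled positive name-pairs for free.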

Under the hood

The EntityMatching pipeline consists of two to four components, where the last two are optional:

  1. Preprocessor: Cleaning and standardisation of input names and their legal entity forms.
  2. Candidate selection: Generation of name-pair candidates, also known as indexing. Here we care about the running time and catching all relevant potential matches. Three indexers are available in the EMM package to do so: Word-based cosine similarity, Character n-gram based cosine similarity, and Sorted neighbourhood indexing. For the latter recordlinkage is used.
  3. Supervised model (optional): The classification of each name-pair, in order to pick the best name-pair candidate. This is optional but crucial for the accuracy of the model. We use xgboost as classifier.
  4. Aggregation (optional): Optionally, the EMM package can also be used to match a group of company names that belong together, with a single company name in the ground truth.

See the API for more details on these components.

The EMM library contains both a Pandas and a Spark implementation of EntityMatching, with nearly identical APIs. The Pandas version is much more lightweight and meant for smaller datasets: it has far fewer dependencies (no Spark) and no initialisation overhead. The Spark version, however, scales to much larger datasets.

Specialised matching features

Four types of input features are used in the company name matching:

  • String-based features: Multiple, conventional edit-distance based metrics such as Cosine similarity, Levenshtein or Jaro distance. For these the rapidfuzz package is used.
  • Rank features for a calibrated model: Powerful features to quantify differences between the various name-pair candidates that belong to the same name-to-match.
  • Legal entity form based features: Legal entity forms, such as: limited, ltd, plc, co, etc., can be extracted from the business names and compared for an exact, partial, or no match. For this the cleanco package is used.
  • Extra features: E.g. country comparison, or address, or other information available.
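As an illustration of the string-based features (EMM delegates these to rapidfuzz; the pure-Python version below is only a sketch of one underlying metric), a normalised Levenshtein similarity looks like:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def lev_similarity(a: str, b: str) -> float:
    """Normalise to [0, 1]: 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(round(lev_similarity("ing bank nv", "ing bank n.v."), 2))  # -> 0.85
```

Several such metrics, computed per candidate name-pair, become input features of the classifier.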

The emm package also works for regular names, but the legal entity form features help us specialize in the matching of company names.

The right score for you

Depending on the use-case, the supervised model trained without or with rank features may be preferred. In practice we find that users want one of two possible scores:

  • A string-similarity score: When interested in all potentially good matches to a name, the model without rank features is useful: simply select all candidate pairs with a high similarity score. With a large GT dataset this list will contain false positives, but one of the name-pairs is (likely) the correct match.
  • A probability of match: This score is a probability that a name is a match or not. When only interested in matches with high probability, use the model with the rank features and require a high threshold. Any name-to-match with multiple good candidates will not make it through such a selection, as each one has a reduced probability of match. (E.g. with two good candidates each score reduces to 50% or less.)

The best use of both models could be: use the model without rank features to select any name-pairs with high string similarity. From those pairs, select the one with the highest model score with rank features to get the best possible match.
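That two-step recipe can be sketched over hypothetical candidate rows; the tuples and both score values below are made up for illustration (real EMM output columns were shown earlier):

```python
# Hypothetical candidates for one name-to-match: (gt_name, string_score, rank_score)
candidates = [
    ("ing bank nv", 0.96, 0.48),
    ("ing bank slaski", 0.91, 0.44),
    ("ing groep nv", 0.62, 0.05),
]

# Step 1: keep every pair with a high string-similarity score.
shortlist = [c for c in candidates if c[1] >= 0.85]

# Step 2: among those, pick the pair the rank-feature model scores highest.
best = max(shortlist, key=lambda c: c[2])
print(best[0])  # -> ing bank nv
```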

Conclusion

ING’s Entity Matching Model library emm, specialised in company name matching at scale, has been open-sourced. We invite you to try it out and are happy to hear your feedback!

And find our full documentation at read-the-docs.

Contributors

This package was authored by ING Analytics Wholesale Banking. Kudos to the many data scientists who have contributed to it over the past years! The current maintainers are: Max Baak, Simon Brugman, Tomasz Waleń, Ralph Urlus. And many thanks to Nikoletta Bozika for reviewing this blog.
