How to match tables without unique identifier: An introduction to fuzzy matching
You have two tables, let’s say a customer list from your CRM system, and a list of leads to be targeted by your sales department. Of course, this lead list does not contain an identifier which would make it easy to match it with the CRM list and make sure you don’t contact the same organization twice. How can you solve this problem? It seems quite simple at the first glance. Common spreadsheet tools provide lookup features. However, company names are not always spelled exactly the same in both lists. For example, your CRM contains a customer called “Sun Incorporated”. However, in the lead list, this company is called “Sun Inc”. Lookup functions, however, usually do not find entries that are not exactly the same.
You could find the non-exact matches by going through the list manually. If you have a small data set, this might be a viable option (yet not the most fun task…). As soon as your data set has a few thousand rows, however, this problem should not be addressed manually. It is not only a very strenuous and time-consuming task to identify duplicates with your eyes in large data sets. It is also prone to errors as you actually would have to compare each single pair of entries which is nearly impossible with bigger tables.
Fortunately, this problem can be solved with algorithms. Those algorithms are based on fuzzy rules. The term “fuzzy” refers to the deviations that these rules allow from a perfect match. A fuzzy rule could, for example, allow for two characters within a string variable (i.e. a text variable that does not have a numerical meaning such as a customer’s name) to be interchanged, such as “Cahtrine” instead of “Cathrine”. The closeness between two entries is usually measured by a similarity function. If this function exceeds a predefined threshold (which we call “c”) then the two entries are flagged as a match.
One challenge of developing a fuzzy matching algorithm is to find meaningful similarity functions and fuzzy matching rules. There are hundreds of possible fuzzy rules that could be considered to find matches. One rule could be based on similarity considering characters which are “neighbors” at the keyboard and thus are likely to be interchanged by the person entering the data point. But also translations to other languages or similar meanings should be very close to each other. It could also be any kind of different abbreviation rules and all combinations of neighbor characters in a word that could be interchanged by the person entering the word. This is why fuzzy matching algorithms require vast computational capacities, particularly when big data sets are analyzed. It is another challenge to define the algorithms as smart as possible in order to make run time as short as possible.
Therefore, it is helpful to define fuzzy matching rules that consider the domain and the data generation process of the data under investigation. For user-generated data, it is in general important to consider the keyboard and its language. For product data, the fuzzy matching algorithm should not be case sensitive as products are often entered by a lot of different persons or parties who are used to different conventions.
Where it really becomes complex, is company names. Did you know, for example, that the German automotive company “BMW” is actually called “Bayerische Motorenwerke”? A generic algorithm will hardly find the acronym and the full company name similar.
Another use case for fuzzy matching algorithms is to remove duplicate entries from a database. If the entries match exactly, it is not hard to identify them even with common spreadsheet tools such as Microsoft Excel. However, when you want to delete duplicate entries e.g. from a list of customers, your can use fuzzy matching routines to identify those duplicate entries.
If you want to read more technical stuff about fuzzy matching, we recommend this paper by Microsoft Research: Chaudhuri et al. (2003): Robust and Efficient Fuzzy Match for Online Data Cleaning
Originally published at www.prepjet.de on January 14, 2016.