An Ensemble Approach to Large-Scale Fuzzy Name Matching
by Ranjan Kant and Piyush Sagar Mishra
Databases are a part of everyone’s life — from swiping our cards for the morning coffee to posting photographs of a weekend brunch; from checking bank balances to making an online purchase — we access databases all day long. Changing business requirements and the evolution of the Internet have led to new types of databases, including tightly coupled relational databases, document storage systems, key-value stores, and wide-column stores. As consumers adopt newer data formats such as structured records, free text, images, video, audio, and machine logs, enterprises observe the intersection of business requirements and evolving data formats and, in response, integrate newer algorithms and technology frameworks into their ecosystems.
As the amount and types of data continue to increase exponentially, multiple challenges have emerged, most frequently as a function of the scale and noise in data. One such challenge is Approximate String Matching, or Fuzzy Name Matching, in which, given a name or list of names, the goal is to find the most similar name(s) in a different list. The domain of Fuzzy Name Matching is not new, but with the rise of mobile and web apps, social media platforms, new messaging services, device logs and other open data formats, the nuances of data have grown, making the challenge of name matching increasingly complex.
To appreciate the problem, imagine that you are the owner of an international logistics company. Your delivery team receives several thousand packages every day. For each one, they must scan the package’s bar code, check the delivery details via an online portal, and then enter name and address details into a GPS service to begin the delivery process. Observing this process closely, you notice a significant delay in dispatch time caused by issues with manually entered information such as sender notes or system entries. Some of the common issues the team in charge of entering this data may encounter on a daily basis include:
- Companies with varying prefixes or suffixes: PT Indo Tambangraya Megah, Tbk vs. Tamban graya Megah, Indonesia.
- Names that have been abbreviated: MSFT Corp vs. Microsoft Corporation.
- Names that have a region’s name accompanying the core name: Suzhou Synta Optical Technology.
- Languages and scripts that vary across regions: Colgate vs. 高露洁 (gāo lù jié)
- Names in the system that have been misspelled: Giordano International Limited, HK vs. Gordano Intl’ Ltd., Hong Kong
- Information that exists in the database in varying formats: Key Products — Iron Ore, Copper vs. “…deals with export of iron ore and copper…”
- Names with extra or missing whitespaces: Chu Kong Passenger Transport Co Ltd. vs. ChuKong Passenger Trans. Co.
If there were only a few hundred such records, your team might be able to match the names manually, comparing every string with all the other strings and selecting the similar ones. But your logistics company deals with millions of these records. Manual matching would be not just impractical, but unimaginable. (For a more rigorous formulation of the problem, please refer to [1], [2] & [3].) These strings of text are your only unifying point for these records, so correctly matching similar strings is vitally important. Duplicate customer records can cause poor targeting (repeated customer names), weak delivery (search results gone wrong), or misguided marketing (different products sold to the same customer, or the risk of spamming).
Even though the problem of matching these strings is almost omnipresent — and highly critical to customer service — increasing variability and complexity in textual data continue to make string matching a daunting challenge. While there is an abundance of search tools on the market, name search is a different beast, and requires a fundamentally different approach.
In the following article, we will outline a way to tame that beast. The “ensemble” approach to fuzzy name matching delivers the kind of precision you need to avoid customer problems, and does so at an enterprise scale. We used this approach with a BCG client, in this case a large corporate bank. Before we started the process, the bank required a team of 10 to perform the string matching on a dataset that was 100 times smaller than the bank’s complete customer database.
The bank’s existing process of manual inspection was very good at returning very high-quality matches. The problem was that after the database team filtered the names for quality, the sales team received only tens of customer leads. After we implemented the ensemble fuzzy name matching engine, the results met the same standard of accuracy as the manual process — but increased the number of quality leads 500-fold within a month of the engine’s launch. With the matching engine in place within a total project time of just three months, the bank enjoyed a dramatic leap in the number of leads, required a smaller head count to achieve better results, could more efficiently allocate valuable data-scientist time, and experienced lower operational expenses. This article provides a detailed description of how our BCG team achieved these results for our client.
Existing approaches
The problem of string matching is not limited to a specific industry. From e-tailers that must match millions of incoming search queries with product catalogs, to large government organizations that must match names and addresses for use cases such as identity management and watch-list tracking, a large-scale fuzzy matching engine is a modern enterprise necessity. In a global setting, the growing volume of vernacular content and the vocabulary flexibility across languages and dialects mean that fuzzy matching engines must deal with a host of complex issues, including:
1. Phonetic variations: Kohl’s versus Coles
2. Typographical mistakes: Microsoft vs. Microsft
3. Contextual differences: Company vs. Organization
4. Reordered terms: Sam Hopkins vs. Hopkins Sam
5. Prefixes and suffixes: AJO Technology Company Limited vs. AJO tech Private Co., Ltd.
6. Abbreviations, nicknames, and initials: AJ Wilson vs. Alex Jane Wilson
7. Cross-language semantics: Private limited vs. 私人有限
8. Transliteration differences: Traditional Chinese vs. PinYin
9. Truncated letters and missing or extra spaces: Chu Kong Transport Company vs. ChuKong Transport Co. Ltd.
The challenge of fuzzy matching (and related studies around Entity Resolution, Record Linkage, and other issues) has confronted companies and academia alike for several years. Thanks to the growth of Python and related machine-learning libraries, multiple approaches to tackle such challenges now exist. The problem is that most of the known frameworks (both standalone and hybrid) are either suited to a specific use case or require significant customization before they can be deployed in an enterprise environment. Large organizations such as Amazon, Google and Microsoft have invested significant time and effort into building engines that look for patterns in queries, match the patterns with the search or user context and then provide results or suggestions. Less highly capitalized organizations, though, are still trying to master the art and science of performing approximate string matching at scale. Some of the most common approaches organizations leverage to perform string matching include:
- Common key method for phonetic or information similarity: The crux of this approach is to reduce each string to a key based on its pronunciation or language semantics. Some of the most common algorithms used in this approach include Soundex, Metaphone, Double Metaphone, and Beider-Morse. (A short code sketch of this and the edit-distance method follows this list.)
- Edit-distance method: This method is one of the most frequently used approaches to the fuzzy matching problem, and comes as a standard module in most analytics/BI platforms that support data-processing/ETL options. The basic approach of the algorithms in this family is to count how many character changes (insertions, deletions, or substitutions) are required to get from one name to another. Industry standards such as Levenshtein distance, Jaro–Winkler distance, and the Jaccard similarity coefficient fall under this method.
- Statistical-similarity approaches: A statistical approach takes a large number of matching name pairs (training set) and trains a model to recognize what two “similar names” look like so the model can take a set of two names and assign a similarity score. These statistical approaches work extremely well for severe issues, and can also support names across different scripts. The drawback is that they have a high barrier to entry, since preparing the training dataset with matching names requires significant manual effort — and significant time.
- Word-embedding methods: Nuances in datasets are not always related to the content: They can also arise due to the context in which the information appears. For example, “drug” and “pharmaceutical” mean the same thing in the context of pharma companies, but regular fuzzy matching approaches based on phonetics, edit distance, or lists would not capture this similarity. This problem is typically observed when dealing with organization names instead of proper nouns, and grows more difficult as the number and length of words in a string grows. Word embeddings — numerical vector representations of a word’s semantic meaning — can capture these kinds of differences by looking at the similarity between the mathematical representations of two words. If two words or documents have similar embeddings, they are considered semantically similar; the embeddings of “woman” and “girl,” for example, lie close to one another in vector space. Applied to organizations, the word-embedding method can recognize, for example, that JDR Drugs and JDR Pharmaceuticals are most likely the same company.
- Miscellaneous: Apart from the commonly used approaches highlighted above, companies may also leverage custom algorithms built upon one or more of the above methods, or they may use other techniques such as common-key or short-text-clustering methods.
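To make the first two methods above concrete, here is a minimal sketch using the open-source jellyfish library (an assumption on our part; the article does not prescribe a specific implementation, and any comparable library would serve). It reduces two spellings to phonetic keys and computes edit-based scores:

```python
# A minimal sketch of the common-key and edit-distance methods, assuming the
# open-source `jellyfish` library.
import jellyfish

a, b = "Microsoft", "Microsft"  # a typographical variant from the list above

# Common-key method: reduce each string to a phonetic key and compare keys.
print(jellyfish.soundex(a), jellyfish.soundex(b))      # identical keys: a match
print(jellyfish.metaphone(a), jellyfish.metaphone(b))

# Edit-distance method: count the character operations separating the strings.
print(jellyfish.levenshtein_distance(a, b))            # 1 (a single dropped 'o')
print(jellyfish.jaro_winkler_similarity(a, b))         # close to 1.0
                                                       # (jaro_winkler in older releases)
```

The phonetic keys collapse the typo entirely, while the edit scores quantify how far apart the raw strings are; real pipelines typically combine both signals.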
Challenges with existing approaches
Several comparative experiments have been performed to assess the suitability of algorithms ([4] and [5]), and in almost all of them no single model is able to resolve all the issues. Even though the performance of each algorithm varies based on the context of the problem, the language used, or volume and runtime requirements, most of the name-matching issues can be summarized as:
- Inability to handle multiple scripts: Most of the known algorithms are bound by the constraint of scripts. Some are suited only to Latin, while others might cut across scripts. But most of these algorithms can handle only one script at a time. As such, organizations that deal with multi-lingual data sources or channels cannot use these algorithms as-is.
- Trade-offs between recall and precision: While approaches based on common-key or list-based methods offer very high recall, their precision is extremely poor in high-variance corpora. Statistical-similarity approaches, by contrast, yield very high precision, but perform much more slowly and require a large volume of high-quality training data.
- High compute and runtime: The computing requirement for current state-of-the-art techniques continues to grow exponentially. Even though this issue can be managed fairly easily, runtime remains a significant concern, especially for e-commerce companies whose customers demand instant results and instant gratification.
- Lack of inherent feedback: The designs of most name-matching approaches have one commonality: Over time, they all restrict automated improvements. Heuristics can be applied to accommodate specific cases where a particular algorithm doesn’t perform well, but such fixes can very quickly lead to extremely complex and time-consuming designs.
- Difficulty interpreting stop words: Organizations typically curate custom stop words to clean strings before performing matching operations. Even so, many stop words can play a critical role in the semantics. In most cases, organizations make the call to include or exclude a stop word, based on the approximate proportion of matches affected.
- Handling multiple joining keys: Some organizations may deal with datasets involving multiple keys, each formed from a string field such as customer name and address; or name, description, and other product attributes. In such cases, the engine must be flexible enough to handle different string-matching pipelines, and then aggregate the score across each attribute to derive a unified score and return a list of matches.
- Unranked candidates need a heuristic layer to identify best matches: In many cases, the user expects not just a list, but a ranked list that identifies the “best match,” “second-best match,” and so on. Ranked results are typically resolved using manual rules or thresholds that are fine-tuned over time. However, slight changes in the input-data distribution or similarity threshold can disrupt the entire logic, which is delicately predicated on the training dataset.
- Contextual similarities go unmatched: Content-based methods fail to look at co-occurrence or contextual similarities, and thus fail to identify similar entities such as “drug” vs. “pharmaceutical.” Even though these differences occur only in the secondary strings of names (proper nouns are not affected), they drag down the recall of the overall method. (A short embedding sketch follows this list.)
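As referenced in the last item, here is a minimal sketch of how embeddings surface such contextual matches, assuming pretrained GloVe vectors fetched through gensim’s downloader (the engine described later used custom-trained word2vec models instead):

```python
# A minimal sketch of contextual matching with word embeddings, assuming
# pretrained GloVe vectors via gensim's downloader API.
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")  # small pretrained model (~66 MB)

# Edit distance sees "drug" and "pharmaceutical" as entirely different strings;
# cosine similarity over embeddings recovers the shared context.
print(vectors.similarity("drug", "pharmaceutical"))  # relatively high
print(vectors.similarity("drug", "logistics"))       # noticeably lower
```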
Our Ensemble Approach
In our recent project, the large bank mentioned above wanted to accelerate its growth in highly competitive markets (those in which the average competitor was significantly larger than our client). One of the use cases driving the overall vision was to identify new leads for its solution suite of more than 50 products. To generate leads, our BCG team compared more than 200 open, social, and third-party data sources. From these, we shortlisted 45 data sources, including internal datasets such as transactions, profiles, and historical leads. Since the client’s focus was on the APAC markets, several data sources were available only in regional languages such as Chinese, Thai, and Bahasa. Our end objective was to prepare a data mart that could fuel insight generation and profile identification for more than ten million companies, and then to help our client shortlist and prioritize the leads.
There were five key challenges in building such a data asset:
- Identifying the same companies across datasets: This was the core challenge in the entire design, and is the subject of the bulk of this article. With a volume of data that spanned multiple formats, we faced a formidable challenge in identifying the same companies across different scripts. To do so, we implemented a number of sub-modules:
- Normalize company names and addresses across all data sets and bring them into a standard format.
- Remove noise from names in terms of prefixes, suffixes, stop-words, regional addendum and other parameters.
- Match company names across different data sets.
- Account for language differences among separate pieces of company information presented in Latin, Mandarin, Thai, Bahasa, and other regional languages.
- Choosing the most meaningful information for each record in the dataset: Our second major challenge was interpreting the usefulness of a particular field in terms of completeness, coverage, usability and correctness. This was so challenging because almost all the datasets had the same information present across multiple fields: Key executives vs. key management, shareholders vs. key shareholding executives, buyers/suppliers vs. supply chain network.
- Processing text across each individual dataset to normalize the fields: Each of the 45 datasets had its own nuances:
- The datasets contained information in varying text formats: In some datasets, key products existed as keywords (copper, iron ore), while in others they existed as sentences (“…deals with export of copper, iron ore…”).
- Addresses were incomplete, pruned or incomprehensible in several places. For example, “234-xxx, Cecil Street, Alexandra Ro” or “Patin Majar, 23/245/xxx, Tanjong Pagar, Central Sing.”
- Text was stored using different delimiters (commas, pipes, and other special characters such as quotation marks and hyphens) and was provided by vendors in different formats (JSON, CSV, XLSX, SQL databases, MongoDB, etc.).
- Aggregating information from different data sources: The client expected the master database to host more than 1,000 fields of information for more than 10,000,000 companies. These fields included details such as company profile, address, key executives, shareholders, financials, key buyers, suppliers, product descriptions, competitors and the latest industry developments. Once we had linked all the identical companies, our next step was to bring necessary information from all the data sets into an aggregated data mart.
- Building an audit and feedback mechanism: Since the client expected the framework to generate leads in a self-sustained manner, one of our key tasks was to develop an information audit (quality of the input data, leads and lead information displayed to the user in the presentation layer) and feedback layer that would continuously enhance the logic and improve the quality of leads.
Since we knew the constraints and the available solution frameworks, the best way for us to build a large-scale fuzzy matching system was to “go hybrid.” Over a process that spanned three months of end-to-end development, we iterated on more than 50 different components and tested them across a wide range of use cases (boundary conditions, overflows, format exceptions, etc.) before arriving at the final fuzzy matching engine, which consisted of five core modules:
Module 1: Language detection and transliteration: As highlighted earlier, the data cut across multiple geographies and scripts, rendering most of the existing algorithms unusable without customization. Since the nuances varied across languages, our first step was to identify the language of an incoming string and generate translations in other languages. We tested multiple approaches to perform the language/script normalization. The modules we developed performed the following tasks:
- Detected the language of the string passed: We used two core libraries, langdetect and spaCy, as a foundation, and built a custom wrapper around them to improve accuracy on mixed-language strings.
- Normalized languages: A small number of data vendors provide company names and addresses in multiple languages (for example, EMIS provides names of companies in Chinese and Thai, as well as English). Others, such as CapIQ, make names and addresses available in only one language. It was critical for accuracy to ensure the available information could be mapped across all languages, so we built a transliteration layer that could make information in one language available in all the others. To create this layer, we built a three-stage pipeline:
- Plain-text translation using the Google Translate API for nouns: For data sources with information in only one language, we used the Google Translate API to translate each entity into the other two relevant languages. For example, we would convert a company name in traditional Chinese into English and Thai and add the conversions to the other corpora.
- Syllable-based maximum matching for word alignment: We used a combination of forward and backward syllable-based maximum-matching approaches for transliterating long names and addresses. Our primary idea was to create a mapping between English and other languages’ syllables by preparing a lexicon to identify potential matches, then apply forward and backward maximum-matching algorithms, and finally use a ranking logic to prepare the transliterated text. (A toy sketch of the forward pass appears after this list.)
- Preparing word/character embeddings for multi-lingual comparison of secondary words (Company, Optical, Electricity, etc.): Only 30 percent of the shortlisted vendors provided company details in multiple languages. Therefore, it was critically important to build an offline dataset that could enable comparison of cross-language strings. We followed two mechanisms to solve this problem. First, we leveraged the multi-language datasets from that 30 percent (EMIS, for example, provides names of companies in both Chinese and Thai, as well as in English), prepared custom word2vec models to generate relationships between words and characters in different languages, and used those similarity matrices for transliteration. Our second mechanism was to use pre-trained models (see the pre-trained multilingual vectors and ELMo repositories in the references) to perform word- and character-level segmentation and matching. We then used the final embeddings for matching strings and Named Entity Recognition.
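As referenced above, here is a toy sketch of the forward maximum-matching pass. The lexicon, mapping direction, and input are purely illustrative; the production version paired this forward pass with a backward pass and a ranking step:

```python
# A toy sketch of forward maximum matching against a syllable lexicon,
# the alignment idea described above. Lexicon and input are illustrative.
def forward_max_match(text: str, lexicon: dict, max_len: int = 4) -> list:
    """Greedily consume the longest lexicon entry at each position."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            piece = text[i:j]
            if piece in lexicon:
                tokens.append(lexicon[piece])   # emit the mapped syllable
                i = j
                break
        else:
            tokens.append(text[i])              # unknown character: pass through
            i += 1
    return tokens

# Hypothetical Chinese-to-pinyin syllable mapping.
lexicon = {"高": "gao", "露": "lu", "洁": "jie"}
print(forward_max_match("高露洁", lexicon))      # ['gao', 'lu', 'jie']
```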
Module 2: Pre-processing textual data: We cleaned the text for whitespace, falsely parsed text, and special characters using a standard pipeline that tackled different fields with a combination of these steps (a condensed sketch follows the list):
- Case normalization
- Tokenization performed at a word, segment, or character level depending upon the script
- Special characters treated using a custom list prepared for each language
- Stemming using Porter Stemmer as default, with Regexp Stemmer for non-Latin languages
- Stop word treatment: We deployed three unique sets of stop words into the engine:
- A dynamic word/character importance calculator: Depending upon the script, we picked words for Latin and characters for the Chinese and Thai languages to derive the top-n words in each corpus per attribute (name, address, email and key banking products). (Please note that, for the sake of simplicity, we use words and characters interchangeably in the following sections.)
- List of standard prefixes/suffixes/geographies in Chinese, Thai, Bahasa and English languages
- Replacement words: We fed some replacements into the engine to normalize the content (e.g. coltd as co. ltd.), and identified replacement keywords based on exploratory analyses using word frequency/co-occurrence and business heuristics.
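Here is a condensed sketch of how these pre-processing steps might chain together, assuming NLTK’s Porter stemmer and with tiny illustrative stand-ins for the curated stop-word and replacement lists:

```python
# A condensed sketch of the pre-processing pipeline; the stop-word and
# replacement lists are tiny illustrative stand-ins for the curated lists.
import re
from nltk.stem import PorterStemmer

STOPWORDS = {"ltd", "co", "pt", "tbk", "limited", "company"}   # illustrative
REPLACEMENTS = {"coltd": "co ltd", "intl": "international"}    # illustrative
stemmer = PorterStemmer()

def normalize_name(name: str) -> str:
    name = name.lower()                                 # case normalization
    name = re.sub(r"[^\w\s]", " ", name)                # strip special characters
    tokens = name.split()                               # word-level tokenization
    tokens = [REPLACEMENTS.get(t, t) for t in tokens]   # normalize known variants
    tokens = [t for t in tokens if t not in STOPWORDS]  # drop stop words
    return " ".join(stemmer.stem(t) for t in tokens)    # stem what remains

print(normalize_name("Chu Kong Passenger Transport Co Ltd."))
# -> 'chu kong passeng transport'
```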
Module 3: Named Entity Recognition & Classification (NERC): One of our key challenges was to identify entities in the names and descriptions that could help link companies across the supply chain and establish buyer-supplier relationships (with respect to products and services). Our objective was to use this established flow of goods and money to identify high-value leads (more connections, more product lines, etc.). NERC is the process of recognizing information units such as names (person, organization, and location), numeric expressions (time, date, and money), and percent expressions from unstructured text. Our goal was to develop practical and domain-independent techniques that would enable us to automatically detect named entities with high accuracy. Since none of the existing libraries (NLTK, spaCy, SciPy) provided NER tagging out of the box for our use case, we built a custom NER engine to classify information about company and product descriptions into fields such as Product, Company Feature, Service Offered, or Other. To generate the vocabulary for standard products and services, we gathered information from global industry classification codes such as SIC (Standard Industry Classification) and NAICS (North American Industry Classification System), and regional standards such as HSIC (Hong Kong Standard Industry Classification).
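A minimal sketch of the vocabulary-lookup core of such a classifier follows; the term lists are tiny illustrative stand-ins for the vocabularies harvested from SIC, NAICS, and HSIC, and the production engine layered scoring and disambiguation on top of this basic mechanism:

```python
# A minimal sketch of a vocabulary-driven entity classifier; term lists are
# illustrative stand-ins for the industry-classification vocabularies.
PRODUCT_TERMS = {"iron ore", "copper", "optical lens"}         # e.g. from NAICS
SERVICE_TERMS = {"export", "logistics", "freight forwarding"}  # e.g. from SIC

def classify_terms(description: str) -> dict:
    """Tag vocabulary hits in a free-text company/product description."""
    desc = description.lower()
    return {
        "Product": sorted(t for t in PRODUCT_TERMS if t in desc),
        "Service": sorted(t for t in SERVICE_TERMS if t in desc),
    }

print(classify_terms("Deals with export of iron ore and copper"))
# {'Product': ['copper', 'iron ore'], 'Service': ['export']}
```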
Module 4: Fuzzy Matching: We performed the actual matching in two stages: a low-precision hashing pipeline and a high-precision computation pipeline.
- Hashing Pipeline: Our objective for this first pipeline was to identify almost-definite matches and prepare hashes that could be used in the second pipeline for more accurate (but slower) matching. We assembled four separate threads to run in parallel and perform the following operations:
- For the first thread, we performed HDBSCAN clustering on company names and addresses (as separate corpora). HDBSCAN is an extremely fast clustering approach that uses the density of vectors derived from the input strings to create clusters. To generate the matrix for clustering, we created n-grams at a character level (for non-Latin scripts) and bi-grams and tri-grams at a word level (as the default). We used a tf-idf vectorizer to prepare the input matrix for the clustering exercise, and an extension of HDBSCAN to create different clusters such that the number of clusters was heuristically identified based on the vocabulary size.
- For the second thread, we leveraged a wrapper around the default Fasttext clustering approach. We then made two modifications in the wrapper to optimize speed and accuracy: indexing the dictionary entries by the n-grams contained in them to allow for fast dictionary lookup, and limiting the matching process to words that had at least one n-gram in common.
- We then performed phonetic hashing using NYSIIS and Double Metaphone to generate a 2-pair hash for each string. We chose these algorithms because they create phonetic encodings for English words; Double Metaphone, for example, outputs both a primary and a secondary encoding.
- For auxiliary matching (products and services), we used a combination of NERC and word embeddings to create hashes for fields with product/service descriptions. The NERC block generated a list of relevant keywords per company, which we then passed on to the embedding engine. A Word2Vec wrapper computed a score for each word within the list and compared it element-wise with the items in a pre-computed matrix to generate two outputs: the common elements from the paired list, and their overlapping scores.
- Processor for the stage-2 pipeline: This consisted of a rules-management engine that performed basic matrix multiplication on the output matrices from the stage-1 pipeline to generate the final hash for stage 2. This exercise tested for pairwise occurrences across the approaches and prepared a conjoint hash in two steps:
- If, for example, strings A and B were paired as matches in cluster 2 in approach 1, cluster 4 in approach 2, and cluster 7 in approach 3, then the first hash was prepared as “247.”
- The final hash for each string was prepared using two-character combinations of the previous hash (e.g., ‘24’, ‘47’, ‘27’ in the above case).
- High-Precision Pipeline: Two ensembles ran in parallel for each of the hashes obtained from the previous step. The number of parallel threads was dynamically adjusted in each run based on the number of unique hashes and the average size and variance of companies allocated across different hashes. The first ensemble leveraged a combination of four different edit methods, one chosen per string type based on the length of the string, the number of relevant words, and the script. The second ensemble leveraged a machine-learning pipeline and was used to test for disambiguation of false matches.
- Ensemble 1: We ran four different edit methods in parallel using the multiprocessing module, and computed the net score using a base-weighting approach over one or more of these edit-distance functions:
- Jaro-Winkler
- Hamming
- Damerau-Levenshtein
- Levenshtein
- Post-processor: We used a dynamic threshold for each pair to assign a strict, approximate, or failed tag. The threshold was a function of cross-validation per batch and a randomly selected chunk that a small team then validated manually (within a month of training the model, we were able to reduce the head count needed for this task from 4 to 1). For each pair, we multiplied the scores from stages 1 and 2 and from ensemble 1, and rank-ordered their log scores. We created two boundaries (strict-to-approximate and approximate-to-fail) in the distribution, and passed each entity in the “approximate” bucket through to ensemble 2.
- Ensemble 2: This was a statistical model layer we designed to perform a hard check on the approximate matches generated from the previous step. The ensemble used a combination of Support Vector Machines and Logistic Regression (chosen based on string length) and yielded the final outputs. Training the model involved several steps:
- First, we prepared a custom feature builder that generated more than 30 features using the labeled data for matching. We created the training data using seven datasets that allowed common referencing through the website name and email details (exact string match). We used approximately 11,000 companies with a balanced mix of issues (abbreviations, prefixes/suffixes etc.) to build the features and subsequently train the classification models. Some of the important features prepared in the exercise were:
- Edit distance scores
- Character ratios
- Detection of out-of-order components (PT Indonesia Batin Technology Pvt Ltd vs. Batin Technology, Indonesia, Pte Ltd., PT)
- Missing name (Binary flag)
- Primary and Secondary hashes based on Double Metaphone
- String length, number of distinct words
- Address length
- Truncated name (Binary flag)
- Root text in address (identified using a list of regions and matched using a list method)
- Model training: We prepared the model stage using Random Forest, GBM, and XGBoost, with hyperparameter optimization via sklearn’s standard GridSearchCV (a minimal sketch follows this list). We tested different loss functions based on the script and text length, and made the final selection based on the average precision and its deviation across runs, altering the default scoring method in GridSearchCV to suit the business requirements.
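As referenced above, here is a minimal sketch of that training stage, with synthetic stand-ins for the pair-level features and a GridSearchCV scored by average precision (grid values are illustrative):

```python
# A minimal sketch of the hyperparameter search; X and y are synthetic
# stand-ins for the 30+ pair features and match labels described above.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.random((1000, 5))           # stand-in for the pair-level features
y = (X[:, 0] > 0.5).astype(int)     # stand-in match / non-match labels

grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    scoring="average_precision",    # the swapped-in scorer, as described above
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```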
Module 5: Feedback Layer: Since the pipeline encompassed both deterministic and probabilistic approaches to string matching, we needed a feedback layer that could advise the team on potential modifications and trigger alerts in case of severe imbalance. To do so, we created an automated descriptive reporting layer across three modules and dispatched the reports to the team, which then reviewed them and tuned the pipeline parameters accordingly. The three modules included:
- Named Entity Recognition: Scheduled delivery of most imbalanced entities (similar probabilities across most varying classes) and volatile entities (entities that change their assigned class across consecutive runs).
- Clustering parameters: A performance summary across known matches and mismatches, along with a report on cluster performance metrics across repeated runs (perplexity, silhouette, etc.). (A small monitoring sketch follows this list.)
- Thresholds for approximate and definite matches: List of most frequent terms at the edge of the threshold, and their scores across each individual model (method).
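As referenced above, here is a small sketch of the cluster-monitoring idea: recompute a quality metric (silhouette, in this case) on each run and raise an alert when it drops. The vectors, labels, and tolerance are illustrative:

```python
# A small sketch of run-over-run cluster monitoring; data and tolerance
# below are illustrative stand-ins.
import numpy as np
from sklearn.metrics import silhouette_score

def cluster_health(vectors, labels, previous_score, tolerance=0.05):
    """Return the current silhouette score and whether it warrants an alert."""
    score = silhouette_score(vectors, labels)
    return score, (previous_score - score) > tolerance

rng = np.random.default_rng(0)
vectors = rng.random((200, 10))      # stand-in for the tf-idf matrix
labels = rng.integers(0, 3, 200)     # stand-in cluster assignments
score, alert = cluster_health(vectors, labels, previous_score=0.41)
print(round(score, 3), alert)
```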
The final stack with workflow can be seen below (Fig. 1). Please note that, for confidentiality, we have hidden some of the additional pipelines, such as ETL, normalization, and feature generation. A weekly schedule has been deployed for training the core models, with threshold tuning performed once every two weeks. The ruleset validation and re-tuning (whenever required) is done by the data scientists. The entire workflow is automated using Luigi, and memory management for large matrix computations is handled by Dask. Python’s core logging module is used along with smtplib to generate automated pipeline-performance reports. We also used Flask (a lightweight Python framework for API development) to create the serving layer, which can handle thousands of requests per second (easily scalable up to 7,000 queries per second in the current format), far beyond the typical demands of a corporate banking environment. (A minimal sketch of such an endpoint appears below the figure.)
Figure 1 — Process architecture of the fuzzy matching ensemble
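As mentioned above, here is a minimal sketch of what such a Flask serving layer might look like; the route name, payload shape, and match_names helper are hypothetical stand-ins:

```python
# A minimal sketch of a Flask serving layer; route, payload shape, and the
# match_names helper are hypothetical.
from flask import Flask, jsonify, request

app = Flask(__name__)

def match_names(query: str, top_n: int) -> list:
    """Stand-in for the full matching pipeline described above."""
    return [{"name": query, "score": 1.0}][:top_n]

@app.route("/match", methods=["POST"])
def match():
    payload = request.get_json(force=True)
    matches = match_names(payload["name"], payload.get("top_n", 5))
    return jsonify({"query": payload["name"], "matches": matches})

if __name__ == "__main__":
    app.run()  # production traffic would sit behind a WSGI server, not app.run()
```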
Concluding thoughts
In addition to dramatically increasing sales leads by a factor of 500, our design for the large-scale fuzzy name-matching engine also met our client’s goals in terms of both speed and accuracy. The design achieved a precision of 0.99 on the same-language test set and 0.96 on the cross-lingual test set, with a recall of over 0.92 in both instances. In a typical daily scenario, the engine matches more than 5 million strings in less than two hours, and while we didn’t perform manual validation on content beyond 8,000 strings, our precision in each review has been between 94 and 99 percent.
In addition to these bottom-line benefits, our three-month client engagement also resulted in a number of insights that we believe are of great value to any company that is intent on solving the puzzle of Approximate String Matching or Fuzzy Name Matching.
- As clichéd as it may sound, no single name-matching method can address all the nuances found in textual data. Most methods will accomplish 80 percent of a design solution. It is the final 20 percent that requires experience and ingenuity.
- The road to such designs can be extremely challenging. At first, every thread in the pipeline seems daunting: the long run times to train the model and perform stress tests, the exceptions that surface after every change to a component in the design, and the manual chore of going through the match lists to adjust thresholds and identify new methods to adopt. Surmounting these ongoing challenges requires considerable patience and dedication.
- It is critical to invest effort up front to design a good test/validation data set — one that encompasses all possible variations. It is equally important to maintain sufficient sample size to accommodate probabilistic methods.
- Managing small blocks such as stop words, threshold adjustment and clustering parameters takes significant mental energy. The boundaries where manual intervention is required are the most challenging. Successfully meeting this challenge requires a strong collaborative community of business stakeholders, native speakers and data scientists.
- Creating ranked lists is no small task. While in some cases it may be sufficient to identify the universe of all possible matches, the need for rank ordering takes the complexity a few notches higher. A prior understanding of cases in which ranked/unranked output was required can significantly improve the design process.
- Setting up automated tests at the start can save a great deal of subsequent effort. As observed in large software projects, complex ensemble methods like the one described above need significant structure and readiness across all pieces. Our design included more than 150 automated tests, ranging from small unit tests for assessing the language to large integration tests across engines. Even though tests can be set up at any time during the process, doing this work earlier in the process has tremendous benefits during debugging, scaling and user testing.
- Non-English strings (especially Chinese) were a bottleneck until the final week of development. Up until the end, we spent a significant amount of time implementing Chinese/Mandarin matching methods, testing bleeding-edge libraries like Jieba, preparing large custom word2vec embeddings, and iterating over Google Translate. These are valuable tools, but we note that there is huge potential to further improve the precision of all these methods.
References (cited where applicable in the text)
- Algorithms for approximate string matching: https://www.sciencedirect.com/science/article/pii/S0019995885800462
- A guided tour to approximate string matching: https://www.dcc.uchile.cl/TR/1999/TR_DCC-1999-005.pdf
- Approximate String Matching Algorithms: A Brief Survey and Comparison: https://www.semanticscholar.org/paper/Approximate-String-Matching-Algorithms%3A-A-Brief-and-Hasan-Ahmed/209963725178129edba7463e2269bb4343758114
- A hybrid approach to fuzzy name search incorporating language-based and text-based principles: https://journals.sagepub.com/doi/10.1177/0165551506068146
- Indexing methods for approximate string matching: https://users.dcc.uchile.cl/~gnavarro/ps/deb01.pdf
- Natural language processing for Fuzzy String Matching with Python: https://towardsdatascience.com/natural-language-processing-for-fuzzy-string-matching-with-python-6632b7824c49
- An overview of fuzzy matching techniques: https://www.rosette.com/blog/overview-fuzzy-name-matching-techniques/
- A comparison of approximate string matching algorithms: http://www.cs.hut.fi/u/tarhio/papers/jtu.pdf
- Super-fast string matching in python: https://bergvca.github.io/2017/10/14/super-fast-string-matching.html
- GridSearch CV: https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV
- Pre-trained vectors for 30+ languages: https://github.com/Kyubyong/wordvectors
- Jieba Chinese text segmentation: https://github.com/fxsjy/jieba
- Fast approximate string matching in a dictionary: https://users.dcc.uchile.cl/~gnavarro/ps/spire98.2.pdf
- Pre-trained ELMo Representations for Many Languages: https://github.com/HIT-SCIR/ELMoForManyLangs