Build a Scalable Search Engine for Fuzzy String Matching

Christoph Ostertag
talentbase
Jun 28, 2019

We will discuss what fuzzy string matching is, what it is useful for, why it does not scale by default, and how we can scale it to applications with millions of words to search among by using smart simplifications. If you want to build a fuzzy search engine or understand how these problems can be tackled, this article is for you.

Intro

In fuzzy matching, our goal is to score how close string A is to string B. We want to find similar strings. For example, “mayor” could be very close to “major”, “threat” very close to a typo like “thraet”, and “Christoph Alexander Ostertag” very close to “Christoph Ostertag”. There is a variety of algorithms for calculating these distances.

Levenshtein distance

The minimum number of substitution, deletion, and insertion operations needed to transform string A into string B. “Major” and “Mayor” need only one operation: replace the “j” with a “y”.
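
As a rough sketch, the classic dynamic-programming formulation could look like this in Python (the function name is just illustrative; in practice a library such as RapidFuzz or python-Levenshtein is much faster):

def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions to turn a into b."""
    # previous[j] holds the distance between the first i-1 characters of a and the first j of b
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # replace ca with cb (free if they match)
            ))
        previous = current
    return previous[-1]

print(levenshtein("Major", "Mayor"))  # 1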

Jaccard distance

The share of subparts that strings A and B have in common (their intersection divided by their union). These subparts can be entire words, but are usually n-grams. N-grams are parts of a string consisting of n characters, produced by a sliding window. With n = 3, “Major” splits up into {“Maj”, “ajo”, “jor”}. In this case every element is actually different from those of “Mayor”, which splits up into {“May”, “ayo”, “yor”}. We could use a lower n to find similarities. However, this would mean we also match smaller parts of the strings, which may appear more often across the set of strings.

(Remember the n-grams, they will be important later!)
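
A minimal sketch of n-gram based Jaccard similarity (the helper names ngrams and jaccard are just illustrative):

def ngrams(s: str, n: int = 3) -> set:
    """Set of n-grams produced by sliding a window of size n over the string."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a: str, b: str, n: int = 3) -> float:
    """Shared n-grams divided by all distinct n-grams (intersection over union)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 1.0

print(jaccard("Major", "Mayor", n=3))  # 0.0 -> no trigrams in common
print(jaccard("Major", "Mayor", n=2))  # ~0.33 -> the bigrams "Ma" and "or" match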

Cosine similarity

Cosine similarity is related to the Jaccard distance. We split our strings into n-grams as well. Once we have done that, however, we first tokenize the n-grams and then vectorize the tokens. It could look something like this:

  1. N-gram: “Major” -> {“Maj”, ”ajo”, ”jor”}
  2. Tokenize: {“Maj”, ”ajo”, ”jor”} -> {0, 2, 3}
    Example dictionary for the tokenizer:
    {“Maj” : 0, “ped” : 1, ”ajo” : 2, ”jor” : 3, …}
  3. Vectorize: This is essentially count (bag-of-n-grams) encoding, i.e. one-hot encoding with counts. The token number becomes the index into the vector, and we count how often each n-gram appears.
    {0, 2, 3} -> [1, 0, 1, 1]

Once we have vector representations, we normalize them, which means scaling the vectors to length 1. Then the cosine similarity is the cosine of the angle between the vectors, i.e. their dot product after normalization.
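
A minimal sketch of cosine similarity on n-gram count vectors (here a Counter stands in for the explicit token dictionary; something like scikit-learn's CountVectorizer with a character analyzer can do the tokenize-and-vectorize step for you):

from collections import Counter
from math import sqrt

def ngram_counts(s: str, n: int = 3) -> Counter:
    """Count how often each n-gram appears in the string."""
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine_similarity(a: str, b: str, n: int = 3) -> float:
    """Cosine of the angle between the n-gram count vectors of a and b."""
    va, vb = ngram_counts(a, n), ngram_counts(b, n)
    dot = sum(count * vb[gram] for gram, count in va.items())  # only shared n-grams contribute
    norm_a = sqrt(sum(c * c for c in va.values()))
    norm_b = sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity("Anastasia", "Anastassia"))  # high, the names share most trigrams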

Cosine similarity. Source: https://www.oreilly.com/library/view/statistics-for-machine/9781788295758/assets/2b4a7a82-ad4c-4b2a-b808-e423a334de6f.png

One dirty little Problem — Scalability

Let's do a simple calculation. In our use case, we have a name database of 10 MB. That doesn't sound like big data, right? Now, let's say every word consists on average of 10 characters encoded with 1 byte each. This means we have a database of 10 MB / (10 bytes/word) = 1,000,000 words. Not so little anymore, and now it gets interesting. For each word, we want to find the closest words. To do this, we have to compare each of the 1M strings with all 1M strings. Imagine this as a matrix with 1M * 1M entries.

Similarity Matrix. Source: https://kapilddatascience.files.wordpress.com/2016/05/san_francisco_sim_matrix.png

If we use 32-bit floating-point numbers and three metrics per comparison, this adds up to 96 bits (12 bytes) per comparison, or half of that overall because the similarity metrics are symmetric, so we only need one half of the matrix. Multiplied by 1M * 1M, this leads to a staggering 6 TB just to store the matrix. Calculating it is even worse: it comes down to 1.5 trillion similarity computations, which would take months on a normal computer, like the one I am writing this article on.
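
A quick back-of-the-envelope check of these numbers in Python:

words = 10_000_000 // 10          # 10 MB at roughly 10 bytes per word -> 1,000,000 words
pairs = words * words // 2        # the metrics are symmetric, so half the matrix is enough
bytes_per_pair = 3 * 4            # three 32-bit similarity scores = 12 bytes

print(pairs * bytes_per_pair / 1e12)  # 6.0 -> about 6 TB of storage
print(pairs * 3 / 1e12)               # 1.5 -> about 1.5 trillion similarity computations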

So what can we do? First we should look into the Big O notation.

This algorithm scales quadratically: if we double the number of strings to compare, we quadruple the computing time. In Big O notation, T(n) = O(n²). What we would like is O(n), where doubling the number of strings also just doubles the computing time. That, however, means we must compare each of our n strings with only a constant number of other strings.

Can we exclude strings we do not want to compare? Why should we compare “Major” with “Burger King”? Does it make any sense?

If we want to do fewer comparisons, say comparing each of our 1M strings with only 100 candidate strings instead of all 1M, we need to exclude 99.99% of the comparisons. Excluding 99.99% of the comparisons also means reducing computation time by 99.99%.

Build Lookup Dictionaries

Usually, we use indexing for fast search, but this only works when we compare something like (“Anastasia”, “Student”) with (“Anastassia”, “Student”). Then we can match on the exact same string “Student” and find a link from Anastasia to Anastassia. But in our case, we assume we have no such extra information.

How could we match Anastasia with Anastassia? Remember n-grams?
0: {“Ana”, “nas”, “ast”, “sta”, “tas”, “asi”, “sia”}
1: {“Ana”, “nas”, “ast”, “sta”, “tas”, “ass”, “ssi”, “sia”}
Close, right? Six of the seven trigrams of “Anastasia” also appear in “Anastassia”.

Let's just build a lookup dictionary then.

  1. First, we map every n-gram to the indices of all the words it appears in:
    {“Ana” : [0, 1], …, “ssi” : [1], “pap” : [1, 2], …}

  2. Then we build another lookup dictionary that maps every word index to all the other indices it shares at least one n-gram with. This is possible by looping once through the first dictionary:
    {0 : [0, 1], 1 : [0, 1, 2], 2 : [1, 2]}
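
In Python, building both dictionaries could look something like this minimal sketch (the word list and names are just illustrative):

from collections import defaultdict

words = ["Anastasia", "Anastassia", "Papadopoulos"]

def ngrams(s: str, n: int = 3) -> set:
    return {s[i:i + n] for i in range(len(s) - n + 1)}

# 1. Map every n-gram to the indices of the words it appears in.
ngram_to_words = defaultdict(set)
for idx, word in enumerate(words):
    for gram in ngrams(word):
        ngram_to_words[gram].add(idx)

# 2. Map every word index to all indices it shares at least one n-gram with.
candidates = defaultdict(set)
for indices in ngram_to_words.values():
    for idx in indices:
        candidates[idx] |= indices

print(dict(candidates))  # {0: {0, 1}, 1: {0, 1}, 2: {2}}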

Now we only have to compare 0 with 0 and 1, but not with 2, and we only have to compare 2 with 1 and 2. In this tiny example the computational saving is of course minor. But if the average number of candidates per word goes down to, say, 100, we actually save 99.99% of the computing time for calculating our distances (Levenshtein, Jaccard, cosine).

There are some different properties you can use for matching:

  • N-grams of different lengths => n = {2, 3, 4, 5, 6, …}
  • Words from the string => “Christoph” from “Christoph Ostertag”
  • Other properties besides the name => is a student, has the same customer ID or social security number, …
  • Industry/topic bucket => has some words in common with a certain bucket: {basketball, tennis} both belong to sports

Practical considerations

  1. Search Depth:
    We can compare our string only to strings it is in a direct relation with, or we can increase the search depth and also consider strings that have a relation with the strings our string is related to.
    (String A -> strings at depth 1 -> strings at depth 2)
  2. Overmatching — Reducing the Connectivity of the Graph:
    If we match our strings with too many other strings, it can be useful to remove the most common n-grams or other properties we match on. Matching all strings that contain “LLC” with each other is not very useful when we are looking for one specific limited liability company. Therefore, we define how many words a match may appear in before we ignore it, or which percentage of the most frequent matches we want to drop (see the sketch right after this list).
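
A minimal sketch of such a cut-off, reusing the trigram indexing idea from above (the word list and threshold are just illustrative):

from collections import defaultdict

words = ["Alpha LLC", "Beta LLC", "Gamma LLC", "Alphabet Inc"]

def ngrams(s: str, n: int = 3) -> set:
    return {s[i:i + n] for i in range(len(s) - n + 1)}

ngram_to_words = defaultdict(set)
for idx, word in enumerate(words):
    for gram in ngrams(word):
        ngram_to_words[gram].add(idx)

# Drop n-grams that appear in more than a given share of all words, e.g. the "LLC"
# trigram that would otherwise link almost every company name to every other one.
MAX_SHARE = 0.5  # illustrative threshold; tune it to your data
filtered = {
    gram: idxs
    for gram, idxs in ngram_to_words.items()
    if len(idxs) / len(words) <= MAX_SHARE
}

print(sorted(set(ngram_to_words) - set(filtered)))  # [' LL', 'LLC', 'a L'] were dropped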

Conclusion

Making fuzzy matching useful is possible; we just have to think about how to make it scalable and what we actually want to accomplish with it. Do we just want to find typos, or do we want to match companies or people that could be the same? There is a lot we can do, and the discussion should be less about what is the best approach in general and more about which business case we are trying to solve.

If you appreciate the effort I put into this article and want to see more, please 👏 this story and consider following my account.

Follow me on Medium to not miss out on any new posts on AI, machine learning, math, and entrepreneurship!

Co-founder of talentbase. We help data science students to land their first job. https://www.talentbase.tech