String Similarity Algorithms Compared

Appaloosa Store
8 min readApr 5, 2018

--

How we customised mail messages to users by choosing and implementing the most appropriate algorithm.

Our Case

The Context

At Appaloosa, as a Mobile Application Management solution, we are used to dealing with a lot of applications. In this context, we face a particular situation. Some clients create multiple versions of the same app with slightly different names, for example: to easily distinguish between a production version, and a QA environment. Here is an example:

Applications in a store

Once a store is ready, the admin invites users by email to log in to the store and install applications on their device(s).

Some users never log in after receiving the first email invitation. We want to email those users to inform them that, by logging in, they could benefit from useful apps! We also want to show them the three most popular apps they would have access to.

The Problem

By selecting the three most downloaded apps, we may show an application with the same name several times unnecessarily. Listing the same application for various mobile OSes doesn’t add value to the users.

Consider this application list for the store:

Application / Number of Downloads

Twitter (for iOS)/ 250

Twitter test / 220

Twitter (for Android) / 210

Appaloosa Store’s Blog / 200

Kindle / 110

Amazon Kindle / 50

Today, with no string similarity comparator, the top 3 would be populated with different Twitter applications, as they are the most downloaded apps.

Top 3 most downloaded apps

Our objective is to give users some incentive to log in to the store. The user doesn’t need to know that “Twitter” (for iOS), “Twitter” (for Android) and “Twitter test” are the most downloaded apps on the store. The user has enough information knowing that “Twitter” is a popular app in his/her store and available.

So we would like to show the top 3 with a filter on string similarity, and have this result instead:

Top 3 most downloaded apps with a filter on string similarity

In our database, each application has a unique id. So in theory, each application is considered unique. But in practice, we end up having multiple versions of an application with very similar names.

Problem Summary

Given the following store’s applications list (the INPUT):

  1. Twitter
  2. Twitter test
  3. Twitter
  4. Appaloosa Store’s Blog
  5. Kindle
  6. Amazon Kindle

We would like to show our users the 3 most downloaded apps with different names. In this case, we’d like our OUTPUT to be as follows:

  1. Twitter
  2. Appaloosa Store’s Blog
  3. Kindle
=> Apps ranked by number of downloads with a filter on different names.

But this is the OUTPUT we got instead:

  1. Twitter
  2. Twitter test
  3. Twitter
=> Apps only ranked by number of downloads.

We needed a way to determine if two strings are similar.

The process would be to compare names in the order of the store’s application list:

“Twitter” and “Twitter” => SIMILAR so we only take one out of the two

“Twitter” and “Twitter test” => SIMILAR so we only take one out of the two

“Twitter” and “Appaloosa Store’s Blog” => DIFFERENT so we keep the two

There are several algorithms that can help us in this task and we tested some of them.

Research And Tests

We tested several algorithms to compare strings, and selected the one that would better fit our need.

Algorithms testing table

We compared String A and String B to have metrics on the different algorithms.

Rules for string similarity may differ from case to case. If you want to consider “niche” and “chien” similar, you’d use a string similarity algorithm that detects anagrams. Not in our case. An app named “niche” and another named “chien” are very likely to be two completely different apps.

After some brainstorming and research we came up with some algorithm methods that would help our case. The Cosine algorithm proved to be irrelevant for us, as for example it does not seem to take into account the letter order, which leads to an index of 1 (similar string) on an anagram (“niche” and “chien”).

Here is what we noticed:

Levenshtein Algorithm

The Levenshtein distance is the minimum number of single-character edits required to change one word into the other, so the result is a positive integer, sensitive to string length . Which make it more difficult to draw pattern.

For example,

  • the Levenshtein distance between “foo” and “bar” is 3
  • the Levenshtein distance between “beauties” and “beautiful” is also 3
  • For us, humans, the “beauties”/”beautiful” pair is much more similar than the “foo”/”bar” pair. But the Levenshtein distance is the same.

The metric we are using here is the inverse of the Levenshtein distance ( 1 / levenshtein_distance ) so the result is a percentage and is more easily read by us. But the problem mentioned above remained the same.

Where 1 is the result of comparing identical strings, “ShazamIphone” and “ShazamAndroid” have a similarity of 0.167. “chien” and “niche” have a similarity of 0.25.

So from the Levenshtein algorithm point of view Chien/Niche are more similar than “ShazamIphone”/“ShazamAndroid” because fewer edits are needed to get from “chien” to “niche”, than to get from “ShazamIphone” to “ShazamAndroid”.

Trigram Comparison

A trigram algorithm is a case of n-gram, a contiguous sequence of n (three, in this case) items from a given sample. In our case, an application name is a sample and a character is an item.
So The sequence “martha” has 4 trigrams { mar art rth tha }.

We can use the Trigram method to compare two strings.

Taking for example “martha” and the same word with a typo, “marhta”, and we can compute their trigrams:

Trigrams “martha”: { mar art rth tha }

Trigrams “marhta”: { mar arh rht hta }

To measure similarity we divide the number of matching trigrams in both strings: 1 { mar } by the number of unique trigrams: 7 { mar art rth tha arh rht hta }

The result is 1/7 = 14%

To balance the disadvantage of the outer characters (somewhat to reinforce the similarity of strings starting and ending with the same trigrams), we pad the string with blanks on either side resulting in these case in three more trigrams “_ma”, “ha_“ and “ta_”.

Trigrams “ martha ”: { _ma mar art rth tha ha_ }

Trigrams “ marhta ”: { _ma mar arh rht hta ta_ }

Having done that, the number of matching trigrams is up to: 2 { _ma mar }
The number of all unique trigrams: 9 { _ma mar art rth tha arh rht hta ha_ }

The result is now 2/9 = 22%

Using this method to compare “Twitter v1” and “Twitter v2” we have:

The number of matching trigrams: 7 { _tw twi wit itt tte ter er_ }
The number of all unique trigrams: 11 { tw twi wit itt tte ter er_ _v1 _v2 v1_ v2_ }

The result is 7/11 = 63%

The limit of the Trigram method to compare strings is that short strings with one (or two..) different trigrams tend to produce a lower similarity than long ones.

That is how we get a 0.2 similarity between “ShazamAndroid” and “ShazamIphone”, as they have more different trigrams.

The number of matching trigrams is: 5 { _sh sha haz aza zam }
The number of all unique trigrams: 20

As there is a strong dependency with string length, it does not yield a good comparison for us.

Jaro-Winkler Algorithm

“In computer science and statistics, the Jaro-Winkler distance is a string metric for measuring the edit distance between two sequences.

Informally, the Jaro distance between two words is the minimum number of single-character transpositions required to change one word into the other.

The Jaro-Winkler distance uses a prefix scale which gives more favourable ratings to strings that match from the beginning for a set prefix length”

Source: Wikipedia.

Giving “more importance” to words with identical prefixes made the Jaro-Winkler distance seem very interesting for our use case.

Starting from the beginning with the Jaro distance formula, here how it works. Don’t panic, we go step by step:

The Jaro Distance between two sequences s1 and s2 is defined by:

Jaro distance formula

dj is the Jaro distance
m is the number of matching characters (characters that appear in s1 and in s2)
t is half the number of transpositions (compare the i-th character of s1 and the i-th character of s2 divided by 2)
|s1| is the length of the first string
|s2| is the length of the second string

With a exemple. Let’s take “martha” and “marhta”.

m = 6
t = 2/2 =1 (2 couples of non matching characters, the 4-th and 5-th) { t/h ; h/t }
|s1| = 6
|s2| = 6

Just by replacing numbers is the formula, we get:

dj = (⅓) ( 6/6 + 6/6 + (6–1)/6) = ⅓ 17/6 = 0,944Jaro distance = 94,4%

Now we know what is the Jaro distance, let’s jump to the Jaro-Winkler distance.

The Jaro-Winkler similarity uses a prefix scale p which gives more favorable ratings to strings that match from the beginning for a set prefix length l.

p is a constant scaling factor for how much the score is adjusted upwards for having common prefixes. The standard value for this constant in Winkler’s work is p=0.1.

l is the length of common prefix at the start of the string (up to a maximum of 4 characters).

Jaro-Winkler distance formula

So back to the “martha”/ “marhta” example, let’s take a prefix lenght of l = 3 (which refers to “mar”). We get to:

dw = 0,944 + ( (0,1*3)(1–0,944)) = 0,944 + 0,3*0,056 = 0,961Jaro-Winkler distance = 96,1%

Using the JaroWinkler formula we go from the Jaro distance at 94% similarity to 96%.

In our case, most of similar apps start with the same prefix (“twitter v1” vs “twitter v2” or “ShazamIphone” vs “ShazamAndroid” etc. See the Algorithms testing table above). So it is an important criterion to take into account.

The metric

We needed to define a threshold, the value at which we would consider two applications names so similar that we should only display one.

Based on the results shown above and After running some tests on bigger samples of names, we refined the metric to 0.863.

For now we are more than satisfied, and we continue to monitor our users’ lists of Top 3 Apps, in case more work is needed. We’ll keep you posted ;)

--

--