String Similarity Algorithms Compared

How we customised mail messages to users by choosing and implementing the most appropriate algorithm.

Our Case

The Context

Applications in a store
Top 3 most downloaded apps
Top 3 most downloaded apps with a filter on string similarity
  1. Twitter test
  2. Twitter
  3. Appaloosa Store’s Blog
  4. Kindle
  5. Amazon Kindle
  1. Appaloosa Store’s Blog
  2. Kindle
=> Apps ranked by number of downloads with a filter on different names.
  1. Twitter test
  2. Twitter
=> Apps only ranked by number of downloads.

Research And Tests

We tested several algorithms to compare strings, and selected the one that would better fit our need.

Algorithms testing table
  • the Levenshtein distance between “beauties” and “beautiful” is also 3
  • For us, humans, the “beauties”/”beautiful” pair is much more similar than the “foo”/”bar” pair. But the Levenshtein distance is the same.
The result is 1/7 = 14%
The result is now 2/9 = 22%
The result is 7/11 = 63%
Jaro distance formula
m = 6
t = 2/2 =1 (2 couples of non matching characters, the 4-th and 5-th) { t/h ; h/t }
|s1| = 6
|s2| = 6
dj = (⅓) ( 6/6 + 6/6 + (6–1)/6) = ⅓ 17/6 = 0,944Jaro distance = 94,4%
Jaro-Winkler distance formula
dw = 0,944 + ( (0,1*3)(1–0,944)) = 0,944 + 0,3*0,056 = 0,961Jaro-Winkler distance = 96,1%

The metric

We needed to define a threshold, the value at which we would consider two applications names so similar that we should only display one.

Simple & Secure Enterprise App Store @ www.appaloosa.io