String Matching and Database Merging


1. Are two strings the same?

Figure 1. Overview similarity measures
  • Levenshtein distance: Distance proportional number of single-character edits (i.e. insertions, deletions or substitutions) to convert one string into the other.
  • Jaro distance: Distance dependent on the number of matching characters and transpositions.
  • Jaro-Winkler distance: Similar, but different parts of the string weight different, resulting on a better performance for short strings.
  • Sørensen index or dice’s coefficient: Quotient between the number of characters shared and the lengths of the two strings.
  • Jaccard index: Equivalent to the dice’s coefficient. J = D/(2-D).
  • Dice’s coefficient: Quotient between the number of n-grams shared and the total number of n-grams in the two strings.
  • Cosine similarity: Each string is represented as a vector, where the dimensions correspond to the different possible words/n-grams. The dimensions are usually weighted with TF-IDF (term frequency–inverse document frequency). Words/n-grams that appear many times in a string, but only a few times in the set of strings are weight more. The cosine of the angle between the two vectors is the distance measure.

2. Combine measures with machine learning.

  • Unsupervised: The algorithm learns and predict at the same time. For example in clustering you give some points to the algorithm and a number of clusters and it divides the points in the best possible way.
  • Supervised: You need to train the algorithm to use it later. For example in classification you need to give them some examples already classified for the algorithm to be able to predict new values.
  • Support vector machine (SVM): Geometry based, tries to find a group of “planes” that divides our set of points accordingly to its class, in a way that the space between the points and the “planes” is maximum. The “planes” can be linear or not, depending on the kernel used (Figure 2). I used a linear kernel.
Fgure 2. Different kernels in Supoort vector classification. Source: sklearn
  • Logistic regression: Probabilistic model, it assumes a “S” shaped curve. Behaves similarly than LinearSVC in practice.

3. How to measure quality of prediction?

Figure 3. Quality measures
Figure 4. ROC curve


Figure 5. AUC and F1 for all methods
  1. Edit-based measures are generally bad
  • Makes sense. “Star Wars” and “Star Wars: Return of the Jedi” are very different if we attend only to characters.
  • The exception is Jaro-Winkler, which is really good. This is because the extra “Return of the Jedi” doesn’t have a lot of weight in this algorithm.
Figure 6. ROC curves
  • Using words works as well as N-grams.
  • Cosine similarity (TF-IDF) works a little better, especially if having a low false positive rate is very important for you (Fig. 6).
Figure 7. Weights of the features in the Machine Learning models
  • Expected, since they combine both measures.
  • Interestingly, they use some measures that are not the best by themselves. For example the Levenshtein distance, which is horrible by itself is needed for a good prediction (Fig. 7).




Data Juggler. Computational Scientist.

Love podcasts or audiobooks? Learn on the go with our new app.

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store
Javier GB

Javier GB

Data Juggler. Computational Scientist.

More from Medium

Which DDL operation to choose?

What to prepare for SQL Coding Interviews?

Working with dates on SQL — Practice Example 1 (Self Join)