Finding Similarities

Often in data science, we need to find the similarity between two or more strings. For example, we might want to match the string “Celtic won the cup” with “The cup was won by Celtic”. For this, we can use string similarity methods. The most common methods are:

  • Token. This involves finding blocks of text (such as words) that are shared between the two strings, and using these as a match. This method is strong when the same words appear in both strings but in different places. For example, it works very well for matching “Loans and Advances” with “Advances and Loans”.
  • Jaro. This method works by analysing the number of matching characters between two strings and the number of character transpositions required to align them.
  • QGrams. This method involves analysing the number of n-length character sequences that two strings have in common. It is similar in its operation to the edit distance, but performs well when the same prefixes are used between strings.
  • Edit distance. This method analyses the number of changes required to go from one string to another. It works well when there are only a small number of changes between two strings, such as between “Loan and Account” and “Loans and Accounts”, which block methods often struggle with.

Typical methods are Levenshtein, Needleman-Wunsch, Gotoh, Gotoh Window Affine, Jaro, Jaro-Winkler, QGram, Block, Cosine and Euclid.
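To make the difference between these families concrete, here is a minimal Python sketch of three of them. It is an illustration under some assumptions rather than a reference implementation: the function names (token_similarity, qgram_similarity, levenshtein) are hypothetical, the token score is taken to be a Jaccard overlap of lower-cased words, the q-gram score is taken to be a Dice-style overlap of character bigrams, and the edit distance is plain Levenshtein with unit costs. Real libraries may normalise and weight things differently.

```python
from collections import Counter


def token_similarity(a: str, b: str) -> float:
    """Jaccard overlap of lower-cased word sets (word order is ignored)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def qgram_similarity(a: str, b: str, q: int = 2) -> float:
    """Dice-style overlap of q-length character sequences (assumed scoring)."""
    def grams(s: str) -> Counter:
        s = s.lower()
        return Counter(s[i:i + q] for i in range(len(s) - q + 1))

    ga, gb = grams(a), grams(b)
    total = sum(ga.values()) + sum(gb.values())
    if total == 0:
        return 1.0
    shared = sum((ga & gb).values())  # multiset intersection of q-grams
    return 2 * shared / total


def levenshtein(a: str, b: str) -> int:
    """Number of single-character inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Same words in a different order: the token method sees a perfect match.
print(token_similarity("Loans and Advances", "Advances and Loans"))  # 1.0

# Small spelling changes: only "and" is shared as a token, but the
# edit distance is just two inserted characters.
print(token_similarity("Loan and Account", "Loans and Accounts"))    # 0.2
print(levenshtein("Loan and Account", "Loans and Accounts"))         # 2
```

The two example pairs from the list show the trade-off: the token score ignores word order but treats “Loan” and “Loans” as entirely different words, while the edit distance handles the small spelling changes but would penalise reordered words heavily.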

--

Prof Bill Buchanan OBE FRSE
ASecuritySite: When Bob Met Alice

Professor of Cryptography. Serial innovator. Believer in fairness, justice & freedom. Based in Edinburgh. Old World Breaker. New World Creator. Building trust.