In Data Science, Similarity Matters!

We are often searching for something without quite knowing the exact form it might take, or we have misspelt it. You often trust Google to handle this for you: when you search for Edinburger and Casle, Google will often know which words match best, based on their similarity to indexed terms and the probability of the words in searches.
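As a rough illustration of the similarity half of that idea, the sketch below matches a misspelt word against a small list of known terms using Python's standard difflib module. The vocabulary and the closest_term helper are made up for this example, and it ignores the search-probability part entirely, so it is not a description of how Google actually does it.

```python
from difflib import get_close_matches

# Hypothetical list of "indexed terms", invented for this example.
VOCABULARY = ["Edinburgh", "Castle", "Glasgow", "Stirling"]

def closest_term(word, vocabulary, cutoff=0.6):
    """Return the most similar term in vocabulary, or None if nothing is close enough."""
    matches = get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

if __name__ == "__main__":
    for query in ["Edinburger", "Casle"]:
        print(query, "->", closest_term(query, VOCABULARY))
```

The cutoff value controls how forgiving the match is: a higher value rejects more candidates, while a lower one accepts looser matches.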

In cybersecurity, similarity can often help detect our No 1 threat: spear phishing. A spear phishing email will often be created from a template and then customized for the target, so it tends to stay similar to other emails built from the same template.

The Basics

Often in data science, we need to find the similarity between two or more strings. For example, we might want to match the string “Celtic won the cup” with “The cup was won by Celtic”. For this, we can use string similarity methods. The most common methods are:

  • Token. This involves finding similar blocks of text between the two strings and using these as a match. This method is strong when the same words appear in both strings but in different places. For example, it works very well for matching “Loans and Advances” with “Advances and Loans”.
  • Jaro. This method…
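To make the token idea concrete, here is a minimal sketch of two token-based scores using only the Python standard library. The function names token_sort_ratio and token_set_jaccard are chosen for this example and are just one possible way to implement the idea, not necessarily the exact method meant above. Both split on whitespace and ignore case, which is an assumption.

```python
from difflib import SequenceMatcher

def token_sort_ratio(a, b):
    """Sort the words of each string, then compare the sorted strings (0.0 to 1.0)."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, norm_a, norm_b).ratio()

def token_set_jaccard(a, b):
    """Share of distinct words common to both strings (Jaccard index, 0.0 to 1.0)."""
    set_a = set(a.lower().split())
    set_b = set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # two empty strings are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)

if __name__ == "__main__":
    # Same words in a different order: both scores are 1.0.
    print(token_sort_ratio("Loans and Advances", "Advances and Loans"))
    print(token_set_jaccard("Loans and Advances", "Advances and Loans"))
    # Four shared words out of six distinct words overall.
    print(token_set_jaccard("Celtic won the cup", "The cup was won by Celtic"))
```

Because both scores discard word order, they handle reordered phrases well, but they also treat “Bob paid Alice” and “Alice paid Bob” as identical.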

Prof Bill Buchanan OBE FRSE
ASecuritySite: When Bob Met Alice

Professor of Cryptography. Serial innovator. Believer in fairness, justice & freedom. Based in Edinburgh. Old World Breaker. New World Creator. Building trust.