Introduction to Similarity Metrics

Murli Jadhav
Published in Analytics Vidhya
3 min read · Dec 30, 2019

In this blog we discuss how to measure how similar two sentences are. In that context, we look at Jaccard similarity as well as cosine similarity.


Before looking at similarity metrics, you need some knowledge of word embeddings. Here is a brief summary of word embedding techniques.

Types Of Word Embedding:-

Word embedding is a natural language modelling technique for mapping words to vectors of real numbers. The following are some word embedding techniques:-

  1. Count Vectorizer:- The idea is to collect a set of words, sentences, documents or paragraphs and count the occurrence of each word in each document. It is closely related to one-hot encoding. It does not maintain semantic meaning.
  2. TF-IDF:- TF-IDF stands for term frequency–inverse document frequency. It is often used in information retrieval and text mining. It also does not maintain semantic meaning.
  3. Word2Vec:- Word2Vec is the most popular technique, developed by Google in 2013. It maintains the semantic meaning of words.
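To make the first two techniques concrete, here is a minimal pure-Python sketch of count vectors and TF-IDF weighting over a toy corpus (the documents, vocabulary and helper names are illustrative, not from the original post):

```python
import math

# Toy corpus: each document is a list of tokens
docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
vocab = sorted({w for d in docs for w in d})

def count_vector(doc):
    # Count vectorizer: one dimension per vocabulary word, value = occurrence count
    return [doc.count(w) for w in vocab]

def tfidf_vector(doc):
    # TF-IDF: term frequency weighted by inverse document frequency
    n = len(docs)
    vec = []
    for w in vocab:
        tf = doc.count(w) / len(doc)
        df = sum(1 for d in docs if w in d)
        vec.append(tf * math.log(n / df))
    return vec

print(count_vector(docs[0]))
print([round(x, 3) for x in tfidf_vector(docs[0])])
```

Note how a word that appears in every document gets an IDF of log(1) = 0, so TF-IDF automatically downweights common words that carry little information.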

Jaccard Similarity:-

The Jaccard similarity index is also called the Jaccard similarity coefficient. It measures the similarity between two sets. The range is 0 to 100%: the higher the percentage, the more similar the two sets.

Fig: Formula for Jaccard similarity — J(A, B) = |A ∩ B| / |A ∪ B|

The demo code for calculating the Jaccard similarity between two sentences is as follows:-

Code for Jaccard Similarity Between Two Sentences
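As a minimal sketch of that computation (the embedded code image is not reproduced here; this version assumes simple lowercase whitespace tokenization):

```python
def jaccard_similarity(sent1, sent2):
    # Tokenize each sentence into a set of lowercase words
    a = set(sent1.lower().split())
    b = set(sent2.lower().split())
    # Ratio of shared words to total unique words
    return len(a & b) / len(a | b)

print(jaccard_similarity("AI is our friend and it has been friendly",
                         "AI and humans have always been friendly"))
```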

The above is a simple example of finding the similarity between two sentences. If you have a large amount of data, it is better to use the word embedding techniques discussed above. Note that you should lemmatize the sentences before passing them to Jaccard similarity.

When the documents are long, Jaccard similarity does not do well, even if they share some of the same words.

Cosine Similarity:-

Cosine similarity measures the cosine of the angle between two vectors. For cosine similarity we have to convert all sentences to vectors; for that conversion we can use TF-IDF or Word2Vec.

The formula for cosine similarity is:-

Fig: Cosine similarity formula — cos(θ) = (A · B) / (‖A‖ ‖B‖)

The code example is as follows:-

Code for Cosine Similarity Between Two Vectors
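As a minimal sketch (the embedded code image is not reproduced here), cosine similarity over two already-vectorized sentences can be computed in plain Python; the toy count vectors below are illustrative:

```python
import math

def cosine_similarity(v1, v2):
    # cos(theta) = dot(v1, v2) / (||v1|| * ||v2||)
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Toy count vectors for two sentences over a shared four-word vocabulary
v1 = [1, 1, 0, 1]
v2 = [1, 1, 1, 0]
print(round(cosine_similarity(v1, v2), 3))
```

Unlike Jaccard similarity, this uses the full vectors, so repeated words contribute more weight rather than being collapsed into a set.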

For better results you can build vectors using GloVe models, which maintain semantic meaning. Normalizing the vectors also improves results.

Major differences between Jaccard and cosine similarity:-

  1. Jaccard similarity works on the set of unique words, whereas cosine similarity uses the whole sentence vector.
  2. If word duplication does not matter, it is better to use Jaccard similarity; cosine similarity is good for measuring the similarity between two vectors even when duplication is present.
