Introduction to Similarity Metrics

Murli Jadhav
Published in Analytics Vidhya
3 min read · Dec 30, 2019

In this blog we discuss how to measure how similar two sentences are. In that context, we look at Jaccard similarity as well as cosine similarity.


Before looking at similarity metrics, you need some knowledge of word embeddings. Here is a brief summary of word embedding techniques.

Types Of Word Embedding:-

Word embedding is a natural language modelling technique for mapping words to vectors of real numbers. The following are some word embedding techniques:-

  1. Count Vectorizer:- The idea is to collect a set of words, sentences, documents or paragraphs and count the occurrence of each word in each document. It is closely related to one-hot encoding. It does not maintain semantic meaning.
  2. TF-IDF:- TF-IDF stands for term frequency–inverse document frequency. It is often used in information retrieval and text mining. It also does not maintain semantic meaning.
  3. Word2Vec:- Word2Vec is the most popular technique, developed by Google in 2013. It maintains the semantic meaning of words.
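To make the first two techniques concrete, here is a minimal pure-Python sketch of count vectors and TF-IDF weighting over a toy corpus (the documents, vocabulary and helper names are illustrative, not from the original post):

```python
import math

# Toy corpus: each document is a list of tokens
docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat", "ran"]]
vocab = sorted({w for d in docs for w in d})

def count_vector(doc):
    # Count vectorizer: one dimension per vocabulary word, value = occurrence count
    return [doc.count(w) for w in vocab]

def tfidf_vector(doc):
    # TF-IDF: term frequency weighted by inverse document frequency
    n = len(docs)
    vec = []
    for w in vocab:
        tf = doc.count(w) / len(doc)
        df = sum(1 for d in docs if w in d)
        vec.append(tf * math.log(n / df))
    return vec

print(count_vector(docs[0]))
print([round(x, 3) for x in tfidf_vector(docs[0])])
```

Note how a word that appears in every document gets an IDF of log(1) = 0, so TF-IDF automatically downweights common words that carry little information.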

Jaccard Similarity:-

The Jaccard similarity index is also called the Jaccard similarity coefficient. It measures the similarity between two sets. The range is 0 to 100%: the higher the percentage, the more similar the two sets.

Fig: Formula for Jaccard similarity — J(A, B) = |A ∩ B| / |A ∪ B|

The demo code for calculating the Jaccard similarity between two sentences is as follows:-

Code for Jaccard Similarity Between Two Sentences
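As a minimal sketch of that computation (the embedded code image is not reproduced here; this version assumes simple lowercase whitespace tokenization):

```python
def jaccard_similarity(sent1, sent2):
    # Tokenize each sentence into a set of lowercase words
    a = set(sent1.lower().split())
    b = set(sent2.lower().split())
    # Ratio of shared words to total unique words
    return len(a & b) / len(a | b)

print(jaccard_similarity("AI is our friend and it has been friendly",
                         "AI and humans have always been friendly"))
```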

The above is a simple example of finding the similarity between two sentences. If you have a large amount of data, it is better to use the word embedding techniques discussed above. Note that you should lemmatize the sentences before passing them to Jaccard similarity.

When the documents are long, Jaccard similarity does not do well, even if they share some of the same words.

Cosine Similarity:-

Cosine similarity measures the cosine of the angle between two vectors. For cosine similarity we have to convert all sentences to vectors; for that conversion we can use TF-IDF or Word2Vec.

The formula for cosine similarity is:-

Fig: Cosine similarity formula — cos(θ) = (A · B) / (‖A‖ ‖B‖)

The code example is as follows:-

Code for Cosine Similarity Between Two Vectors
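As a minimal sketch (the embedded code image is not reproduced here), cosine similarity over two already-vectorized sentences can be computed in plain Python; the toy count vectors below are illustrative:

```python
import math

def cosine_similarity(v1, v2):
    # cos(theta) = dot(v1, v2) / (||v1|| * ||v2||)
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Toy count vectors for two sentences over a shared four-word vocabulary
v1 = [1, 1, 0, 1]
v2 = [1, 1, 1, 0]
print(round(cosine_similarity(v1, v2), 3))
```

Unlike Jaccard similarity, this uses the full vectors, so repeated words contribute more weight rather than being collapsed into a set.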

For better results you can build vectors using GloVe models, which maintain semantic meaning. Normalizing the vectors also improves results.

Major differences between Jaccard and cosine similarity:-

  1. Jaccard similarity works on the set of unique words, whereas cosine similarity uses the whole sentence vector.
  2. If word duplication does not matter, it is better to use Jaccard similarity; cosine similarity is good for measuring the similarity between two vectors even when duplication is present.
