What is a cosine similarity matrix?

Cosine similarity and its applications.

Vimarsh Karbhari
Jan 28 · 3 min read

Cosine similarity is a metric used to determine how similar two entities are irrespective of their size. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space.

The two vectors could represent two product descriptions, two article titles, or simply two arrays of words.

Mathematically, if ‘a’ and ‘b’ are two vectors, the cosine equation gives the cosine of the angle between the two.
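For reference, the standard definition can be written out as below. Here a · b is the dot product and ‖a‖ the Euclidean norm; the component symbols a_i, b_i and the dimension n are notation introduced for this sketch.

```latex
\cos(\theta)
  = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}
         {\sqrt{\sum_{i=1}^{n} a_i^{2}} \, \sqrt{\sum_{i=1}^{n} b_i^{2}}}
```

Because both norms sit in the denominator, scaling either vector by a positive constant leaves the value unchanged, which is why the measure is independent of size.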

Example:

Source: ML Solutions

This will give us the depiction below of different aspects of cosine similarity:

Source: ML Cosine Similarity for Vector space models.
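The usual boundary cases of cosine similarity (vectors pointing the same way, at right angles, and in opposite directions) can be checked numerically with a small sketch. The helper name `cosine_similarity` and the sample vectors are illustrative choices, not taken from the source.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Illustrative vectors (assumptions, chosen to hit the three boundary cases).
same_direction = cosine_similarity([1, 2], [2, 4])   # parallel, different lengths
orthogonal = cosine_similarity([1, 0], [0, 1])       # 90 degrees apart
opposite = cosine_similarity([1, 2], [-1, -2])       # 180 degrees apart

print(same_direction, orthogonal, opposite)
```

Up to floating-point rounding, the three values are 1, 0 and -1. The first pair differs in length but not in direction, so it still scores as maximally similar.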

Let us see how we can compute this using Python. We have the following five texts:

#Define Documents
Document_A: Alpine snow winter boots.

These could be product descriptions from a web catalog like Amazon. To compute the cosine similarity, you need the count of each word in each document. For this we can use either the CountVectorizer or the TfidfVectorizer from scikit-learn.

# Scikit Learn
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

# Create the Document Term Matrix
# `documents` is the list of document strings defined above, one string per document.
# stop_words='english' drops common English words (the, of, and, ...) before counting.
count_vectorizer = CountVectorizer(stop_words='english')
sparse_matrix = count_vectorizer.fit_transform(documents)

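Each row of this matrix is the count vector of one document, and the cosine similarity between any two rows gives the similarity of the corresponding documents. This is how we can find the cosine similarity between different documents using Python.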
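As a rough sketch of that last step, the snippet below builds count vectors and the full similarity matrix using only the standard library. It is a simplified stand-in, not the article's code: the tokenizer is a plain regular expression, no stop words are removed, and only Document_A comes from the text above (the other two documents are hypothetical). With scikit-learn, `cosine_similarity` from `sklearn.metrics.pairwise` can be applied to the sparse matrix directly to get a comparable result.

```python
import math
import re
from collections import Counter

# Document_A is from the article; the other two are hypothetical stand-ins.
documents = [
    "Alpine snow winter boots.",   # Document_A (from the article)
    "Snow winter jacket.",         # hypothetical
    "Summer beach sandals.",       # hypothetical
]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def count_vector(text, vocabulary):
    """Word counts of `text`, in the fixed order given by `vocabulary`."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


vocabulary = sorted({word for doc in documents for word in tokenize(doc)})
vectors = [count_vector(doc, vocabulary) for doc in documents]

# Entry [i][j] is the cosine similarity between document i and document j.
similarity_matrix = [[cosine(u, v) for v in vectors] for u in vectors]

for row in similarity_matrix:
    print([round(value, 3) for value in row])
```

The matrix is symmetric with ones on the diagonal (up to rounding). Documents that share no words, such as the first and third here, score 0.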

Applications:

  1. As shown above, cosine similarity can be used in a recommendation engine to suggest similar products, movies, shows, or books.
  2. In information retrieval, TF-IDF weighting combined with cosine similarity is a very common technique for quickly retrieving documents similar to a search query.
  3. Locality-sensitive hashing based on cosine similarity speeds up the matching of DNA sequence data.
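The retrieval use case in item 2 can be sketched as follows. This is a minimal illustration using only the standard library: the corpus, the query and all helper names are hypothetical, and the IDF formula is one common smoothed variant rather than the only possible weighting.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def tfidf_vector(tokens, vocabulary, idf):
    """TF-IDF weights for `tokens`, ordered by `vocabulary`."""
    counts = Counter(tokens)
    return [counts[word] * idf[word] for word in vocabulary]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


# Hypothetical mini-corpus and query, for illustration only.
corpus = [
    "winter boots for alpine snow",
    "lightweight summer sandals",
    "snow shovel for winter driveways",
]
query = "alpine winter boots"

tokenized = [tokenize(doc) for doc in corpus]
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Smoothed inverse document frequency: rarer words get larger weights.
n_docs = len(corpus)
idf = {
    word: math.log((1 + n_docs) / (1 + sum(word in tokens for tokens in tokenized))) + 1
    for word in vocabulary
}

doc_vectors = [tfidf_vector(tokens, vocabulary, idf) for tokens in tokenized]
query_vector = tfidf_vector(tokenize(query), vocabulary, idf)

# Rank document indices from most to least similar to the query.
ranking = sorted(
    range(n_docs),
    key=lambda i: cosine(query_vector, doc_vectors[i]),
    reverse=True,
)
for i in ranking:
    print(round(cosine(query_vector, doc_vectors[i]), 3), corpus[i])
```

The document sharing the most distinctive query words ranks first, and a document with no words in common scores 0. Query words that do not occur in the corpus are simply ignored in this sketch.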

Important links for reference:

  1. Understanding TfidfVectorizer: Scikit-learn
  2. Spatial Distance Cosine: Scipy
  3. ML Book: ML Solutions
