# What is a cosine similarity matrix?

## Cosine similarity and its applications.

Jan 28 · 3 min read

Cosine similarity is a metric for how similar two entities are, irrespective of their size. Mathematically, it is the cosine of the angle between two vectors in a multi-dimensional space, so only the direction of the vectors matters, not their magnitude.

The two vectors could represent two product descriptions, two article titles, or simply two arrays of words.

Mathematically, if 'a' and 'b' are two vectors, the cosine of the angle between them is their dot product divided by the product of their magnitudes.
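For reference, this is the standard definition written out for the vectors a and b, where n (a symbol introduced here, not in the original text) is the assumed number of dimensions:

$$
\cos(\theta) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}
$$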

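The cosine ranges from -1 to 1 and captures different aspects of similarity: a value near 1 means the two vectors point in nearly the same direction (similar), a value near 0 means they are orthogonal (unrelated), and a value near -1 means they point in opposite directions.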
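As a rough illustration, the three textbook cases (same direction, orthogonal, opposite) can be checked with a few lines of plain Python. The helper name `cosine` and the sample vectors are made up for this sketch; note that the first pair differs in length by a factor of ten yet still counts as maximally similar.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical sample vectors, chosen only to illustrate the three cases.
same_direction = cosine([1, 2], [10, 20])   # same direction, different magnitudes
orthogonal = cosine([1, 0], [0, 1])         # perpendicular vectors
opposite = cosine([1, 2], [-1, -2])         # opposite directions

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```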

Let us see how we can compute this using Python. We have the following five texts:

```python
# Define documents
documents = [
    "Alpine snow winter boots.",   # Document A
    "Snow winter jacket.",         # Document B
    "Active swimming briefs.",     # Document C
    "Active running shorts.",      # Document D
    "Alpine winter gloves.",       # Document E
]
```

These could be product descriptions from a web catalog such as Amazon's. To compute the cosine similarity, you first need the count of each word in each document. scikit-learn's `CountVectorizer` or `TfidfVectorizer` can turn the documents into such vectors.
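To make the word-count step concrete, here is a minimal sketch that builds the same kind of document-term matrix with only the standard library. The `tokenize` helper is a simplification written for this example and is not scikit-learn's tokenizer.

```python
import re
from collections import Counter

documents = [
    "Alpine snow winter boots.",
    "Snow winter jacket.",
    "Active swimming briefs.",
    "Active running shorts.",
    "Alpine winter gloves.",
]

def tokenize(text):
    # Lowercase and keep alphabetic runs (a simplifying assumption).
    return re.findall(r"[a-z]+", text.lower())

# One Counter of word counts per document.
counts = [Counter(tokenize(doc)) for doc in documents]

# Sorted list of every distinct word across all documents.
vocabulary = sorted(set().union(*counts))

# One row per document, one column per vocabulary word.
doc_term_matrix = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
for row in doc_term_matrix:
    print(row)
```

Each row is the vector that stands in for one document; the cosine similarity is then computed between these rows.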

```python
# scikit-learn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Create the document-term matrix
# (`documents` is the list of the five texts defined earlier).
count_vectorizer = CountVectorizer(stop_words='english')
sparse_matrix = count_vectorizer.fit_transform(documents)

# Similarity of the first document ("Alpine snow winter boots")
# to every document in the set, itself included:
cosine_similarity(sparse_matrix[0:1], sparse_matrix)
# roughly: array([[1.0, 0.577, 0.0, 0.0, 0.577]])
```

This is how we can find cosine similarity between different documents using Python.
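The call above yields a single row of similarities. A full cosine similarity matrix holds that value for every pair of documents. The following is a standard-library sketch of that idea, not the scikit-learn implementation; the helpers `term_counts` and `cosine` are names invented for this example.

```python
import math
import re
from collections import Counter

documents = [
    "Alpine snow winter boots.",
    "Snow winter jacket.",
    "Active swimming briefs.",
    "Active running shorts.",
    "Alpine winter gloves.",
]

def term_counts(text):
    # Word counts for one document (simple lowercase alphabetic tokens).
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(c1, c2):
    # Dot product over shared words divided by the product of the norms.
    dot = sum(c1[word] * c2[word] for word in c1)
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

counts = [term_counts(doc) for doc in documents]

# Entry [i][j] is the similarity between document i and document j.
similarity_matrix = [[cosine(a, b) for b in counts] for a in counts]

for row in similarity_matrix:
    print(" ".join(f"{value:.2f}" for value in row))
```

The matrix is symmetric with ones on the diagonal, since every document is identical to itself. With scikit-learn, calling `cosine_similarity(sparse_matrix)` with a single argument returns the same kind of pairwise matrix.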

Applications:

1. As shown above, cosine similarity can power a recommendation engine that suggests similar products, movies, shows or books.
2. In information retrieval, combining TF-IDF weighting with cosine similarity is a very common technique for quickly retrieving documents similar to a search query.
3. Locality-sensitive hashing based on cosine similarity speeds up the matching of DNA sequence data.
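The retrieval use case can be sketched on the same five product texts. This is a toy version under stated assumptions: the weighting `count * (ln(N / df) + 1)` is one common TF-IDF variant and not necessarily the exact formula `TfidfVectorizer` uses, and the `search` helper and the query string are hypothetical.

```python
import math
import re
from collections import Counter

documents = [
    "Alpine snow winter boots.",
    "Snow winter jacket.",
    "Active swimming briefs.",
    "Active running shorts.",
    "Alpine winter gloves.",
]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

doc_tokens = [tokenize(doc) for doc in documents]
n_docs = len(documents)

# Document frequency: number of documents that contain each word.
doc_freq = Counter(word for tokens in doc_tokens for word in set(tokens))

def tfidf(tokens):
    # Assumed weighting: count * (ln(N / df) + 1); unknown words are ignored.
    tf = Counter(t for t in tokens if t in doc_freq)
    return {w: c * (math.log(n_docs / doc_freq[w]) + 1.0) for w, c in tf.items()}

def cosine(u, v):
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

doc_vectors = [tfidf(tokens) for tokens in doc_tokens]

def search(query):
    # Score every document against the query, best match first.
    query_vector = tfidf(tokenize(query))
    scores = [cosine(query_vector, dv) for dv in doc_vectors]
    return sorted(zip(scores, documents), reverse=True)

for score, doc in search("snow winter"):
    print(f"{score:.3f}  {doc}")
```

Documents that share no word with the query score zero, and among the matches the shorter, more focused text ranks above the longer one.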

