Building a recommendation system

Content-based recommendation using cosine similarity

Joshua Szymanowski
6 min read · Sep 3, 2020

Move over GoodReads! There’s a disruptor in town, known as PO-REC, your friendly neighborhood poem-recommending robot. (Just kidding, GoodReads. Feel free to make PO-REC an offer).

The preparation

First, you’ll need to get your data in good order. Basically all you’ll need is a DataFrame with all numerical values and without any NaN values. For my project I used seven features I engineered based on the structure and form of poems, as well as 100-dimensional document vectors created using a Doc2Vec model.

Refer to my previous article for some nitty gritty details about the project. For PO-REC, the seven features I used were:

  • Length of poem (number of lines)
  • Length of line (average number of words per line)
  • Sentiment, polarity score (a measure of positivity, negativity, or neutrality)
  • Sentiment, subjectivity score (a measure of…subjectivity)
  • End rhyme ratio (number of end rhymes to number of lines)
  • Word complexity (average number of syllables per word)
  • Lexical richness (unique words divided by total words)

So I ended up with 107-dimensional “poem vectors”. This may sound more complicated than it actually is — any dataset that happens to have 107 numerical features (i.e. columns in a DataFrame) could be considered a group of data points with 107-dimensional vectors.
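To make the feature list above concrete, here is a minimal sketch of how a couple of those structural features might be computed for a single poem. The function name, feature names, and sample poem are illustrative only, not the project's actual code.

```python
def poem_features(poem: str) -> dict:
    """Compute a few simple structural features for one poem."""
    lines = [line for line in poem.strip().split("\n") if line.strip()]
    words = poem.split()
    return {
        # Length of poem: number of (non-empty) lines
        "num_lines": len(lines),
        # Length of line: average number of words per line
        "avg_words_per_line": len(words) / len(lines),
        # Lexical richness: unique words divided by total words
        "lexical_richness": len(set(w.lower() for w in words)) / len(words),
    }

poem = "Roses are red\nViolets are blue\nSugar is sweet\nAnd so are you"
feats = poem_features(poem)
print(feats["num_lines"])           # 4
print(feats["avg_words_per_line"])  # 3.25
```

Each poem then becomes one row of numbers; stacking these rows (plus the Doc2Vec dimensions) gives the DataFrame of "poem vectors" described below.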

But why all this talk of vectors? Well, it’s important for a pillar of content-based recommendation systems: cosine similarity.

Cosine similarity

Cosine similarity is simply a measure of the angle between two vectors. A smaller angle results in a larger cosine value. Thus, the smaller the angle, the more similar the vectors. The image below gives you a sense of what different angles mean in terms of similarity.

[Image: three pairs of vectors at a small, right, and wide angle (source)]

Cosine values range from -1 to 1. Vectors that run in completely opposite directions of each other (a 180-degree angle) have a value of -1; and vectors that run in the same exact direction (a 0-degree angle) have a value of 1. In the image above, the left-most angle has a cosine value close to 1, the middle a value around 0, and the right-most a value nearing -1.

But what about magnitude?

I know what you’re thinking: this doesn’t take magnitude into account. And while that’s true, for many use cases (and for most recommendation systems) magnitude typically doesn’t matter as much as direction. I recall an example of comparing three grocery orders. Person A buys 1 egg, 1 grapefruit, and 1 steak. Person B buys 1 tofu slab (extra firm), 1 bag of chips, and 1 bag of rice. Person C buys 100 tofu slabs (extra firm), 100 bags of chips, and 100 bags of rice. And the question is: whose orders are more similar?

Cosine similarity will tell you that Person B and Person C are as similar as it can get. They have the same angle and cosine value of 1.
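You can verify this directly. Below, each order is a vector of item counts (the column order is my own choice for illustration):

```python
from sklearn.metrics.pairwise import cosine_similarity

# Columns: egg, grapefruit, steak, tofu, chips, rice
A = [1, 1, 1, 0, 0, 0]        # Person A: small omnivore order
B = [0, 0, 0, 1, 1, 1]        # Person B: small vegan order
C = [0, 0, 0, 100, 100, 100]  # Person C: huge vegan order

sims = cosine_similarity([A, B, C])
print(round(sims[1, 2], 3))  # B vs C -> 1.0 (identical direction)
print(round(sims[0, 1], 3))  # A vs B -> 0.0 (no items in common)
```

B and C point the same way, so cosine similarity calls them identical; A shares nothing with either, so it scores 0.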

This sad old man —

Euclid (o.g. sadboi)

— will tell you that the meat-eater and the vegan have much more similar orders, because they ordered the same number of things. In other words, the endpoints of the A and B vectors are far closer together than the endpoints of the B and C vectors. If one measures similarity using Euclidean distance, one favors magnitude over direction. It can easily be argued that both matter, so it really does depend on your use case.
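Running the numbers on the same three orders makes Euclid's verdict explicit:

```python
import numpy as np

# Same grocery vectors as before (columns: egg, grapefruit, steak, tofu, chips, rice)
A = np.array([1, 1, 1, 0, 0, 0])
B = np.array([0, 0, 0, 1, 1, 1])
C = np.array([0, 0, 0, 100, 100, 100])

d_AB = np.linalg.norm(A - B)  # sqrt(6)  ~ 2.45
d_BC = np.linalg.norm(B - C)  # 99*sqrt(3) ~ 171.47
print(d_AB < d_BC)  # True: by distance, A and B look far more alike
```

So the two metrics disagree completely on this example: cosine similarity pairs B with C, Euclidean distance pairs A with B.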

In my case, as is true of most text-based projects, direction matters more. For example, a short poem about death is more similar to a long poem about death than it is to a short poem about water. That said, you can include measures of magnitude within each vector, as I described above, which will change the angle of that vector accordingly.

Implementation

While it is difficult to describe (and impossible to envision) the angle between two 107-dimensional vectors, one can still calculate the cosine of that angle using the dot product. This is easily done with scikit-learn’s cosine_similarity function:

from sklearn.metrics.pairwise import cosine_similarity

similarities = cosine_similarity(your_dataframe)

This will return a cosine similarity matrix, which is basically an array of arrays with each text’s similarity to all other texts. The shape of the matrix will be the length of your DataFrame by the length of your DataFrame. It’s important to note that this includes each text’s similarity with itself, which will always equal 1. If you want to return a list of most similar items, you’d most likely want to exclude that value.
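A quick sketch on dummy data confirms the shape and the all-ones diagonal (the 5-row random DataFrame is mine, standing in for the real poem vectors):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
X = rng.random((5, 107))  # 5 stand-in "poems", 107 features each

sims = cosine_similarity(X)
print(sims.shape)                        # (5, 5): every text vs every text
print(np.allclose(np.diag(sims), 1.0))   # True: each text vs itself is 1.0
```

That diagonal of 1s is exactly the self-similarity you would exclude when building a "most similar" list.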

A problem emerges

Despite the fact that the cosine_similarity function runs in an instant, the resulting matrix can get rather large rather quickly. If you’re trying to host a recommendation system on something like Heroku, as I was, you can’t exactly upload a several hundred megabyte file. So for me, it was best to calculate the cosine similarity one text at a time, as needed. The code for that was something more like:

similarities = enumerate(cosine_similarity(
    your_dataframe.iloc[text_id].values.reshape(1, -1),  # one text, as a 2D row
    your_dataframe)[0]                                   # compared to all texts
)

Disregarding the enumerate portion for now: you use one text as an input and compare it to all of the texts. iloc grabs the desired text, values converts it into an array, and reshape gets it into the desired shape. Pulling a single row from a DataFrame results in a one-dimensional array, but cosine_similarity expects a two-dimensional array (samples by features), so reshape(1, -1) turns your vector into a 2D array with a single row and as many columns as needed. The second input is simply the entire DataFrame, or whatever you wish to compare the text to. Indexing with [0] returns the similarities as a single list, as opposed to a nested list with a length of 1.

Why enumerate? The enumerate portion is indeed optional, but because I needed to sort from most to least similar and return a specific poem, I needed to make sure I tracked which poem corresponded to which similarity measure, and enumerate provides that index number (assuming you have indices that range from 0 to the length of your DataFrame).

Sorting hint
To sort the enumerate function’s resulting tuples by the cosine value (rather than the index number), you can use the following code:

from operator import itemgetter

similar_texts = sorted(similarities,
                       key=itemgetter(1),
                       reverse=True)

You can also use a lambda, but from what I’ve read, itemgetter is faster. Lastly, reverse=True simply sorts in descending order.
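Putting the pieces together, here is one way the whole pipeline might be wrapped into a helper. The function name, toy data, and default arguments are mine, a sketch rather than the project's actual code:

```python
from operator import itemgetter

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def recommend(df: pd.DataFrame, text_id: int, n: int = 5) -> list:
    """Return the n (index, similarity) pairs most similar to text_id."""
    sims = enumerate(cosine_similarity(
        df.iloc[text_id].values.reshape(1, -1), df)[0])
    ranked = sorted(sims, key=itemgetter(1), reverse=True)
    # Drop the query text itself (its self-similarity of 1.0 tops the list).
    return [(i, s) for i, s in ranked if i != text_id][:n]

# Toy demo on random stand-in "poem vectors"
rng = np.random.default_rng(42)
df = pd.DataFrame(rng.random((10, 107)))
top = recommend(df, text_id=0, n=3)
```

Each element of top is an (index, cosine value) pair, sorted from most to least similar, with the query poem excluded.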

Recommend away!

With the sorted list of cosine similarities, you can build your recommendation system. Give your users the top 10, top 20, or top 100 most similar data points, or allow the user to put in their own value, as I did.

And for Python-based coders, Streamlit is your friend. It’s incredibly easy to use, and even though it lacks some customizability, it is perfect for a quick demo.

Project repo

You can check out my Heroku deployment of PO-REC, or look at my project repo on GitHub (where you can clone and run the app locally):
https://github.com/p-szymo/poetry_genre_classifier

Python In Plain English

