Building Book Recommendation System

Word2vec model in Content-based Recommendation

ashok .c
8 min read · Dec 10, 2021
Photo by freestocks on Unsplash

Recommender systems are everywhere in today's world of immense data. Almost every big company in the e-commerce and entertainment industries has integrated recommender systems into its websites and apps so that customers can find their favorite products, movies, and music with ease.

Content-based filtering and collaborative filtering are the two popular approaches to recommendation systems. In this blog, I walk you through my end-to-end project on a book recommendation system that uses the word2vec model in a content-based recommender, built on a scraped Goodreads dataset that I found on Google.

Before that, here is the link to my web app for the book recommendation system: Link

CONTENT-BASED RECOMMENDER SYSTEM

A content-based recommendation system recommends items to a user by looking at how similar other items are to the items the user already likes. It measures the similarity between the products a customer likes and the remaining products, and finally selects the products that are most similar to the user's favorites.

A similarity measure is a mathematical formula for calculating the distance between two points, and it applies equally well to points in many dimensions. Several similarity measures are available; EUCLIDEAN DISTANCE was used in our book recommendation project. For more details on similarity measures, please refer to this article.

Have a look at the dataset shown below. In the Goodreads dataset, there is a feature named Desc that holds a brief summary of each book. This is an important feature because it acts as the content used to measure the similarity between books.

The next important step is to convert this text feature (Desc) into a numerical vector, because only then can we calculate similarity. This is known as WORD EMBEDDING in natural language processing. Before that, the text feature needs several preprocessing steps, such as removing stopwords and punctuation, lemmatization, and converting everything to lower case. Below is the code for preprocessing the text feature.
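Here is a minimal preprocessing sketch using NLTK (the library choice, the helper name preprocess, and the column names desc/cleaned_desc are assumptions for illustration, not necessarily the exact code used in the project):

```python
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads for the stopword list and WordNet lemmatizer data
nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    """Lower-case, strip punctuation, drop stopwords, and lemmatize a description."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [lemmatizer.lemmatize(w) for w in text.split() if w not in stop_words]
    return " ".join(tokens)

# Example usage on the Goodreads dataframe:
# df["cleaned_desc"] = df["desc"].astype(str).apply(preprocess)
```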

WORD EMBEDDING

There are many approaches for converting text into numerical vectors. The most popular are binary bag-of-words, count-based bag-of-words, TF-IDF, and the word2vec model. In this project, I have used the word2vec model, which performs better than the other approaches.

The problem with the bag-of-words and TF-IDF models is that they do not capture the semantic meaning of text. Bag-of-words is just a count of words, and TF-IDF is just a mathematical formula multiplying term frequency by inverse document frequency; neither captures semantics, so similar words show no correlation in their representations. These models give two completely different numeric vectors for two words that have the same meaning, and they also produce very sparse matrices.

WORD2VEC

Word2vec is one of the significant breakthroughs in word embedding because it uses a neural network architecture. Given a word, it returns a dense vector, and words with similar meanings get similar vectors. This is exactly what we want in this project.

Before we look at its architecture, we need to understand the core idea of context words and focus words. Consider an example sentence: "the cat sat on the mat".

In the above sentence, if we take the word “sat” as the focus word, then the remaining words around it are the context words. If we take “cat” as the focus word, then the words around “cat” are the context words. Any word in a sentence can be taken as a focus word. The reason for this segregation is that context words are most useful for understanding the focus word, and vice versa: if the model understands the context words, it should be able to predict what the focus word is. The number of context words to consider is fixed and is called the context window size. If the context window size is 5, then only the five surrounding words of the focus word are used.

Word2vec has two model architecture variants: 1) Continuous Bag-of-Words (CBOW) and 2) Skip-gram. The two variants differ slightly in architecture and also in their core idea.

CBOW

The core idea of the CBOW model is: given the context words, can the model predict the focus word? Below is the CBOW model architecture.

Photo by analyticsindiamagazine

HOW DOES TRAINING A CBOW MODEL WORK?

In the Goodreads dataset, we use only the text features, and the most important one is the description column (desc) of each book. From this description corpus, we need to create a separate training dataset that contains the focus word in one column and the corresponding context words in another. Each row of the description column yields many (context words, focus word) combinations, and this is repeated for every row of the description data. This is how the dataset for training the CBOW model is created: the focus-word column is the target variable and the context-words column is the input variable. A toy sketch of generating such pairs is shown below.
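As a simplified illustration (not the actual training pipeline), the (context words, focus word) pairs can be generated by sliding a window over each tokenized description:

```python
def context_focus_pairs(tokens, window=2):
    """Yield (context_words, focus_word) pairs for CBOW-style training data."""
    for i, focus in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, focus

sentence = "the cat sat on the mat".split()
for context, focus in context_focus_pairs(sentence, window=2):
    print(context, "->", focus)
# ['cat', 'sat'] -> the
# ['the', 'sat', 'on'] -> cat
# ...
```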

Let’s consider ‘V’ to be the total number of unique words in the dataset. We convert each word into a numeric vector using one-hot encoding, so each word is represented by a V-dimensional vector.
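For instance, a one-hot encoding over a tiny vocabulary might look like this (purely illustrative; V in the real corpus is far larger):

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]   # toy vocabulary, V = 5
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Return a V-dimensional vector with a 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("cat"))   # [0. 1. 0. 0. 0.]
```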

Below is the CBOW model architecture. It is a shallow neural network model with a single hidden layer.

Photo by Lil’Log

The input layer takes the k context words of each row, where “k” is the window size, and the output is the focus word. Each context word is a V-dimensional vector. The middle layer is a single hidden layer with “N” neurons, where “N” is a hyperparameter equal to the number of dimensions we want for each word. If we want a 200-dimensional vector for each word, then “N” should be 200.

The activation function used in the hidden layer is a linear activation that simply combines the weighted inputs from the input layer. The output layer uses a softmax function, because predicting the focus word is nothing but a multi-class classification problem: we are trying to predict one word out of a vocabulary of “V” words. So the output layer has V neurons, one for each word in the vocabulary.
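To make the architecture concrete, here is a minimal Keras sketch of this shallow network (an illustrative reconstruction under the assumptions above, not the implementation used by gensim or in this project): the K one-hot context vectors are summed, projected linearly to N dimensions, and passed to a softmax over the V-word vocabulary.

```python
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers

V = 10_000  # vocabulary size (assumed for illustration)
N = 200     # hidden/projection layer size = desired embedding dimension
K = 5       # context window size

# Input: K context words, each a V-dimensional one-hot vector
context_in = keras.Input(shape=(K, V))

# Combine the context vectors, then project linearly to N dimensions
combined = layers.Lambda(lambda x: tf.reduce_sum(x, axis=1))(context_in)
projection = layers.Dense(N, activation="linear")(combined)

# Softmax over the vocabulary predicts the focus word
focus_out = layers.Dense(V, activation="softmax")(projection)

model = keras.Model(context_in, focus_out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
model.summary()
```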

At the end of training, we have the learned weights of each layer. The weight matrix between the projection (hidden) layer and the output layer has size N*V, so for any given word there is a corresponding column vector in this N*V weight matrix.

So, by using this intermediate weight matrix, any given word maps to a corresponding column vector of size “N”, and we can use this column as the numeric vector representation of that word.

The numeric vectors come out similar for similar words. This is how CBOW works.

To train CBOW well, we need a large corpus of data; only then will word2vec be effective. In this project, we don’t have a large enough corpus to train a word2vec model from scratch, so the alternative is to use the GOOGLE PRETRAINED WORD2VEC MODEL. Similarly, many other pre-trained word embeddings trained on huge corpora are available from groups such as Stanford NLP and Facebook.

Pre-trained word2Vec using Gensim library

We can download and load the Google pre-trained word2vec model using the gensim library, as shown in the code below.
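A sketch of loading the pre-trained vectors with gensim; the file path is a placeholder for wherever the GoogleNews binary was downloaded:

```python
from gensim.models import KeyedVectors

# Load the pre-trained GoogleNews vectors (300 dimensions) from a local copy.
w2v_model = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True
)

# Alternatively, gensim's downloader can fetch the same model:
# import gensim.downloader as api
# w2v_model = api.load("word2vec-google-news-300")

print(w2v_model.vector_size)  # 300
```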

Its file size is 3.3 GB, and once loaded into memory it occupies about 9 GB of RAM, so you need at least 12 GB of RAM to process it.

HOW TO RECOMMEND BOOKS USING WORD2VEC?

Given a word, the pre-trained word2vec model returns a dense numeric vector of 300 dimensions. The model is essentially a large collection of key-value pairs, where the keys are the words in the training corpus and the values are their corresponding word vectors.
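For example, assuming the model loaded above is named w2v_model, a single word can be looked up like a dictionary key:

```python
vector = w2v_model["book"]      # 300-dimensional dense vector for the word "book"
print(vector.shape)             # (300,)
print(w2v_model.most_similar("book", topn=3))  # semantically close words
```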

AVERAGE WORD2VEC

In our project, we need to convert each book description into a numeric vector and then find the similarity between these vectors to recommend books. Each book description contains many sentences, each sentence is a sequence of words, and each word has its own numeric vector. So we need to combine the many word vectors into a single 300-dimensional vector per book description. To do that, we simply add all the word vectors of a description and divide the result by the number of words in that description. This method is called Average Word2vec.

Now each book description is represented by a single 300-dimensional numeric vector, and finding the similarity between these vectors is simple. Similarity measures such as EUCLIDEAN DISTANCE or COSINE SIMILARITY can be used.

Let’s take a random description example from our dataset.

Book title: The Four Pillars of Investing

Book description (after preprocessing): william bernstein american financial theorist neurologist research field modern portfolio theory research financial books individual investors wish manage equity field lives portland oregon

As I said earlier, we have to convert the above description into a vector. Word2vec takes a word and gives a d-dimensional vector, so first we need to split the sentence into words and find the vector representation of each word.

The above example has 23 words. Let’s denote them as w1, w2, w3, … w23 (w1 = william, w2 = bernstein, …, w23 = oregon). Calculate the word2vec vector for each of the 23 words, then sum all the vectors and divide the result by the total number of words in the description (n).

This is how we calculate the average word2vec. In the same way, the other book descriptions can be converted into vectors. The code is given below.
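A sketch of this averaging step, assuming the cleaned descriptions and the w2v_model loaded earlier (the helper name avg_word2vec and the column names are illustrative):

```python
import numpy as np

def avg_word2vec(description: str, model, dim: int = 300) -> np.ndarray:
    """Average the word2vec vectors of all in-vocabulary words in a description."""
    words = description.split()
    vectors = [model[w] for w in words if w in model]
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)

# Build one 300-dimensional vector per book description:
# df["avg_vector"] = df["cleaned_desc"].apply(lambda d: avg_word2vec(d, w2v_model))
```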

Top 5 Recommendations using Average Word2Vec

Below are screenshots of the word2vec-based book recommendation web app, which recommends books based on the description of the book the user selects.

If the book title “THE FOUR PILLARS OF INVESTING” is selected, its description is converted into a numeric vector using average word2vec, and similarities are computed against the vectors of all the other books. The books with the closest similarity are recommended.
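A simplified sketch of that lookup using cosine similarity from scikit-learn (the dataframe and the column names avg_vector and title are placeholders):

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def recommend(title: str, df, top_n: int = 5):
    """Return the top_n book titles whose average vectors are closest to the given book's."""
    vectors = np.vstack(df["avg_vector"].values)                    # shape (num_books, 300)
    idx = np.where(df["title"].str.lower() == title.lower())[0][0]  # position of the query book
    scores = cosine_similarity(vectors[idx].reshape(1, -1), vectors).ravel()
    ranked = scores.argsort()[::-1]
    ranked = [i for i in ranked if i != idx][:top_n]                # drop the query book itself
    return df.iloc[ranked]["title"].tolist()

# Example: recommend("The Four Pillars of Investing", df)
```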

Here is the link for my web app on the book recommendation system. Link

Github repo for this project: link

Demo video of this project: link

Thanks for reading.
