Building A Recommender System (For Books)
What would happen if you got Harry Potter as a result when searching Amazon for a book on investing or finance? Terrible match, right?
You talk to your friend, who suggests posting your question on the subreddit r/booksuggestions. One problem is that most people in the same position don’t get a relevant answer, have to wait a while for one, or never get an answer at all.
We ask questions about products that interest us all the time, but do we always receive relevant answers? What if, as we typed a question, we were shown topics, words, or products specific to our interests? Or if, after posting it, we received a notification with relevant answers? Cool, right?
As data scientists, let’s build a recommender system for books. More specifically, when someone asks for a recommendation on the subreddit r/booksuggestions, can we recommend books that other people have read or suggested, or books similar to them?
We will do this using natural language processing and neural networks.
Part 1: Data Collection and Cleaning
Our work is in two parts: first, I collect and clean the subreddit posts through api.pushshift.io; second, I explore and model the data. The data collected runs from July 25, 2010, to March 30, 2021. This amounts to 109,122 posts, or about 37 million characters. Depending on the type of analysis we do (tokenizer, TF-IDF, count vectorizer…), this can take up all our memory very quickly: even a Google Colab Pro runtime with a hardware accelerator (GPU or TPU) and high RAM (between 27GB and 32GB) crashed multiple times. The following is a screenshot of the code I used to access, download, and merge 1,092 batches of posts. In this part, I use the pandas, os, requests, and re libraries.
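For readers who want to try something similar, here is a minimal sketch of batch-by-batch collection. It is not the code from the screenshot: the endpoint path, the parameter names (`subreddit`, `after`, `before`, `size`), and the helper names `fetch_page` and `collect_posts` are my assumptions and should be checked against the current Pushshift documentation.

```python
from typing import Callable, Dict, List

# Assumed endpoint; verify against the Pushshift docs before relying on it.
PUSHSHIFT_URL = "https://api.pushshift.io/reddit/search/submission/"


def fetch_page(params: Dict) -> List[Dict]:
    """Fetch one batch of submissions (performs a real HTTP request)."""
    import requests  # imported lazily so the paging logic can be tested offline

    response = requests.get(PUSHSHIFT_URL, params=params, timeout=30)
    response.raise_for_status()
    return response.json().get("data", [])


def collect_posts(
    subreddit: str,
    after: int,
    before: int,
    size: int = 100,
    fetch: Callable[[Dict], List[Dict]] = fetch_page,
) -> List[Dict]:
    """Walk backwards in time from `before` to `after`, one batch at a time."""
    posts: List[Dict] = []
    while True:
        batch = fetch(
            {"subreddit": subreddit, "after": after, "before": before, "size": size}
        )
        if not batch:
            break
        posts.extend(batch)
        # Move the upper bound to the oldest post seen so far, so the next
        # request returns strictly older posts.
        before = min(post["created_utc"] for post in batch)
    return posts
```

The `fetch` argument is injectable so the paging loop can be exercised with fake data. Note that posts sharing the exact timestamp of a batch boundary could be skipped with this simple scheme.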
Once we have put the data into a data frame, we keep only the columns we’d like to work with and create a new column (“text”) that combines the titles (“title”) and post bodies (“selftext”). Any rows with neither a body nor a title will have missing values (NaN), which we need to address. Here is the output:
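As a rough illustration of that step, the sketch below builds the combined column on a tiny made-up data frame. The sample rows and the extra “score” column are invented for the example, and filling missing values with empty strings is only one way to handle them.

```python
import pandas as pd

# Toy data standing in for the downloaded posts (values are made up).
posts = pd.DataFrame(
    {
        "title": ["Books on investing?", "Need a thriller", None],
        "selftext": ["Looking for beginner finance books.", None, None],
        "score": [3, 1, 0],
    }
)

# Keep only the columns of interest.
posts = posts[["title", "selftext"]].copy()

# Replace missing values with empty strings before concatenating, so a
# post with a title but no body still produces usable text.
posts["text"] = (
    posts["title"].fillna("") + " " + posts["selftext"].fillna("")
).str.strip()

# Rows with neither a title nor a body are now empty strings; drop them.
posts = posts[posts["text"] != ""].reset_index(drop=True)
```

Without the `fillna("")` calls, concatenating a string with NaN yields NaN, so a post with a title but an empty body would lose its text entirely.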
From there, we are ready to start cleaning our text. It’s essential to know that there is no one-size-fits-all approach. The key is to look at some posts and find patterns. For instance, we have a lot of newline escape sequences (“\n”). By removing just those, we detect more entities (864 more books) in our posts. On the other hand, by eliminating special characters (keeping only numbers and letters), we detect fewer entities (3,041 fewer books).
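The two cleaning steps compared above could look roughly like this. The function names are hypothetical, and the exact patterns are assumptions about what “newline escape sequences” and “special characters” cover in this data.

```python
import re


def remove_newlines(text: str) -> str:
    """Replace newline escape sequences with spaces and collapse whitespace."""
    text = text.replace("\\n", " ")   # literal backslash-n left in scraped text
    text = re.sub(r"\s+", " ", text)  # real newlines, tabs, repeated spaces
    return text.strip()


def keep_alphanumeric(text: str) -> str:
    """More aggressive step: keep only letters, digits, and spaces."""
    return re.sub(r"[^A-Za-z0-9 ]+", "", text)
```

The second function strips apostrophes, colons, and commas, which changes how titles look (“Ender's Game: what next?” becomes “Enders Game what next”). That is one plausible reason the entity detector finds fewer books after it.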
Part 2: Exploratory Data Analysis and Modeling
The second part uses the NLTK, Gensim, and spaCy libraries to process and understand the large volume of text that we have. In this part, how well our text is cleaned determines the results that we get. Let’s now use our data to create tokens (vectors of the words used within each post) and then tag them. Here is the raw markup of our entity detection on one post.
Once we have each entity, we can filter out books and count their occurrences in each post. spaCy tags books with the “WORK_OF_ART” label (which also covers titles of songs and similar works), as shown below. One thing to note is misclassification. Some entities were misclassified, especially between books, persons, and organizations (companies, agencies, institutions). We can imagine a blurry line between entities that contain other entities. For instance, “The Poisonwood Bible,” which is a book by Barbara Kingsolver, was split into two books: “The Poisonwood” and “Bible.” A good pattern for identifying an entity here is capitalization. Whenever a capitalized word follows another capitalized word without any punctuation in between, that sequence of words should be part of the same entity.
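Both ideas from this paragraph can be sketched with the standard library alone. The entity pairs below stand in for the `(ent.text, ent.label_)` values you would read from spaCy’s `doc.ents`. The helper names and the regular expression are my own assumptions, not the exact rule used in this project.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Two or more capitalized words separated by single spaces, no punctuation.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][A-Za-z']*(?: [A-Z][A-Za-z']*)+")


def capitalized_runs(text: str) -> List[str]:
    """Return sequences of consecutive capitalized words as candidate entities."""
    return CAPITALIZED_RUN.findall(text)


def top_books(entities: Iterable[Tuple[str, str]], n: int = 10) -> List[Tuple[str, int]]:
    """Count entities labeled WORK_OF_ART and return the n most frequent."""
    books = (text for text, label in entities if label == "WORK_OF_ART")
    return Counter(books).most_common(n)
```

On the sentence “I loved The Poisonwood Bible, by Barbara Kingsolver.”, `capitalized_runs` keeps “The Poisonwood Bible” together. It also returns “Barbara Kingsolver”, so the heuristic finds candidate entities but cannot tell a book from a person. It would also miss titles containing lowercase words such as “of” or “the”.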
Here are our top 10 books:
Fewer than 13% of our posts mention a book. Still, let’s use those as the basis for our recommender. For that, we use the continuous skip-gram model from Word2Vec, a two-layer neural network that takes in a word (one of the tokens we’ve created) and predicts the context words around it as output. This is where we train our model to understand the structure and content of the text and develop a vocabulary; in other words, it learns the relationships between words by looking at their positions and how often they occur together within each sentence. Similar words tend to be used around the same words. Thus, when words are represented as vectors, similar words point in roughly the same direction. A common way to measure that similarity is the cosine similarity score, which is the cosine of the angle between two vectors (in this case, word vectors) and ranges from -1 to 1. The smaller the angle, the more similar the words and the higher the cosine similarity score. Here is an excellent resource on cosine similarity scores if you’d like to read more on it.
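To make the skip-gram idea concrete, here is a small sketch of how (target word, context word) training pairs can be generated from a tokenized sentence. It only illustrates the input/output structure of the model, not Gensim’s implementation. The function name and the `window` default are assumptions.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Pair every token with each neighbor at most `window` positions away."""
    pairs = []
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((target, tokens[j]))
    return pairs
```

For `["read", "the", "bible"]` with `window=1`, this yields `("read", "the")`, `("the", "read")`, `("the", "bible")`, and `("bible", "the")`. In Gensim, the skip-gram variant is selected by passing `sg=1` to `Word2Vec`.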
Let’s now query our model as an example by typing in “Bible”:
The most similar word to “Bible” is “Quran,” and our cosine similarity score is 0.76.
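A score like this one is the cosine of the angle between the two word vectors. A minimal pure-Python sketch of the computation follows; the function name is mine, and in practice the library computes this for you.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Vectors pointing the same way score 1 regardless of their length, perpendicular vectors score 0, and vectors pointing in opposite directions score -1.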
Since about 87% of our posts do not mention any books, let’s use our newly trained model to recommend some books to the people who wrote them.
Let’s test our model by using a post without any books.
Our model returns the books “The Twilight Zone,” “The Ritual,” and “Teeth” with scores of 0.989369, 0.988831, and 0.988466, respectively.
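One plausible way to produce such a ranking, which may differ from what was actually done here, is to average the vectors of the words in the post and compare that average against a vector for each book. Everything in this sketch is an assumption for illustration: the helper names, the averaging strategy, and especially the toy two-dimensional vectors and placeholder titles.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def average_vector(tokens: List[str], vectors: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of the tokens that are in the vocabulary."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token of the post is in the vocabulary")
    dim = len(known[0])
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]


def recommend(
    tokens: List[str],
    word_vectors: Dict[str, List[float]],
    book_vectors: Dict[str, List[float]],
    topn: int = 3,
) -> List[Tuple[str, float]]:
    """Rank books by cosine similarity to the averaged post vector."""
    post_vector = average_vector(tokens, word_vectors)
    scored = [(title, cosine(post_vector, vec)) for title, vec in book_vectors.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:topn]


# Hypothetical toy vectors; a trained model would supply real ones.
word_vectors = {"scary": [1.0, 0.0], "horror": [0.9, 0.1], "money": [0.0, 1.0]}
book_vectors = {"Book A": [1.0, 0.1], "Book B": [0.7, 0.7], "Book C": [0.0, 1.0]}
ranking = recommend(["scary", "horror", "recommendations"], word_vectors, book_vectors)
```

Words missing from the vocabulary (here “recommendations”) are simply skipped, so a post made up entirely of unseen words cannot be scored.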
Conclusion & Recommendation
Good news! We’ve successfully made our first recommender system. The main takeaways are:
- The model is indeed able to recommend books based on context with high similarity scores (more than 0.98)
- Be careful with preprocessing methods (how the data is cleaned)
Some use cases of our model are:
- Giving relevant answers to queries (recommending movies on Netflix or videos on YouTube)
- Prompting conversations around relevant subjects
An extension of this work could be to try other machine learning and neural network techniques with different parameters (a neural network with more layers, for instance) to build a more complex model for other cool projects. Let me know if you have any suggestions or comments.