Ronald Asseko
Apr 22

Building A Recommender System (For Books)

Meet Nandu

What would happen if you got Harry Potter as a result when looking for a book on investment or finance on Amazon? A terrible match, right?

You talk to a friend, who suggests posting your question on the subreddit r/booksuggestions. The problem here is that most people in the same position get irrelevant answers, have to wait a while, or never get an answer at all.

We ask questions about the products we're interested in all the time, but do we always receive relevant answers? What if, while typing your question, you were shown topics, words, or products specific to your interest? Or if, after posting it, you received a notification with relevant answers? Cool, right?

As a data scientist, let's build a recommender system for books using Natural Language Processing and neural networks. More specifically, when someone asks for a recommendation on the subreddit r/booksuggestions, can we recommend books that other people have read and suggested in similar contexts?

Part 1: Data Collection and Cleaning

Our work is in two parts: first, I collect and clean the subreddit posts through api.pushshift.io; second, I explore and model the data. The data spans July 25, 2010, to March 30, 2021, which amounts to 109,122 posts, or about 37 million characters. Depending on the type of analysis we do (tokenization, TF-IDF, CountVectorizer…), this can take up all our memory very quickly; even a hardware accelerator (GPU or TPU) with high RAM (between 27GB and 32GB) on Google Colab Pro crashed multiple times. The code I used to access, download, and merge 1,092 batches of posts relies on the pandas, os, requests, and re libraries.
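A minimal sketch of that download-and-merge loop might look like the following (Pushshift's parameters and rate limits have changed over time, and the function names here are mine, so treat this as illustrative rather than the exact code):

```python
import os
import time

import pandas as pd
import requests

URL = "https://api.pushshift.io/reddit/search/submission"

def fetch_batches(subreddit="booksuggestions", before=None, n_batches=1092, out_dir="batches"):
    """Walk backward in time, saving one CSV of posts per batch."""
    os.makedirs(out_dir, exist_ok=True)
    for i in range(n_batches):
        params = {"subreddit": subreddit, "size": 100, "sort": "desc"}
        if before is not None:
            params["before"] = before
        posts = requests.get(URL, params=params).json()["data"]
        if not posts:
            break
        pd.DataFrame(posts).to_csv(os.path.join(out_dir, f"batch_{i:04d}.csv"), index=False)
        before = posts[-1]["created_utc"]  # oldest post in this batch
        time.sleep(1)  # stay well under the API's rate limit

def merge_batches(out_dir="batches"):
    files = sorted(os.listdir(out_dir))
    return pd.concat(
        (pd.read_csv(os.path.join(out_dir, f)) for f in files), ignore_index=True
    )
```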

Once we have put the data into a data frame, we keep the columns we'd like to work with and create a new column ("text") that combines the titles ("title") and post bodies ("selftext"). Any row with neither a title nor a body will have a missing value (NaN), which we need to address.
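In pandas, that step might look like this (the column names come from the Pushshift payload; the exact filtering is an assumption):

```python
df = merge_batches()

# Keep the columns we need and combine title and body into one text field.
df = df[["created_utc", "title", "selftext"]]
df["text"] = (df["title"].fillna("") + " " + df["selftext"].fillna("")).str.strip()

# Rows with neither a title nor a body are the NaN cases mentioned above.
df = df[df["text"] != ""].reset_index(drop=True)
```

Here is the output: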

Data frame of subreddit posts

From there, we are ready to start cleaning our text. It's essential to know that there is no one-size-fits-all recipe; the key is to look at some posts and find patterns. For instance, we have a lot of newline escape sequences ("\n"). Just by removing those, we can detect more entities (864 more books) in our posts. On the other hand, by eliminating all special characters (keeping only numbers and letters), we detect fewer entities (3,041 fewer books).
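A minimal version of that cleaning step, following the findings above (strip newlines, but keep punctuation, since removing special characters hurt entity detection), might be:

```python
import re

def clean_text(text: str) -> str:
    text = text.replace("\n", " ")    # newline escape sequences hide entities
    text = re.sub(r"\s+", " ", text)  # collapse runs of whitespace
    return text.strip()               # punctuation is deliberately kept

df["text"] = df["text"].apply(clean_text)
```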

Our text after cleaning

Part 2: Exploratory Data Analysis and Modeling

The second part involves using the NLTK, Gensim, and spaCy libraries to process and understand the large volume of text we have. In this part, how well our text is cleaned determines the results we get. Let's now use our data to create tokens (the words used within each post) and then tag them. Here is the raw markup of our entity detection on one post:

Entity detection
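With spaCy, producing that markup takes only a few lines (en_core_web_sm is an assumption; a larger pipeline may catch more entities):

```python
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")

doc = nlp(df["text"].iloc[0])
for ent in doc.ents:
    print(ent.text, "->", ent.label_)

# displacy renders the highlighted entity markup shown above
displacy.render(doc, style="ent")
```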

Once we have each entity, we can filter out books and count their occurrences in each post. SpaCy labels books as "WORK_OF_ART," as shown below. One thing to note is misclassification: some entities were misclassified, especially among books, persons, and organizations (companies, agencies, institutions). We can imagine a blurry line around entities that contain other entities. For instance, "The Poisonwood Bible," a book by Barbara Kingsolver, was split into two books: "The Poisonwood" and "Bible." A good pattern for stitching an entity back together is capitalization: whenever a capitalized word follows another capitalized word with no punctuation in between, the sequence is likely part of the same entity.

Summary of spaCy’s entity types (Source: SpaCy)
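Filtering those entities down to books and counting them is straightforward (counting each title once per post is my assumption about the methodology):

```python
from collections import Counter

book_counts = Counter()
for doc in nlp.pipe(df["text"], batch_size=256):
    books = {ent.text for ent in doc.ents if ent.label_ == "WORK_OF_ART"}
    book_counts.update(books)  # one count per post, even if a title repeats

print(book_counts.most_common(10))
```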

Here are our top 10 books:

Top 10 books with their count

Less than 13% of our posts mention any books. Yet, let's use those as the basis for our recommender. For that, we use the Continuous Skip-gram model from Word2Vec, a two-layer neural network that takes in a target word and learns to predict its context words as output. This is where we train our model to understand the structure and language of our text and develop a vocabulary; in other words, to understand the relationship between words by looking at their positions and how often they occur within each sentence. Similar words tend to be used around the same words, so when words are represented as vectors, similar words point in roughly the same direction. A common technique for measuring that similarity is the cosine similarity score, which measures the angle between two vectors and ranges from -1 to 1: the smaller the angle, the more similar the words, and the higher the score. For word vectors trained this way, similar words score close to 1. There are excellent resources on cosine similarity if you'd like to read more on it.
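With Gensim 4.x, training a skip-gram model on the tokenized posts looks roughly like this (the hyperparameters are illustrative, not the ones used in the original analysis):

```python
from gensim.models import Word2Vec

# One list of tokens per post; a plain split() stands in for the NLTK tokenization.
sentences = [text.split() for text in df["text"]]

model = Word2Vec(
    sentences,
    vector_size=100,  # dimensionality of the word vectors
    window=5,         # context words considered on each side of the target
    min_count=5,      # ignore tokens seen fewer than five times
    sg=1,             # 1 = skip-gram, 0 = CBOW
    workers=4,
)
```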

Let’s now query our model as an example by typing in “Bible”:
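With Gensim, that query is a single call:

```python
# Nearest neighbors of "Bible" in the embedding space
model.wv.most_similar("Bible", topn=3)
```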

The most similar word to “Bible” is “Quran,” and our cosine similarity score is 0.76.

Since 87% of our posts do not contain any books, let’s use our newly trained model to recommend some books to those individuals.

Creating a function that takes in words and returns three books
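A plausible sketch of such a function, scoring each detected book title against the post's words with Gensim's n_similarity (my choice of similarity measure, not necessarily the original's), might be:

```python
def recommend_books(post_text, model, known_books, topn=3):
    """Return the topn books whose titles are most similar to the post's words."""
    tokens = [t for t in post_text.split() if t in model.wv]
    scored = []
    for book in known_books:
        title_tokens = [t for t in book.split() if t in model.wv]
        if tokens and title_tokens:
            # cosine similarity between the mean vectors of the two word sets
            scored.append((book, model.wv.n_similarity(tokens, title_tokens)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]
```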

Let’s test our model by using a post without any books.

Input and Output of our model

Our model returns the books "The Twilight Zone," "The Ritual," and "Teeth," with scores of 0.989369, 0.988831, and 0.988466, respectively.

Conclusion & Recommendation

Good news! We’ve successfully made our first recommender system. The main takeaways are:

  • The model is indeed able to recommend books based on context with high similarity scores (more than 0.98)
  • Be careful with preprocessing methods (how the data is cleaned)

Some use cases of our model are:

  • Giving relevant answers to queries (e.g., recommending movies on Netflix or videos on YouTube)
  • Prompting conversations around relevant subjects

An extension of this work could be to try other machine learning and neural network techniques with different parameters (a network with more layers, for instance) to make our model richer for other cool projects. Let me know if you have any suggestions or comments.

Source code can be found on my GitHub. Other sources include Susan Li and Rajat Jain.
