Building NLP Content-Based Recommender Systems

A tutorial for an NLP recommendation engine using unsupervised learning

Armand Olivares
8 min read · Jul 7, 2019

Let’s walk through an approach for building recommender systems when you have text data.

Introduction

In this post we will build a job recommendation system with a content-based approach, using datasets hosted on Kaggle.

1. Getting Ready

For this post we will need Python 3.6, spaCy, NLTK, and scikit-learn. If you do not have them yet, please install all of them.

2. The Process

Here we are using the data from this challenge on Kaggle. The 4 datasets are as follows:

  • The Combined_Jobs_Final.csv file: the main jobs data (title, description, company, etc.).
  • The Job_Views.csv file: the jobs viewed by each user.
  • The Experience.csv file: each user’s work experience.
  • The Positions_Of_Interest.csv file: the positions in which each user has previously expressed interest.

The process to build the recommender systems is as follows:

The process starts by cleaning and building the datasets. We then extract numerical features from the data and apply a similarity function (cosine similarity, for example) between the jobs a user has viewed or expressed interest in and the available jobs. Finally, we take the top recommended jobs according to the similarity score.

2.1 Building the Datasets

In every data project, the first step is to explore and clean the data we have. Since there are 4 datasets, we are going to merge them so that we end up with 1 dataset for jobs and 1 dataset for users.

2.1.1 For the jobs dataset:

First we read the data and get some basic info about it:
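The code was embedded as a gist in the original post; here is a minimal sketch, assuming the Kaggle CSV files sit in the working directory:

```python
import pandas as pd

# Load the main jobs file (adjust the path to where you stored the Kaggle data)
jobs = pd.read_csv("Combined_Jobs_Final.csv")

# Column names, dtypes, and non-null counts
jobs.info()
```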

As we can see there are 23 columns; however, for this article we will only use ‘Job.ID’, ‘Title’, ‘Position’, ‘Company’, ‘City’, and ‘Job_Description’.

Then, as part of the preprocessing, we:

  1. Impute the missing values, if any.
  2. Remove stop words.
  3. Remove non-alphanumeric characters.
  4. Lemmatize the text.
  5. Finally, merge all the columns to create a corpus of text for each job.

We put steps 2–4 into a function called “clean_txt”:
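The original function is not shown in this export; below is one plausible sketch of it (the exact implementation is an assumption), combining NLTK stop words with spaCy lemmatization:

```python
import re
import spacy
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
stop_words = set(stopwords.words("english"))
# Small English model; install with `python -m spacy download en_core_web_sm`
nlp = spacy.load("en_core_web_sm")

def clean_txt(text):
    """Lowercase, strip non-alphanumeric characters, drop stop words, lemmatize."""
    if not isinstance(text, str):
        return ""
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text.lower())  # keep only alphanumerics
    doc = nlp(text)
    tokens = [tok.lemma_ for tok in doc
              if tok.text not in stop_words and not tok.is_space]
    return " ".join(tokens)
```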

After applying steps 1–5 we end up with a clean dataset with 2 columns, Job.ID and text (the corpus for each job), as we can see:
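The resulting dataframe is shown as a screenshot in the original post; a sketch of how it might be produced (the imputation and merge details are assumptions):

```python
cols = ["Title", "Position", "Company", "City", "Job_Description"]

# Step 1: impute missing values with an empty string
jobs[cols] = jobs[cols].fillna("")

# Step 5: merge the selected columns into one corpus per job, then clean it
jobs["text"] = jobs[cols].agg(" ".join, axis=1).apply(clean_txt)

jobs_final = jobs[["Job.ID", "text"]]
print(jobs_final.head())
```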

2.1.2 For the users dataset:

For the “jobs_views” dataset:

In this case we will use only the columns ‘Applicant.ID’, ‘Job.ID’, ‘Position’, ‘Company’, and ‘City’. After selecting the columns and applying the clean_txt function, we end up with an ID column and a text column:
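A sketch of this step, reusing the clean_txt function from above (column handling is an assumption based on the description):

```python
job_views = pd.read_csv("Job_Views.csv")

# Keep only the text columns used for the user profile
view_cols = ["Position", "Company", "City"]
job_views[view_cols] = job_views[view_cols].fillna("")
job_views["text"] = job_views[view_cols].agg(" ".join, axis=1).apply(clean_txt)

job_views_final = job_views[["Applicant.ID", "text"]]
```

The “experience” and “positions of interest” files below follow the same pattern, which would yield, say, experience_final and poi_final (illustrative names).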

For the “experience” dataset:

For this file we only use Position.Name and Applicant.ID. After selecting the columns and cleaning the data, we end up with an ID column and a text column:

For the “positions of interest” dataset:

We are going to select Position.Of.Interest and Applicant.ID. After cleaning the data, we end up with an ID column and a text column:

Finally, we merge the 3 datasets on the Applicant.ID column. The final users dataset looks like:
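One way to do the merge (the names experience_final and poi_final carry over from the sketch above and are assumptions):

```python
# Stack the three per-user text sources, then collapse to one row per user
user_text = pd.concat(
    [job_views_final, experience_final, poi_final],
    ignore_index=True,
)

# Join all of a user's text fragments into a single corpus
users_final = (
    user_text.groupby("Applicant.ID")["text"]
    .apply(" ".join)
    .reset_index()
)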

2.2 Extract features from text

We are going to use both TF-IDF and CountVectorizer as feature extractors and compare the resulting recommendations.

The code for TF-IDF:
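The gist is not reproduced here; a minimal sketch with scikit-learn, fitting on the jobs corpus so that jobs and users share one vector space (the variable names are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_jobs = tfidf_vectorizer.fit_transform(jobs_final["text"])
tfidf_users = tfidf_vectorizer.transform(users_final["text"])
```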

Please refer to this page to learn more about the TF-IDF implementation.

For CountVectorizer:
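The same sketch with raw term counts instead of TF-IDF weights:

```python
from sklearn.feature_extraction.text import CountVectorizer

count_vectorizer = CountVectorizer()
count_jobs = count_vectorizer.fit_transform(jobs_final["text"])
count_users = count_vectorizer.transform(users_final["text"])
```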

Please refer to this page to learn more about the CountVectorizer implementation.

3. Recommender Systems

Since this application relies mostly on textual data and there are no ratings available for any job, we do not use matrix decomposition methods such as SVD, or correlation-coefficient-based methods such as Pearson’s R.

So we use only content-based filtering, which shows how we can recommend items to people based purely on the attributes of the items themselves.

In this post we are building 4 recommender systems:

  1. Content-based recommender with TF-IDF
  2. Content-based recommender with CountVectorizer
  3. Content-based recommender with spaCy
  4. Content-based recommender with KNN

Let’s start by thinking about how to measure the similarity between two job descriptions: we need some sort of similarity measure that looks at how much they have in common.
So what’s a good way of doing that mathematically? Cosine similarity.

3.1 Cosine Similarity

Cosine similarity is the most common metric used to measure how similar two documents are, irrespective of their size. Mathematically, it measures the cosine of the angle between two vectors projected in a multi-dimensional space.
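Formally, for two document vectors A and B, it is the dot product normalized by the vectors’ lengths:

cos(θ) = (A · B) / (||A|| ||B||)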

To illustrate the idea, let’s check the charts below:

The jobs “machine learning engineer” and “data scientist” are quite similar, so θ is close to 0° and cos(θ) is close to 1 (the closer to 1, the more similar the items).

The jobs “bartender” and “data scientist” are not similar, so θ is close to 90° and cos(θ) is close to 0.

The general idea for our case: if the cosine is close to 1, the items are similar; if it is close to 0, they are not. There is a third case, cosine equal to -1, which corresponds to vectors pointing in exactly opposite directions.

Please refer to this link to review more about cosine similarity.

3.2 Content-Based Recommender with TF-IDF

To calculate the cosine similarity in Python we use cosine_similarity from the scikit-learn package. The following code illustrates this for a given user’s jobs.

Using TF-IDF:
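A sketch consistent with the description above (recommend_tfidf and user_index are illustrative names, not from the original post):

```python
from sklearn.metrics.pairwise import cosine_similarity

def recommend_tfidf(user_index, top_n=10):
    """Top jobs for one user, ranked by cosine similarity of TF-IDF vectors."""
    # Similarity between this user's vector and every job vector
    scores = cosine_similarity(tfidf_users[user_index], tfidf_jobs).flatten()
    top = scores.argsort()[::-1][:top_n]
    recs = jobs_final.iloc[top][["Job.ID", "text"]].copy()
    recs["score"] = scores[top]
    return recs
```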

Here, scores close to one mean more similarity between items.

3.3 Content-Based Recommender with CountVectorizer

Using CountVectorizer:
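The same ranking, but over the raw term-count vectors (again a sketch, with illustrative names):

```python
def recommend_count(user_index, top_n=10):
    """Same recommender, but on raw term counts instead of TF-IDF weights."""
    scores = cosine_similarity(count_users[user_index], count_jobs).flatten()
    top = scores.argsort()[::-1][:top_n]
    recs = jobs_final.iloc[top][["Job.ID", "text"]].copy()
    recs["score"] = scores[top]
    return recs
```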

Again, scores close to one mean more similarity between items.

3.4 Content-Based Recommender Using spaCy

For this recommender we are not computing cosine similarity ourselves; instead we use pre-trained word vectors in spaCy to compute the similarity between texts, which can help produce better results.

First, for each job text we need to build a spaCy Doc:
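A sketch of this step; en_core_web_md is an assumption (a model that ships real word vectors is needed for meaningful similarity scores):

```python
# A medium/large model ships real word vectors; the small model only approximates them
nlp = spacy.load("en_core_web_md")

# One Doc per job posting; nlp.pipe streams the texts efficiently
job_docs = list(nlp.pipe(jobs_final["text"]))
```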

Then we use spaCy’s similarity function, which constructs a sentence embedding by averaging the word embeddings and computes the similarity between them. The function below computes the similarity:
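A sketch of such a function (recommend_spacy is an illustrative name):

```python
def recommend_spacy(user_text, top_n=10):
    """Rank jobs by Doc.similarity, i.e. cosine over averaged word vectors."""
    user_doc = nlp(user_text)
    scores = [user_doc.similarity(doc) for doc in job_docs]
    top = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:top_n]
    recs = jobs_final.iloc[top][["Job.ID", "text"]].copy()
    recs["score"] = [scores[i] for i in top]
    return recs
```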

For spaCy similarity, scores close to one mean more similarity between items.

3.5 KNN Recommender System

The KNN algorithm assumes that similar things exist in close proximity. In other words, similar things are near each other.

The code below computes the 10 nearest neighbors for a given user’s jobs, using TF-IDF features:
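A sketch using scikit-learn’s NearestNeighbors (recommend_knn is an illustrative name):

```python
from sklearn.neighbors import NearestNeighbors

# Cosine distance = 1 - cosine similarity, so 0 means identical
knn = NearestNeighbors(n_neighbors=10, metric="cosine")
knn.fit(tfidf_jobs)

def recommend_knn(user_index):
    """The 10 job postings nearest to a user's TF-IDF profile."""
    distances, indices = knn.kneighbors(tfidf_users[user_index])
    recs = jobs_final.iloc[indices[0]][["Job.ID", "text"]].copy()
    recs["distance"] = distances[0]  # close to 0 means more similar
    return recs
```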

This is a particular case where scores close to zero mean more similarity between items, because KNN with the cosine metric returns distances rather than similarities.

4. Evaluating the Recommendations

Since we built the recommendation systems using TF-IDF, CountVectorizer, cosine similarity, spaCy, etc., i.e., mainly text data, and since there is no predefined test matrix available for computing an accuracy score, we need to check the relevance of our recommendations manually.

To test all the recommenders, we selected random users from the user dataset:
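For example, to inspect the profile of the user examined below (a sketch; the column names follow the dataset):

```python
# Pick one user profile to sanity-check the recommenders
user = users_final[users_final["Applicant.ID"] == 326]
print(user["text"].values[0])
```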

As you can see, we selected the user with Applicant.ID 326, whose text corpus is related to Java developer roles. Let’s check the recommendations for this user:

4.1 Using TF-IDF

The results:

The recommendations look pretty good based on the data we have.

4.2 Using CountVectorizer

Pretty similar results compared to TF-IDF.

4.3 Using spaCy

spaCy uses word embeddings to compute similarity; these are the results:

In this case the results do not look as similar: the system recommends some Magento and Drupal jobs (mainly for PHP devs).

4.4 Using KNN

The top 10 recommendations are in the table below:

You can see that they are a little different from the previous recommendations. In fact, positions 9 and 10 are quite different (remember, a distance close to 1 means totally different), so for this user the system only finds 8 similar jobs.

Final Thoughts

Defining what makes a good recommendation is, in itself, a complicated question, and it’s important to decide what you’re optimizing for, because in a recommender system you care about your ability to show users new things that they will love.

In this post we built several content-based recommender systems, and for this particular case the recommendations based on cosine similarity seem to show the best results.

___________________________________________________

The code can be found in this Jupyter notebook, and you can browse more projects on my GitHub.
