How to do Content-Based Filtering using TF-IDF?

Ankur Dhuriya · Analytics Vidhya · Nov 10, 2020

Content-based filtering is about extracting knowledge from the content.

In a content-based Recommender system, keywords are used to describe the items and a user profile is built to indicate the type of item this user likes. In other words, these algorithms try to recommend items that are similar to those that a user liked in the past (or is examining in the present).

  • Create a user profile from the user's previous interactions.
  • Create a content analyzer that, in effect, builds a profile for each item.
  • Retrieve items for a user by comparing the user profile against the item profiles (a minimal sketch follows this list).
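Here is a minimal sketch of those three steps. The item texts, the liked list, and all variable names are illustrative, not from the article's actual data set:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

items = [
    'action movie with car chases',
    'romantic comedy set in Paris',
    'space action adventure film',
]
liked = [0]  # the user liked item 0

tf = TfidfVectorizer()
item_profiles = tf.fit_transform(items)  # one TF-IDF vector per item

# user profile = average of the vectors of the items the user liked
user_profile = np.asarray(item_profiles[liked].mean(axis=0))

# retrieve: rank all items by similarity to the user profile
scores = cosine_similarity(user_profile, item_profiles).ravel()
print(scores.argsort()[::-1])  # item 2 (shares 'action') outranks item 1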

TF-IDF

  • Stands for term frequency, inverse document frequency.
  • These are two closely related metrics that search engines use to figure out the relevance of a given word to a document within a larger body of documents. For example, every Wikipedia article, and every page on the web, has a term frequency for each word that appears in it.
  • Term frequency (TF) is simply how often a given word occurs in a given document: within one web page or one Wikipedia article, what is the ratio of that word's occurrences to all the words in that document? A word that occurs frequently is probably important to that document's meaning.
  • Document frequency (DF) is how often a word occurs across the entire set of documents, i.e., all of Wikipedia or every web page. This identifies common words that appear everywhere regardless of topic, like 'a', 'the', and 'and'.
  • A word with both a high TF and a high DF is therefore not, on its own, a good measure of the relevance of a word to a document.

So a measure of the relevance of a word to a document might be: TF / DF

Or: term frequency * inverse document frequency

That is, how often the word appears in this document, divided by how often it just appears everywhere. That gives you a measure of how important and unique this word is to this document.

  • In practice we use the log of the IDF, since word frequencies are distributed exponentially; the log gives a better weighting of a word's overall popularity (see the worked example after this list).
  • TF-IDF assumes a document is just a "bag of words".
  • Parsing documents into bags of words can be most of the work.
  • Words can be represented as hash values (numbers) for efficiency.
  • What about synonyms? Various tenses? Abbreviations? Misspellings?
  • Doing this at scale is the hard part!
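To make the formulas above concrete, here is a hand-rolled toy example. The corpus and numbers are purely illustrative, and this is the plain textbook variant tf * log(N / df); sklearn's TfidfVectorizer uses a smoothed formula, so its exact numbers differ:

import math

docs = [
    'the cat sat on the mat',
    'the dog sat on the log',
    'cats and dogs',
]

def tf(word, doc):
    words = doc.split()
    return words.count(word) / len(words)  # occurrence ratio within this document

def df(word):
    return sum(word in d.split() for d in docs)  # number of documents containing the word

def tf_idf(word, doc):
    return tf(word, doc) * math.log(len(docs) / df(word))

print(tf_idf('the', docs[0]))  # ~0.14: common everywhere, so low relevance
print(tf_idf('cat', docs[0]))  # ~0.18: rarer overall, so higher relevance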

Applying TF-IDF to find similar posts for users based on their social media activity

Data set description: df_posts contains social media posts with (at least) a title, a category, and a post_id column, as used in the code below.

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2), min_df=0, stop_words='english')

Here TfidfVectorizer converts the raw documents into a matrix of TF-IDF features. ngram_range=(1, 2) means I only want unigrams and bigrams, and min_df=0 means a word is kept in the feature vectors even if it appears in just one document.
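As a quick illustration of what that configuration extracts (a toy sentence, not the actual data; note that get_feature_names_out is the newer sklearn spelling, older versions call it get_feature_names):

tf.fit_transform(['machine learning is fun'])
print(tf.get_feature_names_out())
# -> ['fun', 'learning', 'learning fun', 'machine', 'machine learning']
# 'is' is removed as an English stop word; bigrams are built from the rest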

import matplotlib.pylab as plt
import scipy.sparse as sparse
from sklearn.metrics.pairwise import linear_kernel

tf_matrix1 = tf.fit_transform(df_posts['title'])
plt.spy(tf_matrix1)  # visualize the sparsity pattern of the title matrix
tf_matrix2 = tf.fit_transform(df_posts['category'])
plt.spy(tf_matrix2)  # visualize the sparsity pattern of the category matrix

csm1 = linear_kernel(tf_matrix1, tf_matrix1)
csm2 = linear_kernel(tf_matrix2, tf_matrix2)
csm_tf = (csm1 + csm2) / 3  # combined similarity from titles and categories

Here linear_kernel computes the dot product of every row of the matrix with every other row, which for L2-normalized TF-IDF vectors is exactly their cosine similarity.
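A quick sanity check of that equivalence (TfidfVectorizer normalizes with norm='l2' by default, which is what makes the plain dot products coincide with cosine similarity):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

print(np.allclose(linear_kernel(tf_matrix1, tf_matrix1),
                  cosine_similarity(tf_matrix1, tf_matrix1)))  # True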

def cleanData(x):
    # lowercase strings; anything else (e.g., NaN) becomes an empty string
    if isinstance(x, str):
        return x.lower()
    return ''

def combine(x):
    # work on new columns so the original data is not affected
    return x['title1'] + ' ' + x['category1']

features = ['title', 'category']
for feature in features:
    df_posts[feature + '1'] = df_posts[feature].apply(cleanData)

df_posts['merged'] = df_posts.apply(combine, axis=1)

from sklearn.metrics.pairwise import cosine_similarity

count = CountVectorizer(stop_words='english')
count_matrix = count.fit_transform(df_posts['merged'])
csm_count = cosine_similarity(count_matrix, count_matrix)

# delete the helper columns as processing is done on the merged column
df_posts.drop(columns=['title1', 'category1', 'merged'], inplace=True)
df_posts.drop(columns='post_id', inplace=True)
import pandas as pd

# 'indices' is not defined in the original snippet; presumably it maps a post
# title to its row position, built along these lines:
indices = pd.Series(df_posts.index, index=df_posts['title']).drop_duplicates()

def recommend(post, csm=(csm_tf + csm_count) / 2):  # choosing this csm as it covers both aspects
    idx = indices[post]
    score_series = list(enumerate(csm[idx]))
    score_series = sorted(score_series, key=lambda x: x[1], reverse=True)
    score_series = score_series[1:11]  # skip the first entry, the post itself
    post_indices = [i[0] for i in score_series]
    return df_posts.loc[post_indices].style.hide_index()

The recommend function returns the posts most similar to the post title you search for.
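For example (the title passed below is illustrative; use any title that actually appears in df_posts):

recommend('my trek to the himalayas')  # returns the 10 most similar posts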

Link to code

https://github.com/ankurdhuriya/Content-Based-Recommendation-System
