Building a State of the Art Recommendation System from scratch — Content Based Filtering (Part 2)
This is part 2 of building a recommendation system from scratch! For part 1, please click on this link.
1.2 Collaborative filtering based on items
In the previous tutorial, we learned how to build a simple recommendation system using collaborative filtering on top of users.
Well, you can also use collaborative filtering on items/products!
It's very simple. As in the previous example (movie recommendations based on watched/not watched), you find all the users who watched a given movie and then find which other movies those users watched most.
Let's apply that to the example dataset from the previous tutorial:
1> Let's say you want to recommend movies similar to Movie 1 to a user, based on collaborative filtering on items (using a very simple technique).
2> First, find all the users who have watched that movie.
3> Then, for every other movie, count how many of those users watched it.
Movie 2 — 2 count
Movie 3 — 1 count
Movie 4 — 3 count
Movie 5 — 2 count
Movie 6 — 0 count
Movie 7 — 4 count
Movie 8 — 2 count
Movie 9 — 1 count
Movie 10 — 3 count
Now you have the similar movies, based on simple collaborative filtering on top of movies (items).
You can add your own weights to get better results, or simply sort the list by count. (Note that some items have the same count, so you can break ties with weights such as average watch time, number of likes, number of ratings, or average rating.)
So the output recommendation might be Movie 7, Movie 10, Movie 4, and so on.
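The three steps above can be sketched in a few lines of Python. This is only a minimal sketch: the `watched` dictionary is a made-up watch history (not the dataset from part 1), and the function name `similar_movies` is my own choice for illustration.

```python
from collections import Counter

# Hypothetical watch history (user -> set of movies watched).
# This is NOT the dataset from part 1, just toy data for illustration.
watched = {
    "user_a": {"Movie 1", "Movie 4", "Movie 7"},
    "user_b": {"Movie 1", "Movie 7", "Movie 10"},
    "user_c": {"Movie 2", "Movie 3"},
    "user_d": {"Movie 1", "Movie 2", "Movie 7"},
}

def similar_movies(target, watched, top_n=3):
    """Rank other movies by how many viewers of `target` also watched them."""
    counts = Counter()
    for movies in watched.values():
        if target in movies:                  # step 2: users who watched the target
            counts.update(movies - {target})  # step 3: count their other movies
    # Sort by count (highest first); ties are broken alphabetically by title.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]

recommendations = similar_movies("Movie 1", watched)
print(recommendations)
```

Here the tie between equally counted movies is broken alphabetically just to keep the output stable; in a real system you would plug in your own weights at that point.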
You can also check this awesome article on the item-item collaborative filtering algorithm.
You can also check this article on cosine-similarity-based collaborative filtering.
Note: the above algorithms are standard and widely used, but in this tutorial I am taking a different route to explain how to build a recommendation system from scratch, using some cool math logic that works just like the above but with better efficiency.
So that was it. But wait, what's next?
Umm, there is a problem, guys: a really interesting and big problem with using collaborative filtering.
Did you notice that collaborative filtering always depends on data?
All the above algorithms are driven entirely by the data they are given. The more accurate the data, the better the recommendations become.
But what if you have no such data at the beginning? Or what if the user is new to the system and you haven't gathered much data about them? (This is commonly called the cold-start problem.)
How will our system recommend anything then?
That's challenging to think about!
But here's the good news: that's where content-based filtering comes in!
2.1 Content Based Filtering
Content-based filtering, also referred to as cognitive filtering, recommends items based on a comparison between the content of the items and a user profile. The content of each item is represented as a set of descriptors or terms, typically the words that occur in a document.
What do you mean by content similarity?
Car is closely related to Toyota (the popular car brand),
whereas Cat is related to Tom (the famous Tom Cat from Cartoon Network!).
Or Annabelle is similar to The Conjuring, and partially to Insidious (horror, plus the famous doll ghost).
Did you get it?
No detailed user data is required to build a content-based filtering recommendation system; you just need enough metadata about each item, and that's it!
But the problem is how to match similar content. I know that Toyota is a car company and that the famous Tom is a cat, but how does the computer get to know the similarity?
Let's go through this step by step.
2.2> Let's check the percentage of similarity between two pieces of content based on keyword matching.
Item 1: Laptop with Core 8th gen core i5 and 8gb ram from Dell with dedicated Nvidia Graphics Card.
Item 2: The all new mac book pro with 8th gen core i5 and 8gb ram
Item 3: The New Oneplus has just gone better with 8gb ram
Let's put this into simple keyword-based matching!
1> Find all the keywords in items 1, 2, and 3 and assign each a count (using term frequency), or use any keyword extraction algorithm of your choice!
Extracted Keywords :
a> nvidia, graphics, laptop, card, dell, core, i5, 8gen, 8gb, ram
b> mac, 8gen, i5, core, 8gb, ram
c> Oneplus, 8gb, ram
So I simply extracted the top keywords (tokenized, stemmed, and with stop words and verbs removed); a, b, and c correspond to items 1, 2, and 3.
Now we can simply match each document against the others by the number of keywords they have in common.
For example, Compare(a, b) is 5/12 (number of keywords matched divided by the total number of unique keywords across all three items) ≈ 0.42.
Similarly, Compare(a, c) is 2/12 (same denominator) ≈ 0.17.
So you can clearly see that item a is more similar to item b (match ratio 0.42) than to item c (match ratio 0.17).
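Here is a minimal sketch of that keyword comparison, assuming the keywords have already been extracted as listed above. The function name `compare` and the variable `vocabulary` are my own names for illustration; the denominator is the 12 unique keywords across all three items.

```python
# Keywords as extracted above (lowercased for matching).
a = {"nvidia", "graphics", "laptop", "card", "dell", "core", "i5", "8gen", "8gb", "ram"}
b = {"mac", "8gen", "i5", "core", "8gb", "ram"}
c = {"oneplus", "8gb", "ram"}

# All unique keywords across the three items (12 in total).
vocabulary = a | b | c

def compare(x, y, vocabulary):
    """Shared keywords divided by the total number of unique keywords."""
    return len(x & y) / len(vocabulary)

score_ab = compare(a, b, vocabulary)  # 5 shared keywords out of 12
score_ac = compare(a, c, vocabulary)  # 2 shared keywords out of 12
print(round(score_ab, 2), round(score_ac, 2))  # 0.42 0.17
```

Because every pair is divided by the same vocabulary size, the scores are directly comparable across pairs, which is all we need for ranking.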
Similarly, you can use n-grams, Jaccard similarity, cosine similarity, and many other algorithms to compare the similarity of strings and documents.
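As a rough sketch of two of those options, here is Jaccard similarity (shared items divided by the size of the union) applied first to keyword sets and then to character n-grams. The helper names `jaccard` and `char_ngrams`, and the two sample strings, are my own choices for illustration.

```python
def jaccard(x, y):
    """Jaccard similarity: size of the intersection over size of the union."""
    x, y = set(x), set(y)
    union = x | y
    return len(x & y) / len(union) if union else 0.0

def char_ngrams(text, n=3):
    """Set of all character n-grams in the lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

a = {"nvidia", "graphics", "laptop", "card", "dell", "core", "i5", "8gen", "8gb", "ram"}
b = {"mac", "8gen", "i5", "core", "8gb", "ram"}

# Keyword-level Jaccard: 5 shared keywords out of 11 in the union of a and b.
jaccard_ab = jaccard(a, b)

# Character trigrams tolerate small spelling differences such as "8gb" vs "8 gb".
trigram_score = jaccard(char_ngrams("8gb ram"), char_ngrams("8 gb ram"))
print(round(jaccard_ab, 2), round(trigram_score, 2))
```

Note that Jaccard divides by the union of just the two items being compared, so its numbers differ from the simple ratio used earlier, which divides by the keywords of all three items.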
Simple, right? Then check this for a better and smarter implementation of a content-based matching algorithm in Python.
Okay, now also look at this and this example!
We were pretty close, phew! But these examples gave better comparisons.
Just matching keywords or computing tf-idf will not always give users good recommendations; we need to thoroughly understand the context and semantics of each word and its relation to the documents. So what do we do?
Semantic Similarity?
Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between them is based on the likeness of their meaning or semantic content, as opposed to similarity that can be estimated from their syntactic representation.
So we need to implement that, huh! Well, how?
We can start by using the WordNet/ConceptNet datasets.
These give us the synonyms, antonyms, and relations between words, which we can use to relate the words in one document to those in another.
Similarly, we can use word2vec (word to vector) to measure semantic similarity.
What is word2vec?
Word2vec builds on the idea that similar words are used in similar contexts.
I can say “I like playing football” or “I like playing Chess”.
You see, even though the games are different, the words ‘I like playing’ sit together before the name of the game.
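To make the "similar contexts" idea concrete, here is a tiny sketch that collects the words appearing near each word in the two sentences above. This is not word2vec training, just a count of shared context words; the function name `context_words` and the window size of 2 are assumptions for illustration.

```python
from collections import defaultdict

sentences = ["I like playing football", "I like playing Chess"]

def context_words(sentences, window=2):
    """Map each word to the set of words seen within `window` positions of it."""
    contexts = defaultdict(set)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # Neighbours on both sides, excluding the word itself.
            contexts[word].update(tokens[lo:i] + tokens[i + 1:hi])
    return contexts

contexts = context_words(sentences)
shared = contexts["football"] & contexts["chess"]
print(shared)  # the two games share the context words "like" and "playing"
```

Since "football" and "chess" keep the same neighbours, a model built on this idea ends up placing them close together.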
A pretrained word2vec model has already read millions of sentences and has learned which words appear near which others, stored in a vectorized format.
Let's say there is a sentence: I am Abhik Saha and i live in Mumbai.
Then the starting vector for Abhik will be [0 0 1 0 0 0 0 0 0].
We get it simply by counting all the words in the sentence and putting a 1 at the position of the word in that sentence; this is called one-hot encoding.
Now, if I have 1 million distinct words in a large text, I will end up with vectors of size 1 million, so later we need to squeeze them into something smaller.
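The one-hot vector above can be reproduced with a short sketch. It follows the text and uses the word's position in the sentence (so "I" and "i" get separate slots); the function name `one_hot` is my own, and a real system would normally index into a deduplicated vocabulary instead.

```python
sentence = "I am Abhik Saha and i live in Mumbai"
tokens = sentence.split()  # 9 tokens, one slot per position in the sentence

def one_hot(word, tokens):
    """Vector of zeros with a single 1 at the first position of `word`."""
    vector = [0] * len(tokens)
    vector[tokens.index(word)] = 1
    return vector

abhik_vector = one_hot("Abhik", tokens)
print(abhik_vector)  # [0, 0, 1, 0, 0, 0, 0, 0, 0]
```

You can see why this does not scale: the vector is as long as the word list and is almost entirely zeros.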
We can use word2vec to find the words most similar to a given word, or we can put in two words and find the difference between them!
Check this http://bionlp-www.utu.fi/wv_demo/
We will then have them inserted as word vectors in a matrix.
Word analogy: an analogy identifies a similarity between like features of two different things by identifying a relationship between a pair of words.
For example, dog is to puppy as cat is to kitten.
You can use word2vec combined with WordNet to find similar terms, their synonyms, and similarity at the word level, and build a better, more relevant, and smarter content similarity engine.
_______________________________________________________________
So, after doing all the word2vec math:
You will get a vector such as [0.4455 0.344 ………. 0.554] for each piece of content, after which you can use either cosine similarity or Euclidean distance to measure how close two vectors are, which gives your string/content similarity match percentage.
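Both measures are easy to write by hand. Below is a minimal sketch in plain Python; the three short vectors are made-up values for illustration only (real word2vec vectors typically have hundreds of dimensions), and the function names are my own.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

def euclidean_distance(u, v):
    """Straight-line distance between u and v (0.0 means identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# Made-up 3-dimensional vectors, purely for illustration.
doc_a = [0.4, 0.3, 0.5]
doc_b = [0.8, 0.6, 1.0]   # same direction as doc_a, twice as long
doc_c = [0.3, -0.4, 0.0]  # perpendicular to doc_a

print(cosine_similarity(doc_a, doc_b))   # close to 1.0: same direction
print(cosine_similarity(doc_a, doc_c))   # close to 0.0: unrelated direction
print(euclidean_distance(doc_a, doc_b))
```

Notice that doc_a and doc_b have a cosine similarity of about 1 even though their Euclidean distance is not zero: cosine ignores vector length, which is often what you want when comparing texts of different sizes.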
Simple semantic similarity with word vectors, using spaCy's built-in similarity method:
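import spacy
# Requires the large English model: python -m spacy download en_core_web_lg
nlp = spacy.load('en_core_web_lg')
# Each call returns a Doc; by default its vector is the average of its word vectors
doc1 = nlp('Jack and David are good Friends')
doc2 = nlp('David was travelling to New York')
doc3 = nlp('Jack told his friend David to help him with his science project')
# Doc.similarity compares the two document vectors using cosine similarity
print('1, 2 documents similarity: ' + str(doc1.similarity(doc2)))
print('1, 3 documents similarity: ' + str(doc1.similarity(doc3)))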
Besides that, you need to keep in mind important factors like genre and cast in movie recommendation, or product cost and product details in eCommerce recommendation, etc.
We will see how we can use simple matrix factorization for recommendation using such factors in the next tutorial.
Thank You,
Abhik Saha
https://theblockchainu.com