Content-Based Recommender Systems

Carlos Pinela
3 min read · Nov 28, 2017


In my last post we analyzed User-Based and Item-Based Collaborative Filtering, which use the interactions of users with items to make recommendations. There we saw that these systems have some problems:

  • Cold-start for new users.
  • New-item problem.
  • Sparsity.
  • Transparency.

Content-Based Recommender Systems are born from the idea of using the content of each item for recommending purposes, trying to solve the problems described above. Here are some pros and cons of this new approach:

Pros:

  • Unlike Collaborative Filtering, if the items have sufficient descriptions, we avoid the “new item problem”.
  • Content representations are varied, which opens the door to different approaches: text-processing techniques, semantic information, inference, etc…
  • It is easy to make a more transparent system: we use the same content to explain the recommendations.

Cons:

  • Content-Based RecSys tend toward over-specialization: they will recommend items similar to those already consumed, with a tendency to create a “filter bubble”.
  • The methods based on Collaborative Filtering have empirically proven to be more precise when generating recommendations.

Architecture of a Content-Based Recommender System

The three principal components are:

  • A Content Analyzer, which classifies the items using some sort of content representation (more on this later in the post).
  • A Profile Learner, which builds a profile representing each user’s preferences.
  • A Filtering Component, which takes both as input and generates the list of recommendations for each user.
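The pipeline above can be sketched in a few lines of Python. This is a minimal, illustrative toy (the function names and the keyword-overlap scoring are my own choices, not a standard library or a production technique):

```python
def content_analyzer(item_description):
    """Represent an item as a set of lowercase keywords."""
    return set(item_description.lower().split())

def profile_learner(liked_descriptions):
    """Build a user profile as the union of keywords of liked items."""
    profile = set()
    for description in liked_descriptions:
        profile |= content_analyzer(description)
    return profile

def filtering_component(profile, candidates, top_n=2):
    """Rank candidate items by keyword overlap with the user profile."""
    scored = []
    for name, description in candidates.items():
        overlap = len(profile & content_analyzer(description))
        scored.append((overlap, name))
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_n]]

profile = profile_learner(["space opera adventure", "space exploration"])
catalog = {
    "Dune": "space politics adventure",
    "Romance A": "love story drama",
    "Martian": "space survival exploration",
}
print(filtering_component(profile, catalog))  # the two space-themed books win
```

A real system would replace the keyword sets with a richer representation, which is exactly what the next section is about.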

How to represent the content?

The content of an item is an abstract notion, and there are many variables we could use. For a book, for example, we could consider the author, the genre, the text of the book itself… the list goes on.

Once we know which content we will consider, we need to transform all this data into a Vector Space Model, an algebraic representation of text documents.

Generally, we do this with a Bag of Words model, which represents documents while ignoring the order of the words. In this model, each document looks like a bag containing some of the words from a dictionary: the counts matter, but the positions do not.
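A bag of words is easy to build with just the standard library (the toy sentence below is illustrative):

```python
from collections import Counter

def bag_of_words(document):
    """Count word occurrences, ignoring order entirely."""
    return Counter(document.lower().split())

bag = bag_of_words("the cat sat on the mat")
print(bag)  # word order is lost; only the counts remain
```

Note that “the cat sat on the mat” and “the mat sat on the cat” produce exactly the same bag, which is the model’s main limitation.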

A specific implementation of a Bag of Words is the TF-IDF representation, where TF stands for Term Frequency and IDF for Inverse Document Frequency. This model combines how important a word is within a document (local importance) with how important it is across the corpus (global importance).
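Here is a small sketch over a toy corpus, assuming the common tf × log(N / df) weighting (several variants of the formula exist):

```python
import math
from collections import Counter

corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "dogs and cats are pets",
]
docs = [doc.split() for doc in corpus]
N = len(docs)

def tf(term, doc):
    """Local importance: relative frequency of the term in the document."""
    return Counter(doc)[term] / len(doc)

def idf(term):
    """Global importance: rare terms across the corpus get a higher weight."""
    df = sum(1 for doc in docs if term in doc)
    return math.log(N / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

# "the" appears in most documents, so its global weight is low;
# "mat" is rare in the corpus, so it is weighted higher.
print(tf_idf("the", docs[0]), tf_idf("mat", docs[0]))
```

This is why TF-IDF downweights common words like “the” without needing an explicit stop-word list.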

Different Components of a TF-IDF representation

Do you want to learn more?

That was a general overview of Content-Based Recommender Systems.

We should acknowledge that the Bag of Words representation does not consider the context of words. If capturing that context is important to us, Semantic Content Representation becomes relevant. Below are the two options one has, in case you want to know more about it.

Option 1 — Explicit Semantic Representation:

Option 2 — Infer Semantic Representation:

LSA and LDA
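To give a taste of the inferred option: LSA (Latent Semantic Analysis) applies a truncated SVD to the term-document matrix, projecting documents into a low-dimensional latent “topic” space. A minimal sketch with NumPy, using a tiny illustrative matrix:

```python
import numpy as np

# Rows = terms, columns = documents (raw counts).
term_doc = np.array([
    [2, 1, 0],   # "cat"
    [1, 2, 0],   # "dog"
    [0, 0, 3],   # "recipe"
], dtype=float)

U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
k = 2  # keep only the two strongest latent dimensions
doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # one low-dim vector per document

print(doc_vectors.shape)  # three documents, two latent dimensions each
```

In the latent space, the first two documents (both about pets) end up close together, while the third stays far away, even though no document shares an exact vocabulary with the others beyond these counts.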

See you on the next post.
