Introduction to Recommendation System — Part 1
Have you ever experienced any of the following?
- Your parents have just started using Facebook and added a few friends. After a couple of days, Facebook suggests some very, very distant relatives whom you have never even heard of.
- You are window shopping, hoping to find something you want to buy. Then a bunch of items like jeans, handbags, belts, etc. show up. After a while, you realize how far you have wandered following those items and that you have totally forgotten your original purpose.
- You are in a bad mood and listening to a few pop songs. Then you notice that you keep getting other pop songs with a similar melody, never a … rock song.
All of the stories above have one thing in common: some system is capable of finding and suggesting things that you might be unaware of but are likely to find interesting. Such systems are called recommendation systems (or platforms, or engines), an emerging application of Artificial Intelligence. If you plan to start your career in the AI domain, I hope this series will give you some interest and motivation.
What is a Recommendation System (RS)?
Think of it as a broker who has spent some time with you, has somehow figured out your preferences, and is trying to find someone who matches those preferences to recommend to you. Say you are single, a cool dude, and interested in nice girls. The problem is that there are a lot of people you would need to date to figure out who suits you, which takes a lot of time. Therefore, you need applications that act as matchmakers and recommend the ones that match your preferences. The following figure is a common example from e-commerce systems.
When you buy a product (or even just show interest in some products), the RS displays other products that are close, from some perspective, to the ones you are interested in. Clearly, such recommendations bring value to both customers and providers. Going a bit further, most large e-commerce systems like Amazon or Walmart have their own RS, which is responsible for:
- Tracking customer behavior in terms of favorite products, product ratings, reviews, and even the number of clicks, product page views, etc.
- Anticipating customers’ next steps based on their historical data and providing recommendations.
The idea is quite simple. In the next sections, we will go a bit further to figure out the functional components of an RS and the basic steps to implement it.
An RS that adopts an ML approach is characterized by three basic aspects:
- Users, who are the targets of an RS.
- Items, which can be products in e-commerce systems, songs in digital music services, other users in social networks, or posts on blogs. Items are not only the input data source for an RS but also the eventual output that an RS needs to determine and present to users.
- Feedback from users on items, which can be a review, a rating, or any other signal that gives the RS a clue for building its recommendations as well as for evaluating how well a recommendation meets users’ expectations.
Given all the collected data, i.e. user profiles, item details, and feedback, it is necessary to represent the relationships between these aspects so that the RS can be improved by optimizing its objective, which is in turn modeled as an optimization problem. One example is to use a matrix to capture the relationship between any two aspects, e.g. the interest of each user in each item, as follows:
The matrix describes how much each user is interested in each item. An empty cell means that the user has not yet interacted with the item. Filling in those cells is the task of the RS: based on the user’s historical data, it recommends items that the user is not yet aware of. In other words, it predicts the weights of the blank cells and prioritizes them according to a pre-defined criterion before showing them to the user.
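To make the matrix idea concrete, here is a minimal sketch in plain Python. The user names, item names, and the 1 to 5 rating scale are made up for illustration; `None` stands for an empty cell that the RS still has to estimate.

```python
# Hypothetical user-item matrix: rows are users, columns are items.
# Ratings use an assumed 1-5 scale; None marks an empty cell.
users = ["u1", "u2", "u3"]
items = ["jeans", "handbag", "belt", "shoes"]
ratings = [
    [5, None, 3, None],
    [4, 2, None, 1],
    [None, 5, 4, None],
]

def empty_cells(matrix):
    """Return the (row, column) positions that have no feedback yet."""
    return [
        (r, c)
        for r, row in enumerate(matrix)
        for c, value in enumerate(row)
        if value is None
    ]

def sparsity(matrix):
    """Fraction of cells that are still empty."""
    total = sum(len(row) for row in matrix)
    return len(empty_cells(matrix)) / total

for r, c in empty_cells(ratings):
    print(f"to estimate: interest of {users[r]} in {items[c]}")
print(f"sparsity: {sparsity(ratings):.2f}")
```

Real systems hold millions of users and items, so the matrix is usually stored in a sparse format rather than as nested lists.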
Basically, there are two approaches that an RS can adopt:
- Content-based recommendation systems: consider the content and characteristics of the current item and recommend similar ones to users. For example, when you shop on Amazon for a pair of men’s jeans, the system will automatically list similar items.
- Collaborative filtering recommendation systems: analyze users who share the same profiles or interests or, more generally, who are close to each other from some perspective regarding the current item. Other items associated with these users are then taken into account, prioritized, and recommended to the user. The principle of this approach is to provide recommendations based on the similarity of people’s interests. For example, if you usually read (or write) technical articles on some specific topics, the system will suggest other relevant articles under a title like “Others are also interested in blah blah blah”. Sound familiar, huh?
Clearly, for the content-based approach to work well, it needs (a lot of) information about the items. And to identify which items are similar to which, it has to collect, process, and analyze data for all the items in the database. The user-based collaborative filtering approach is different. Put simply, it needs only the item_id, the list of user_id values, and the feedback related to the current item, which is why this approach is widely adopted by RSs.
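As a rough illustration of why the content-based approach needs item details, the sketch below compares items by cosine similarity over hand-made feature vectors. The items, the features, and the choice of cosine similarity are all assumptions for this example, not something prescribed by the approach itself.

```python
import math

# Hypothetical item features: [for_men, denim, trousers, accessory, slim_fit]
item_features = {
    "mens_jeans":     [1, 1, 1, 0, 1],
    "mens_chinos":    [1, 0, 1, 0, 1],
    "denim_jacket":   [1, 1, 0, 0, 0],
    "womens_handbag": [0, 0, 0, 1, 0],
}

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similar_items(item_id, k):
    """Return the k items whose features are closest to item_id."""
    target = item_features[item_id]
    others = [name for name in item_features if name != item_id]
    others.sort(key=lambda name: cosine(target, item_features[name]), reverse=True)
    return others[:k]

print(similar_items("mens_jeans", 2))
```

Note that every item needs a feature vector before any similarity can be computed, whereas a collaborative filtering system can start from the (user_id, item_id, feedback) records alone.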
How to Build a Recommendation System
In general, implementing an RS with a machine learning approach involves four main steps that repeat in a loop: collect data, normalize data, choose and train a model, and evaluate the model.
Collect data
In a nutshell, if you rely on item ratings made by users, you can just grab existing data from your database. However, as technology develops, users spend more time on the Internet, their online behavior becomes more complicated and heterogeneous, and their expectations go well beyond “1+1=2”. You have probably read somewhere on the Internet that sometimes “even customers do not know what they are looking for”. That said, guessing what to suggest to users from item ratings alone is not going to be enough. Alternatives to rating metrics include the number of mouse clicks on an item, the average session duration or interaction (bounce rate) on the item’s detail page, and so on. The list of metrics varies depending on the system and its objectives, as well as on the mathematical techniques applied in the next steps.
After collection, you may end up with a huge amount of data from different sessions, recorded in response to the different events triggered while users visit an item. The collected data is then processed using various (mathematical) techniques, and eventually a matrix of user-item interest is obtained.
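One simple way to turn raw events into a user-item interest matrix is to give each event type a weight and sum the weights per (user, item) pair. The sketch below does exactly that; the event names and weights are hypothetical, and a real system would choose them according to its own objectives.

```python
from collections import defaultdict

# Hypothetical weights: stronger signals of interest get larger values.
EVENT_WEIGHTS = {"view": 1.0, "click": 2.0, "add_to_cart": 3.0, "purchase": 5.0}

# Hypothetical event log: (user_id, item_id, event_type)
events = [
    ("u1", "jeans", "view"),
    ("u1", "jeans", "click"),
    ("u1", "belt", "view"),
    ("u2", "handbag", "purchase"),
    ("u2", "jeans", "view"),
]

def build_interest_matrix(event_log, weights):
    """Sum event weights per (user, item); missing pairs stay absent."""
    scores = defaultdict(dict)
    for user, item, kind in event_log:
        scores[user][item] = scores[user].get(item, 0.0) + weights[kind]
    return dict(scores)

matrix = build_interest_matrix(events, EVENT_WEIGHTS)
print(matrix)
```

Pairs that never appear in the log are simply missing from the result, which is where the sparsity of the matrix comes from.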
Normalize data
The matrix produced by data collection is usually sparse, with many empty cells, as shown on the left of the following figure. For machine learning algorithms to use it, you need to normalize the data, for instance to get the full matrix on the right of the figure.
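One common normalization, shown here only as an assumed example since the figure is not reproduced, is mean-centering: subtract each user's average rating from their known ratings and fill the empty cells with 0, so that an empty cell means "no opinion either way".

```python
def mean_center(matrix):
    """Subtract each user's mean rating; fill empty cells (None) with 0.0."""
    normalized = []
    for row in matrix:
        known = [v for v in row if v is not None]
        mean = sum(known) / len(known) if known else 0.0
        normalized.append([0.0 if v is None else v - mean for v in row])
    return normalized

# Hypothetical ratings; None marks an empty cell.
ratings = [
    [5, None, 3, None],
    [4, 2, None, 0],
]

for row in mean_center(ratings):
    print(row)
```

Mean-centering also removes the bias between generous and strict raters, because each user's ratings are measured against their own average.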
Choose and train the model
Once you have the complete matrix, you need to choose a model to calculate the similarity between users and items. The details of this step will be covered in another post. The goal of this step is to determine which items are most likely to interest the users. The output is a list of a pre-defined number of such items.
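As a hedged preview, the sketch below shows one possible model: user-based collaborative filtering with cosine similarity. It assumes a matrix that has already been mean-centered and filled with zeros, plus a record of which cells held real feedback. All data and names are hypothetical, and this is not necessarily the model the follow-up post will use.

```python
import math

items = ["jeans", "handbag", "belt", "shoes"]
# Hypothetical mean-centered, zero-filled matrix (one row per user).
normalized = {
    "u1": [1.0, 0.0, -1.0, 0.0],
    "u2": [1.0, 1.0, -2.0, 0.0],
    "u3": [-1.0, 0.0, 1.5, -0.5],
}
# Column indexes that held real feedback before the empty cells were filled.
rated = {"u1": {0, 2}, "u2": {0, 1, 2}, "u3": {0, 2, 3}}

def cosine(a, b):
    """Cosine similarity between two users' rows."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def predict(user, col):
    """Similarity-weighted average of other users' feedback on one item."""
    num = den = 0.0
    for other, row in normalized.items():
        if other == user or col not in rated[other]:
            continue
        sim = cosine(normalized[user], row)
        num += sim * row[col]
        den += abs(sim)
    return num / den if den else 0.0

def recommend(user, n):
    """Top-n items the user has not given feedback on, best first."""
    candidates = [c for c in range(len(items)) if c not in rated[user]]
    candidates.sort(key=lambda c: predict(user, c), reverse=True)
    return [items[c] for c in candidates[:n]]

print(recommend("u1", 2))
```

Here u1 resembles u2 and is the opposite of u3, so the handbag that u2 liked ranks above the shoes that u3 disliked only mildly.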
Evaluate the model
Like any other machine learning problem, the model’s performance needs to be assessed on a test dataset, based on which some model-specific parameter tuning can be performed to improve the recommendation results. Needless to say, different problems with their own data will need different evaluation methods.
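For rating-style feedback, one widely used metric is the root mean squared error (RMSE) between predicted and held-out ratings. The sketch below is only an example of such a metric, with made-up numbers; other problems may call for ranking metrics instead.

```python
import math

def rmse(predictions, held_out):
    """Root mean squared error over the held-out (user, item) pairs."""
    squared_errors = [
        (predictions[key] - actual) ** 2 for key, actual in held_out.items()
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical held-out ratings and the model's predictions for them.
held_out = {("u1", "handbag"): 4.0, ("u2", "belt"): 2.0}
predictions = {("u1", "handbag"): 3.0, ("u2", "belt"): 3.0}

print(rmse(predictions, held_out))
```

A lower RMSE means the predicted ratings are closer to what users actually gave, so it can guide parameter tuning between training rounds.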
In fact, in a practical deployment of an RS, we might consider other improvements depending on the system. For example:
- The list of N items can be adapted to the situation of each user.
- Items that the user has already purchased should not be recommended, or should be given low priority.
- The model needs to be updated every day with newly generated data.
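The second point, for instance, can be handled as a small post-processing step on the ranked list. The function below is a hypothetical sketch: it either drops purchased items or moves them to the end of the list before cutting it to N items.

```python
def postprocess(ranked_items, purchased, n, demote=False):
    """Drop already purchased items, or push them to the end if demote=True."""
    fresh = [item for item in ranked_items if item not in purchased]
    if demote:
        owned = [item for item in ranked_items if item in purchased]
        return (fresh + owned)[:n]
    return fresh[:n]

# Hypothetical model output (best first) and purchase history.
ranked = ["jeans", "belt", "handbag", "shoes"]
purchased = {"jeans"}

print(postprocess(ranked, purchased, 3))
print(postprocess(ranked, purchased, 4, demote=True))
```

Whether to drop or demote depends on the product: a consumable may be worth recommending again, while a one-off purchase usually is not.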
I would like to express my sincere thanks to Pham Van Toan for permission to translate his original post.
Originally published at https://emerging-it-technologies.blogspot.com.