
Introduction to Recommendation System — Part 2

Tuan Nguyen
CodeX


In my previous article, I introduced the basic concepts of a Recommendation System (RS) and two basic approaches to building one, namely Content-based and Collaborative Filtering. This time we dig into the implementation of a simple RS using the item-based Collaborative Filtering technique. The user-based one is left for the next article.

Collaborative Filtering (CF)

As mentioned in my previous article, CF is built on the assumption of shared interests among different individuals. You may think of the law of large numbers and theories of crowd behavior. In a nutshell, it is stated as follows:

If two users u1 and u2 review n similar items in the same way, or show similar behaviors toward them (watching, buying, listening), then they will likely review or behave the same way toward other items. That's why they say: tell me who your friends are, and I will tell you who you are.

While some things make it easy to determine whether two users are close, there are situations in which it is not that easy. For example, Facebook can suggest people who are from your hometown, went to the same school, or belong to the same group, since such profile-related information can be easily extracted (and is straightforwardly comparable). But for things like digital songs (or even classical music), pictures, or news, CF is the more suitable candidate.

The core concept that every RS needs to consider is the similarity between things. Apart from the matrix representing the association between users and items as explained previously, you need a mathematical tool to calculate the similarity (or, inversely, the distance) between objects, i.e., users or items. Intuitively, the distance can be seen as the inverse of the similarity, in the sense that:

  • The further apart two objects are, the more dissimilar they are and the bigger the “distance” between them is.
  • The more similar the objects are, the closer they are and the smaller the distance between them is.

In machine learning, there are several types of distance metrics: Euclidean, Manhattan, Minkowski, Cosine, Hamming, and Mahalanobis.

Euclidean — Manhattan — Minkowski Distance

I believe that you are familiar with the formula for Euclidean distance between 2 points with 2 dimensions:
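d(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2}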

For an n-dimensional space, correspondingly objects represented with n features, the formula is generalized into:
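d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}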

The Manhattan distance, in turn, is calculated based on the absolute differences along all the directions, i.e., the objects' features, given as:
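d(p, q) = \sum_{i=1}^{n} |q_i - p_i|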

As the generalized form of Euclidean and Manhattan distance, the formula for Minkowski Distance is
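d(p, q) = \left( \sum_{i=1}^{n} |q_i - p_i|^p \right)^{1/p}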

Note that for p at infinity, we get the so-called Chebyshev distance:
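d(p, q) = \max_{i} |q_i - p_i|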

Cosine Distance

The cosine distance metric is often used to find similarities between documents. Typically, we adopt this metric when the magnitude of the vectors does not matter but their orientation does. We measure the angle between two documents, or vectors containing the term frequencies.
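Formally, for two vectors A and B, the cosine similarity is the cosine of the angle between them, and the cosine distance is its complement:

\cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|}, \qquad d_{\cos}(A, B) = 1 - \cos\theta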

Mahalanobis Distance

According to the Wikipedia definition, the Mahalanobis distance is a measure of the distance between a point P and a distribution D. The idea is to measure how many standard deviations away P is from the mean of D.

The benefit of using the Mahalanobis distance is that it takes the covariance of the whole dataset into account when measuring the similarity between two different objects.
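For a point x and a distribution D with mean \mu and covariance matrix S, it is defined as:

d_M(x, D) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}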

Hamming distance

Unlike the Minkowski family, which deals with continuous or numerical data, the Hamming distance is a metric for comparing two strings of the same length. It is the number of positions at which the corresponding characters are different. For example, the Hamming distance between “apple” and “table” is 3 (the first three characters differ). The larger the distance between the two strings, the more dissimilar they are.
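As a quick sketch, here is a minimal Python implementation (the helper name hamming_distance is my own):

def hamming_distance(s1, s2):
    # The metric is only defined for strings of the same length
    if len(s1) != len(s2):
        raise ValueError('Strings must have the same length')
    # Count the positions where the corresponding characters differ
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance('apple', 'table'))  # 3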

Now that we have a basic overview of different distance metrics, let’s move to the next step, i.e., building a simple RS for music songs.

Music Recommendation System

We use the million songs dataset from here. This dataset consists of two kinds of data at the song level: tags and similar songs. We will need a data file that stores a weighted matrix representing which users listen to songs by which artists.

Data Load

The dataset has 1257 users and 285 artists (the first column holds the user id, the remaining 285 columns are artists), and the artists here can be considered as items. The data is loaded from the data.csv file with the pandas library.

import pandas as pd

if __name__ == '__main__':
    data = pd.read_csv('data.csv')
    print(data.shape)
    print(data.head(8).iloc[:, 0:6])

Here are the first 8 lines and 6 columns of the dataset:

>>> print(data.shape)
(1257, 286)
>>> print(data.head(8).iloc[:, 0:6])
   user  a perfect circle  abba  ac/dc  adam green  aerosmith
0     1                 0     0      0           0          0
1    33                 0     0      0           1          0
2    42                 0     0      0           0          0
3    51                 0     0      0           0          0
4    62                 0     0      0           0          0
5    75                 0     0      0           0          0
6   130                 0     0      0           0          0
7   141                 0     0      0           0          0
>>>

From this, we know that the data is actually a binary matrix, where 0 means the user has never listened to any song by the artist and 1 means they have listened to the artist at least once.

Data Preprocessing

Since the goal of this example is to show how to do item-based CF, we only need the data related to the items (artists, in this case); we don’t need the user information. So we simply drop the user column:

# --- Start Item Based Recommendations --- #
# Drop the "user" column; item-based CF only needs the item data
data_item_base = data.drop('user', axis=1)

Similarity Calculation

You need a data structure to store the similarity metrics. In this example, we use a square item-item matrix to represent the association between items, implemented with a DataFrame:

data_item_base_frame = pd.DataFrame(index=data_item_base.columns, columns=data_item_base.columns)
print(data_item_base_frame.head(6).iloc[:,0:6])

Then we get an initial matrix-like structure where the value for each pair of associated items is NaN.

                 a perfect circle  abba  ac/dc  adam green  aerosmith  afi
a perfect circle              NaN   NaN    NaN         NaN        NaN  NaN
abba                          NaN   NaN    NaN         NaN        NaN  NaN
ac/dc                         NaN   NaN    NaN         NaN        NaN  NaN
adam green                    NaN   NaN    NaN         NaN        NaN  NaN
aerosmith                     NaN   NaN    NaN         NaN        NaN  NaN
afi                           NaN   NaN    NaN         NaN        NaN  NaN

Now it’s time to calculate the distance using one of the distance types mentioned in the section above. Here we use the cosine distance to measure the similarity. We don’t need to care much about the details thanks to existing libraries like SciPy.

In the following code, we use two nested loops to calculate the similarity of each pair of items. Note that SciPy’s cosine() returns the cosine distance, so we compute 1 - cosine(...) to get the cosine similarity, which ranges in [0, 1] for our binary vectors; this way, the smaller the distance, the more similar the items. To improve performance, the matrix is stored in a file so as to avoid recalculating it again and again.

from scipy.spatial.distance import cosine

# Calculate the similarity for every pair of items
for i in range(len(data_item_base_frame.columns)):
    # Loop through the columns for each column
    for j in range(len(data_item_base_frame.columns)):
        # SciPy's cosine() returns the cosine distance,
        # so 1 - cosine(...) gives the cosine similarity
        data_item_base_frame.iloc[i, j] = 1 - cosine(data_item_base.iloc[:, i],
                                                     data_item_base.iloc[:, j])

# Save to a file for next-time usage
data_item_base_frame.to_csv('data_item_base_frame.csv', sep=',', encoding='utf-8')

After finishing the similarity calculation and updating the DataFrame in the variable data_item_base_frame, we come up with a matrix like the following:

>>> data_item_base_frame
                  a perfect circle      abba     ac/dc  adam green  aerosmith  ...   trivium        u2  underoath   volbeat  yann tiersen
a perfect circle               1.0  0.167317   0.14403    0.230294   0.181366  ...  0.159596  0.140394   0.163286  0.115202      0.142385
abba                      0.167317       1.0       0.0    0.017917   0.051554  ...  0.030359  0.111154   0.024398   0.06506      0.052164
ac/dc                      0.14403       0.0       1.0    0.052279   0.025071  ...  0.029527       0.0   0.094916       0.0      0.025367
adam green                0.230294  0.017917  0.052279         1.0   0.113154  ...       0.0  0.087131   0.122398    0.0204      0.130849
aerosmith                 0.181366  0.051554  0.025071    0.113154        1.0  ...  0.082169  0.025071   0.022011       0.0      0.023531
...                            ...       ...       ...         ...        ...  ...       ...       ...        ...       ...           ...
trivium                   0.159596  0.030359  0.029527         0.0   0.082169  ...       1.0  0.029527   0.077771       0.0           0.0
u2                        0.140394  0.111154       0.0    0.087131   0.025071  ...  0.029527       1.0   0.023729  0.126554      0.050735
underoath                 0.163286  0.024398  0.094916    0.122398   0.022011  ...  0.077771  0.023729        1.0       0.0      0.022272
volbeat                   0.115202   0.06506       0.0      0.0204        0.0  ...       0.0  0.126554        0.0       1.0           0.0
yann tiersen              0.142385  0.052164  0.025367    0.130849   0.023531  ...       0.0  0.050735   0.022272       0.0           1.0

Filtering

So far we have all the necessary information about similar items. What we do next is, for each item, filter out its n most similar items by using a loop as follows:

# Initialize a frame to save the closest neighbors to each item
data_neighbors = pd.DataFrame(index=data_item_base_frame.columns, columns=range(1, 11))
for i in range(len(data_item_base_frame.columns)):
    # Sort column i by similarity (descending) and keep the 10 nearest items
    data_neighbors.iloc[i, :10] = data_item_base_frame.iloc[:, i].sort_values(ascending=False)[:10].index

We set n = 10, meaning that we take the 10 items closest to the current item. Whenever a user chooses a certain song, the RS will check the variable data_neighbors and display the 10 most similar items to the user. Note that the first neighbor of each item is always the item itself (similarity 1.0), so a real system would skip it. Here is the result:

>>> data_neighbors
                                 1               2                 3  ...                  8                   9                 10
a perfect circle  a perfect circle    lostprophets   crystal castles  ...               mgmt               bjork         digitalism
abba                          abba       tori amos  dropkick murphys  ...     tegan and sara  panic at the disco         the smiths
ac/dc                        ac/dc      mando diao        schandmaul  ...  killswitch engage          guano apes            editors
adam green              adam green  regina spektor              mgmt  ...         digitalism             ramones            justice
aerosmith                aerosmith    the national       the subways  ...     the libertines       frank sinatra        the wombats
...                            ...             ...               ...  ...                ...                 ...                ...
trivium                    trivium  kelly clarkson        kanye west  ...             oomph!    a perfect circle     the libertines
u2                              u2     amon amarth               atb  ...      dream theater                 him            incubus
underoath                underoath      schandmaul              dido  ...    crystal castles          mando diao   the decemberists
volbeat                    volbeat      audioslave       alicia keys  ...            genesis      britney spears  the white stripes
yann tiersen          yann tiersen   amy macdonald           incubus  ...    subway to sally         david bowie   a perfect circle
[285 rows x 10 columns]
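As a quick usage sketch (the recommend helper below is my own and not part of the original code), a lookup against data_neighbors could look like this:

def recommend(artist, n=9):
    # Take the row of nearest neighbors for the artist and skip the
    # first entry, which is always the artist itself (similarity 1.0)
    return data_neighbors.loc[artist].iloc[1:n + 1].tolist()

print(recommend('abba'))  # ['tori amos', 'dropkick murphys', ...]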

You can check out the full code for this item-based CF system here:

import pandas as pd
from scipy.spatial.distance import cosine

if __name__ == '__main__':
    data = pd.read_csv('data.csv')

    # --- Start Item Based Recommendations --- #
    # Drop the "user" column; item-based CF only needs the item data
    data_item_base = data.drop('user', axis=1)

    # Item-item DataFrame to store the similarity matrix
    data_item_base_frame = pd.DataFrame(index=data_item_base.columns,
                                        columns=data_item_base.columns)

    # Calculate the similarity for every pair of items
    for i in range(len(data_item_base_frame.columns)):
        for j in range(len(data_item_base_frame.columns)):
            # 1 - cosine distance = cosine similarity
            data_item_base_frame.iloc[i, j] = 1 - cosine(data_item_base.iloc[:, i],
                                                         data_item_base.iloc[:, j])

    # Save to a file for next-time usage
    data_item_base_frame.to_csv('data_item_base_frame.csv', sep=',', encoding='utf-8')
    # data_item_base_frame = pd.read_csv('data_item_base_frame.csv', index_col=0)
    print(data_item_base_frame.head(6).iloc[:, 0:5])

    # Initialize a frame to save the closest neighbors to each item
    data_neighbors = pd.DataFrame(index=data_item_base_frame.columns,
                                  columns=range(1, 11))
    for i in range(len(data_item_base_frame.columns)):
        # Sort column i by similarity (descending) and keep the 10 nearest items
        data_neighbors.iloc[i, :10] = data_item_base_frame.iloc[:, i].sort_values(ascending=False)[:10].index

In my next article, I will explain how to implement a user-based CF system.

Acknowledgments

I would like to send my big thanks to Pham Van Toan for giving me permission to translate his original post.

Originally published at https://techsharing21.com on February 3, 2021.
