
Introduction to Recommendation System — Part 2

Tuan Nguyen
CodeX


In my previous article, I introduced the basic concepts of a Recommendation System (RS) and two basic approaches to building one, namely Content-based and Collaborative Filtering. This time we dig into the implementation of a simple RS using the item-based Collaborative Filtering technique. The user-based one is left for the next article.

Collaborative Filtering (CF)

As mentioned in my previous article, CF is built on the assumption of shared interests among different individuals. You may think of the law of large numbers and theories of crowd behavior. In a nutshell, it is stated as follows:

If two users u1 and u2 review n similar items in the same way, or show similar behaviors toward them (watching, buying, listening), then they will likely review or behave the same way toward other items. That's why they say: tell me who your friends are, and I will tell you who you are.

While some things make it easy to determine whether two users are close, there are situations in which it is not that easy. For example, Facebook can suggest people who are from your hometown, went to the same school, or belong to the same group, since such profile-related information can be easily extracted (and is straightforwardly comparable). But for things like digital songs (or even classical music), pictures, or news, CF is the more suitable candidate.

The core concept that every RS needs to consider is the similarity between things. Apart from the matrix representing the association between users and items as explained previously, you need a mathematical tool to calculate the similarity (or, inversely, the distance) between objects, i.e., users or items. Intuitively, the distance can be seen as the inverse of the similarity, in the sense that:

  • The further apart two objects are, the more dissimilar they are and the bigger the “distance” between them is.
  • The more similar the objects are, the closer they are and the smaller the distance between them is.

In machine learning, there are several types of distance metrics: Euclidean, Manhattan, Minkowski, Cosine, Hamming, and Mahalanobis.

Euclidean — Manhattan — Minkowski Distance

I believe that you are familiar with the formula for Euclidean distance between 2 points with 2 dimensions:
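d(p, q) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2}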

For an n-dimensional space, correspondingly objects represented with n features, the formula is generalized into:
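d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}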

The Manhattan distance, in turn, is calculated based on the absolute differences along all the directions, i.e., the objects' features, given as:
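d(p, q) = \sum_{i=1}^{n} |q_i - p_i|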

As the generalized form of Euclidean and Manhattan distance, the formula for Minkowski Distance is
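d(p, q) = \left( \sum_{i=1}^{n} |q_i - p_i|^p \right)^{1/p}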

Note that for p at infinity, we get the so-called Chebyshev distance:
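d(p, q) = \max_{i} |q_i - p_i|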

Cosine Distance

The cosine distance metric is often used to find similarities between documents. Typically, we adopt this metric when the magnitude of the vectors does not matter but their orientation does. We measure the angle between two documents, or vectors containing the term frequencies.
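Formally, for two vectors A and B, the cosine similarity is the cosine of the angle between them, and the cosine distance is its complement:

\cos\theta = \frac{A \cdot B}{\|A\| \, \|B\|}, \qquad d_{\cos}(A, B) = 1 - \cos\theta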

Mahalanobis Distance

According to the Wikipedia definition, the Mahalanobis distance is a measure of the distance between a point P and a distribution D. The idea is to measure how many standard deviations away P is from the mean of D.

The benefit of using the Mahalanobis distance is that it takes the covariance of the whole dataset into account when measuring the similarity between two different objects.
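For a point x and a distribution D with mean \mu and covariance matrix S, it is defined as:

d_M(x, D) = \sqrt{(x - \mu)^T S^{-1} (x - \mu)}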

Hamming distance

Unlike the Minkowski family, which deals with continuous or numerical data, the Hamming distance is a metric for comparing two strings of the same length. It is the number of positions at which the corresponding characters are different. For example, the Hamming distance between “apple” and “table” is 3 (the first three characters differ). The larger the distance between the two strings, the more dissimilar they are.
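As a quick sketch, here is a minimal Python implementation (the helper name hamming_distance is my own):

def hamming_distance(s1, s2):
    # The metric is only defined for strings of the same length
    if len(s1) != len(s2):
        raise ValueError('Strings must have the same length')
    # Count the positions where the corresponding characters differ
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance('apple', 'table'))  # 3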

Now that we have a basic overview of different distance metrics, let’s move to the next step, i.e., building a simple RS for music songs.

Music Recommendation System

We use the million songs dataset from here. This dataset consists of two kinds of data at the song level: tags and similar songs. We will need a data file that stores a weighted matrix representing which users listen to songs by which artists.

Data Load

The dataset has 1257 users and 285 artists (the first column holds the user id, the remaining 285 columns are artists), and the artists here can be considered as items. The data is loaded from the data.csv file with the pandas library.

import pandas as pd

if __name__ == '__main__':
    data = pd.read_csv('data.csv')
    print(data.shape)
    print(data.head(8).iloc[:, 0:6])

Here are the first 8 lines and 6 columns of the dataset:

>>> print(data.shape)
(1257, 286)
>>> print(data.head(8).iloc[:, 0:6])
   user  a perfect circle  abba  ac/dc  adam green  aerosmith
0     1                 0     0      0           0          0
1    33                 0     0      0           1          0
2    42                 0     0      0           0          0
3    51                 0     0      0           0          0
4    62                 0     0      0           0          0
5    75                 0     0      0           0          0
6   130                 0     0      0           0          0
7   141                 0     0      0           0          0
>>>

From this, we know that the data is actually a binary matrix, where 0 means the user has never listened to any song by the artist and 1 means they have listened to the artist at least once.

Data Preprocessing

Since the goal of this example is to show how to do item-based CF, we only need the data related to the items (artists, in this case); we don’t need the user information. So we simply drop the user column:

# --- Start Item Based Recommendations --- #
# Drop the "user" column; item-based CF only needs the item data
data_item_base = data.drop('user', axis=1)

Similarity Calculation

You need a data structure to store the similarity metrics. In this example, we use a square item-item matrix to represent the association between items, implemented with a DataFrame:

data_item_base_frame = pd.DataFrame(index=data_item_base.columns, columns=data_item_base.columns)
print(data_item_base_frame.head(6).iloc[:,0:6])

Then we get an initial matrix-like structure where the value for each pair of associated items is NaN.

                 a perfect circle  abba  ac/dc  adam green  aerosmith  afi
a perfect circle              NaN   NaN    NaN         NaN        NaN  NaN
abba                          NaN   NaN    NaN         NaN        NaN  NaN
ac/dc                         NaN   NaN    NaN         NaN        NaN  NaN
adam green                    NaN   NaN    NaN         NaN        NaN  NaN
aerosmith                     NaN   NaN    NaN         NaN        NaN  NaN
afi                           NaN   NaN    NaN         NaN        NaN  NaN

Now it’s time to calculate the distance using one of the distance types mentioned in the section above. Here we use the cosine distance to measure the similarity. We don’t need to care much about the details thanks to existing libraries like SciPy.

In the following code, we use two nested loops to calculate the similarity of each pair of items. Note that SciPy’s cosine() returns the cosine distance, so we compute 1 - cosine(...) to get the cosine similarity, which ranges in [0, 1] for our binary vectors; this way, the smaller the distance, the more similar the items. To improve performance, the matrix is stored in a file so as to avoid recalculating it again and again.

from scipy.spatial.distance import cosine

# Calculate the similarity for every pair of items
for i in range(len(data_item_base_frame.columns)):
    # Loop through the columns for each column
    for j in range(len(data_item_base_frame.columns)):
        # SciPy's cosine() returns the cosine distance,
        # so 1 - cosine(...) gives the cosine similarity
        data_item_base_frame.iloc[i, j] = 1 - cosine(data_item_base.iloc[:, i],
                                                     data_item_base.iloc[:, j])

# Save to a file for next-time usage
data_item_base_frame.to_csv('data_item_base_frame.csv', sep=',', encoding='utf-8')

After finishing the similarity calculation and updating the DataFrame in the variable data_item_base_frame, we come up with a matrix like the following:

>>> data_item_base_frame
                  a perfect circle      abba     ac/dc  adam green  aerosmith  ...   trivium        u2  underoath   volbeat  yann tiersen
a perfect circle               1.0  0.167317   0.14403    0.230294   0.181366  ...  0.159596  0.140394   0.163286  0.115202      0.142385
abba                      0.167317       1.0       0.0    0.017917   0.051554  ...  0.030359  0.111154   0.024398   0.06506      0.052164
ac/dc                      0.14403       0.0       1.0    0.052279   0.025071  ...  0.029527       0.0   0.094916       0.0      0.025367
adam green                0.230294  0.017917  0.052279         1.0   0.113154  ...       0.0  0.087131   0.122398    0.0204      0.130849
aerosmith                 0.181366  0.051554  0.025071    0.113154        1.0  ...  0.082169  0.025071   0.022011       0.0      0.023531
...                            ...       ...       ...         ...        ...  ...       ...       ...        ...       ...           ...
trivium                   0.159596  0.030359  0.029527         0.0   0.082169  ...       1.0  0.029527   0.077771       0.0           0.0
u2                        0.140394  0.111154       0.0    0.087131   0.025071  ...  0.029527       1.0   0.023729  0.126554      0.050735
underoath                 0.163286  0.024398  0.094916    0.122398   0.022011  ...  0.077771  0.023729        1.0       0.0      0.022272
volbeat                   0.115202   0.06506       0.0      0.0204        0.0  ...       0.0  0.126554        0.0       1.0           0.0
yann tiersen              0.142385  0.052164  0.025367    0.130849   0.023531  ...       0.0  0.050735   0.022272       0.0           1.0

Filtering

So far we have all the necessary information about similar items. What we do next is, for each item, filter out its n most similar items by using a loop as follows:

# Initialize a frame to save the closest neighbors to each item
data_neighbors = pd.DataFrame(index=data_item_base_frame.columns, columns=range(1, 11))
for i in range(len(data_item_base_frame.columns)):
    # Sort column i by similarity (descending) and keep the 10 nearest items
    data_neighbors.iloc[i, :10] = data_item_base_frame.iloc[:, i].sort_values(ascending=False)[:10].index

We set n = 10, meaning that we take the 10 items closest to the current item. Whenever a user chooses a certain song, the RS will check the variable data_neighbors and display the 10 most similar items to the user. Note that the first neighbor of each item is always the item itself (similarity 1.0), so a real system would skip it. Here is the result:

>>> data_neighbors
                                 1               2                 3  ...                  8                   9                 10
a perfect circle  a perfect circle    lostprophets   crystal castles  ...               mgmt               bjork         digitalism
abba                          abba       tori amos  dropkick murphys  ...     tegan and sara  panic at the disco         the smiths
ac/dc                        ac/dc      mando diao        schandmaul  ...  killswitch engage          guano apes            editors
adam green              adam green  regina spektor              mgmt  ...         digitalism             ramones            justice
aerosmith                aerosmith    the national       the subways  ...     the libertines       frank sinatra        the wombats
...                            ...             ...               ...  ...                ...                 ...                ...
trivium                    trivium  kelly clarkson        kanye west  ...             oomph!    a perfect circle     the libertines
u2                              u2     amon amarth               atb  ...      dream theater                 him            incubus
underoath                underoath      schandmaul              dido  ...    crystal castles          mando diao   the decemberists
volbeat                    volbeat      audioslave       alicia keys  ...            genesis      britney spears  the white stripes
yann tiersen          yann tiersen   amy macdonald           incubus  ...    subway to sally         david bowie   a perfect circle
[285 rows x 10 columns]
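As a quick usage sketch (the recommend helper below is my own and not part of the original code), a lookup against data_neighbors could look like this:

def recommend(artist, n=9):
    # Take the row of nearest neighbors for the artist and skip the
    # first entry, which is always the artist itself (similarity 1.0)
    return data_neighbors.loc[artist].iloc[1:n + 1].tolist()

print(recommend('abba'))  # ['tori amos', 'dropkick murphys', ...]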

You can check out the full code for this item-based CF system here:

import pandas as pd
from scipy.spatial.distance import cosine

if __name__ == '__main__':
    data = pd.read_csv('data.csv')

    # --- Start Item Based Recommendations --- #
    # Drop the "user" column; item-based CF only needs the item data
    data_item_base = data.drop('user', axis=1)

    # Item-item DataFrame to store the similarity matrix
    data_item_base_frame = pd.DataFrame(index=data_item_base.columns,
                                        columns=data_item_base.columns)

    # Calculate the similarity for every pair of items
    for i in range(len(data_item_base_frame.columns)):
        for j in range(len(data_item_base_frame.columns)):
            # 1 - cosine distance = cosine similarity
            data_item_base_frame.iloc[i, j] = 1 - cosine(data_item_base.iloc[:, i],
                                                         data_item_base.iloc[:, j])

    # Save to a file for next-time usage
    data_item_base_frame.to_csv('data_item_base_frame.csv', sep=',', encoding='utf-8')
    # data_item_base_frame = pd.read_csv('data_item_base_frame.csv', index_col=0)
    print(data_item_base_frame.head(6).iloc[:, 0:5])

    # Initialize a frame to save the closest neighbors to each item
    data_neighbors = pd.DataFrame(index=data_item_base_frame.columns,
                                  columns=range(1, 11))
    for i in range(len(data_item_base_frame.columns)):
        # Sort column i by similarity (descending) and keep the 10 nearest items
        data_neighbors.iloc[i, :10] = data_item_base_frame.iloc[:, i].sort_values(ascending=False)[:10].index

In my next article, I will explain how to implement a user-based CF system.

Acknowledgments

I would like to send my big thanks to Pham Van Toan for giving me permission to translate his original post.

Originally published at https://techsharing21.com on February 3, 2021.
