
# Introduction to Recommendation System — Part 2

In my previous article, I introduced the basic concepts of a Recommendation System (RS) and two basic approaches to building one, namely Content-based and Collaborative Filtering. This time we dig into the implementation of a simple RS using the Collaborative Filtering approach with the item-based technique. The user-based one is left for the next article.

# Collaborative Filtering (CF)

As mentioned in my previous article, CF is built on the assumption of shared interests among different individuals. You may think of the law of large numbers and theories of crowd behavior. In a nutshell, it is stated as follows:

> If two users u1 and u2 review n similar items, or exhibit close behaviors on them such as watching, buying, or listening, then they will likely have the same reviews or behaviors for other items.

That's why they say:

*Tell me who your friends are, and I will tell you who you are.*

While for some things it is easy to determine whether two users are close, there are situations in which it is not that easy. For example, Facebook can suggest people who come from your hometown, went to the same school, or belong to the same group, since such profile-related information can easily be extracted (and is straightforwardly comparable). But for things like digital songs (or even classical music), pictures, or news, CF is the more preferable candidate.

The core concept that every RS needs to consider is the similarity between things. Apart from the matrix representing the association between users and items, as explained previously, you need a mathematical tool to calculate the similarity (*a.k.a.* the distance) between objects, i.e. users or items. Intuitively, the distance can be seen as the inverse of the similarity in the sense that:

- The further apart two objects are, the more dissimilar they are and the bigger the **distance** between them is.
- The more **similar** the objects are, the closer they are and the smaller the **distance** between them is.

In machine learning, there are several common distance metrics: Euclidean, Manhattan, and Minkowski; Cosine; Hamming; and Mahalanobis.

## Euclidean — Manhattan — Minkowski Distance

I believe that you are familiar with the formula for the Euclidean distance between 2 points in 2 dimensions:

`d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2)`

For an n-dimensional space, where objects are correspondingly represented with n features, the formula is generalized into:

`d(x, y) = sqrt(sum_i (xi - yi)^2), i = 1..n`

The Manhattan metric is calculated based on the absolute differences along all the directions, or objects' features, given as:

`d(x, y) = sum_i |xi - yi|, i = 1..n`

As the generalized form of the Euclidean and Manhattan distances, the formula for the Minkowski distance is:

`d(x, y) = (sum_i |xi - yi|^p)^(1/p), i = 1..n`

Note that for p at infinity, we obtain the so-called Chebyshev distance, `d(x, y) = max_i |xi - yi|`.
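As a quick sketch of how the members of this family relate, here are the same two feature vectors run through SciPy's distance functions (the vectors are made-up values for illustration):

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, minkowski, chebyshev

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 6.0, 3.0])   # coordinate differences: (-3, -4, 0)

print(euclidean(u, v))        # 5.0  -> sqrt(9 + 16 + 0), Minkowski with p = 2
print(cityblock(u, v))        # 7.0  -> 3 + 4 + 0, Minkowski with p = 1 (Manhattan)
print(minkowski(u, v, p=3))   # ~4.498 -> (27 + 64) ** (1/3)
print(chebyshev(u, v))        # 4.0  -> max(3, 4, 0), the p -> infinity limit
```

Notice how the distance shrinks toward the largest single coordinate difference as p grows.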

## Cosine Distance

The cosine distance metric is often used to find similarities between documents. Typically, we adopt this metric when the magnitude of the vectors does not matter but their orientation does. We measure the angle between two documents, i.e. vectors containing the term frequencies: the cosine distance is `1 - cos(θ)`, where θ is the angle between the two vectors.
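A tiny sketch of this property with SciPy's `cosine` function (the term-frequency vectors are invented for illustration): scaling a vector does not change its cosine distance to the original, while a vector pointing in a different direction gets a positive distance.

```python
import numpy as np
from scipy.spatial.distance import cosine

doc_a = np.array([2.0, 1.0, 0.0, 3.0])   # term frequencies of a document
doc_b = 3 * doc_a                        # same orientation, 3x the magnitude
doc_c = np.array([0.0, 3.0, 4.0, 0.0])   # different orientation

print(cosine(doc_a, doc_b))  # ~0.0 -> magnitude is ignored
print(cosine(doc_a, doc_c))  # ~0.84 -> the angle, not the length, drives the value
```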

## Mahalanobis Distance

According to the Wikipedia definition, the Mahalanobis distance is a measure of the distance between a point P and a distribution D. The idea is to measure how many standard deviations away P is from the mean of D.

The benefit of using the Mahalanobis distance is that it takes the covariance of the whole dataset into account when measuring the similarity between two different objects.
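A minimal sketch with SciPy (the dataset and query point below are made up for illustration): the inverse covariance matrix `VI` is what injects the dataset-wide correlation structure into the distance.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Hypothetical 2-D dataset with strongly correlated features
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 0.9 * x + rng.normal(scale=0.1, size=200)])

VI = np.linalg.inv(np.cov(data.T))   # inverse covariance of the whole dataset
point = np.array([1.0, 1.0])
mean = data.mean(axis=0)

# "How many standard deviations is the point from the mean of D?"
print(mahalanobis(point, mean, VI))
```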

## Hamming Distance

Unlike Minkowski, which deals with continuous or numerical data, the Hamming distance is a metric for comparing two strings of the same length. It is the number of positions at which the corresponding characters differ. For example, the Hamming distance between "apple" and "table" is 3. The larger the distance between the two strings, the more dissimilar they are.
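Counting the mismatching positions directly takes only a few lines; this small helper (written here for illustration) does exactly that:

```python
def hamming_distance(s1: str, s2: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_distance("apple", "table"))  # 3 -> 'app' vs 'tab' differ, 'le' matches
```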

Now that we have a basic overview of different distance metrics, let's move to the next step, i.e. building a simple RS for music songs.

# Music Recommendation System

We use the million-song dataset from here. This dataset consists of two kinds of data at the song level: tags and similar songs. We will need a data file storing a weighted matrix that represents which users listen to songs by which artists.

## Data Load

The dataset has 1257 users and 285 artists, and the artists here can be considered as items. The data is loaded from the *data.csv* file with the *pandas* library.

```python
import pandas as pd

if __name__ == '__main__':
    data = pd.read_csv('data.csv')
    print(data.shape)
    print(data.head(8).iloc[:, 0:6])
```

Here are the first 8 lines and 6 columns of the dataset:

```
>>> print(data.shape)
(1257, 286)
>>> print(data.head(8).iloc[:, 0:6])
   user  a perfect circle  abba  ac/dc  adam green  aerosmith
0     1                 0     0      0           0          0
1    33                 0     0      0           1          0
2    42                 0     0      0           0          0
3    51                 0     0      0           0          0
4    62                 0     0      0           0          0
5    75                 0     0      0           0          0
6   130                 0     0      0           0          0
7   141                 0     0      0           0          0
```

From this, we know that the data is actually a binary matrix, where **0** means that the user has never listened to any song by the artist and **1** means they have listened to the artist at least once.

## Data Preprocessing

Since the goal of this example is to show how to do item-based CF, we only need the data related to the **items** (the **artists** in this case); we don't need the user information. So we simply drop the **user** column:

```python
# --- Start Item Based Recommendations --- #
# Drop the "user" column
data_item_base = data.drop(columns='user')
```

## Similarity Calculating

You need a data structure to store the similarity metrics. In this example, we use a square items × items matrix to represent the association between items, implemented as a **DataFrame**:

```python
data_item_base_frame = pd.DataFrame(index=data_item_base.columns,
                                    columns=data_item_base.columns)
print(data_item_base_frame.head(6).iloc[:, 0:6])
```

Then we get an initial matrix-like structure where all the values for each pair of associated items are NaN:

```
                  a perfect circle  abba  ac/dc  adam green  aerosmith  afi
a perfect circle               NaN   NaN    NaN         NaN        NaN  NaN
abba                           NaN   NaN    NaN         NaN        NaN  NaN
ac/dc                          NaN   NaN    NaN         NaN        NaN  NaN
adam green                     NaN   NaN    NaN         NaN        NaN  NaN
aerosmith                      NaN   NaN    NaN         NaN        NaN  NaN
afi                            NaN   NaN    NaN         NaN        NaN  NaN
```

Now it's time to calculate the distance according to one of the distance types mentioned in the above section. Here we use the **cosine** distance to measure the similarity. We don't need to care much about the details thanks to existing libraries like SciPy.

In the following code, we use 2 nested loops to calculate the similarity of each pair of items. Note that SciPy's `cosine` function returns the cosine *distance*, so we store `1 - cosine(...)`, i.e. the cosine similarity, which for our non-negative data ranges in [0, 1], where a larger value means the items are more similar. To improve performance, this matrix is stored in a file so as to avoid re-calculating it again and again.

```python
from scipy.spatial.distance import cosine

# Calculate similarity for every pair of items
for i in range(0, len(data_item_base_frame.columns)):
    # Loop through the columns for each column
    for j in range(0, len(data_item_base_frame.columns)):
        # Similarity = 1 - cosine distance
        data_item_base_frame.iloc[i, j] = 1 - cosine(data_item_base.iloc[:, i],
                                                     data_item_base.iloc[:, j])

# Save to a file for later usage
data_item_base_frame.to_csv('data_item_base_frame.csv', sep=',', encoding='utf-8')
```

After finishing the similarity calculation and updating the **DataFrame** stored in `data_item_base_frame`, we come up with a matrix like the following:

```
>>> data_item_base_frame
                  a perfect circle      abba     ac/dc  adam green  aerosmith  ...   trivium        u2  underoath   volbeat  yann tiersen
a perfect circle          1.000000  0.167317  0.144030    0.230294   0.181366  ...  0.159596  0.140394   0.163286  0.115202      0.142385
abba                      0.167317  1.000000  0.000000    0.017917   0.051554  ...  0.030359  0.111154   0.024398  0.065060      0.052164
ac/dc                     0.144030  0.000000  1.000000    0.052279   0.025071  ...  0.029527  0.000000   0.094916  0.000000      0.025367
adam green                0.230294  0.017917  0.052279    1.000000   0.113154  ...  0.000000  0.087131   0.122398  0.020400      0.130849
aerosmith                 0.181366  0.051554  0.025071    0.113154   1.000000  ...  0.082169  0.025071   0.022011  0.000000      0.023531
...                            ...       ...       ...         ...        ...  ...       ...       ...        ...       ...           ...
trivium                   0.159596  0.030359  0.029527    0.000000   0.082169  ...  1.000000  0.029527   0.077771  0.000000      0.000000
u2                        0.140394  0.111154  0.000000    0.087131   0.025071  ...  0.029527  1.000000   0.023729  0.126554      0.050735
underoath                 0.163286  0.024398  0.094916    0.122398   0.022011  ...  0.077771  0.023729   1.000000  0.000000      0.022272
volbeat                   0.115202  0.065060  0.000000    0.020400   0.000000  ...  0.000000  0.126554   0.000000  1.000000      0.000000
yann tiersen              0.142385  0.052164  0.025367    0.130849   0.023531  ...  0.000000  0.050735   0.022272  0.000000      1.000000
```
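One small caveat when reloading the cached CSV later (sketched below with a toy 2 × 2 matrix standing in for the real one): pass `index_col=0` to `read_csv`, otherwise the artist names come back as an ordinary first column instead of the row index.

```python
import pandas as pd

# Toy similarity matrix standing in for data_item_base_frame
sim = pd.DataFrame([[1.0, 0.2], [0.2, 1.0]],
                   index=['abba', 'ac/dc'], columns=['abba', 'ac/dc'])
sim.to_csv('data_item_base_frame.csv', sep=',', encoding='utf-8')

# index_col=0 restores the item names as the row index
reloaded = pd.read_csv('data_item_base_frame.csv', index_col=0)
print(reloaded.loc['abba', 'ac/dc'])  # 0.2
```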

## Filtering

So far we have all the necessary information about similar items. What we do next is, for each item, filter out its `n` most similar items using a loop as follows:

```python
# Initialize a frame to save the closest neighbors for each item
data_neighbors = pd.DataFrame(index=data_item_base_frame.columns,
                              columns=range(1, 11))

for i in range(0, len(data_item_base_frame.columns)):
    data_neighbors.iloc[i, :10] = \
        data_item_base_frame.iloc[0:, i].sort_values(ascending=False)[:10].index
```

We set n = 10, meaning that we take the 10 items closest to the current item. Whenever a user chooses a certain song, the RS will check the variable `data_neighbors` and display 10 similar songs to the user. Note that the first neighbor of each item is the item itself, since its self-similarity is always 1.0 and thus sorts first. Here is the result:

```
>>> data_neighbors
                                 1               2                 3  ...                  8                   9                 10
a perfect circle  a perfect circle    lostprophets   crystal castles  ...               mgmt               bjork         digitalism
abba                          abba       tori amos  dropkick murphys  ...     tegan and sara  panic at the disco         the smiths
ac/dc                        ac/dc      mando diao        schandmaul  ...  killswitch engage          guano apes            editors
adam green              adam green  regina spektor              mgmt  ...         digitalism             ramones            justice
aerosmith                aerosmith    the national       the subways  ...     the libertines       frank sinatra        the wombats
...                            ...             ...               ...  ...                ...                 ...                ...
trivium                    trivium  kelly clarkson        kanye west  ...             oomph!    a perfect circle     the libertines
u2                              u2     amon amarth               atb  ...      dream theater                 him            incubus
underoath                underoath      schandmaul              dido  ...    crystal castles          mando diao   the decemberists
volbeat                    volbeat      audioslave       alicia keys  ...            genesis      britney spears  the white stripes
yann tiersen          yann tiersen   amy macdonald           incubus  ...    subway to sally         david bowie   a perfect circle

[285 rows x 10 columns]
```
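At serving time, the lookup itself is a one-liner. The helper below is a hypothetical wrapper (not part of the original code) that reads one row of `data_neighbors` and skips the first entry, which is always the item itself:

```python
import pandas as pd

def recommend(data_neighbors: pd.DataFrame, artist: str, n: int = 9) -> list:
    """Return up to n artists similar to `artist`, skipping the first
    neighbor, which is the artist itself (self-similarity is 1.0)."""
    return data_neighbors.loc[artist].tolist()[1:n + 1]

# Tiny mock of the neighbor table, for illustration only
mock = pd.DataFrame([['abba', 'tori amos', 'dropkick murphys']],
                    index=['abba'], columns=range(1, 4))
print(recommend(mock, 'abba', n=2))  # ['tori amos', 'dropkick murphys']
```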

You can check out the full code for this item-based CF system here:

```python
import pandas as pd
from scipy.spatial.distance import cosine

if __name__ == '__main__':
    data = pd.read_csv('data.csv')

    # --- Start Item Based Recommendations --- #
    # Drop the "user" column
    data_item_base = data.drop(columns='user')

    # Square items x items DataFrame to store the similarities
    data_item_base_frame = pd.DataFrame(index=data_item_base.columns,
                                        columns=data_item_base.columns)

    # Calculate similarity for every pair of items
    for i in range(0, len(data_item_base_frame.columns)):
        # Loop through the columns for each column
        for j in range(0, len(data_item_base_frame.columns)):
            # Similarity = 1 - cosine distance
            data_item_base_frame.iloc[i, j] = 1 - cosine(data_item_base.iloc[:, i],
                                                         data_item_base.iloc[:, j])

    # Save to a file for later usage
    data_item_base_frame.to_csv('data_item_base_frame.csv', sep=',', encoding='utf-8')
    # data_item_base_frame = pd.read_csv('data_item_base_frame.csv', index_col=0)

    print(data_item_base_frame.head(6).iloc[:, 0:5])

    # Initialize a frame to save the closest neighbors for each item
    data_neighbors = pd.DataFrame(index=data_item_base_frame.columns,
                                  columns=range(1, 11))
    for i in range(0, len(data_item_base_frame.columns)):
        data_neighbors.iloc[i, :10] = \
            data_item_base_frame.iloc[0:, i].sort_values(ascending=False)[:10].index
```

In my next article, I will explain how to implement a user-based CF system.

# Acknowledgment

I would like to send my big thanks to Pham Van Toan for the permission to translate his original post.

*Originally published at **https://techsharing21.com** on February 3, 2021.*