You may also like...

Nayana Kumari
Web Mining [IS688, Spring 2021]
Apr 22, 2021
Visual similarities of clothing items

Similar content in Web-based data

Very often we see a set of products listed as “Items you may also like” or “Recommended for you” on eCommerce portals like Amazon, Walmart, Best Buy, and so on. By doing so, businesses offer a sense of customer attention and personalization, which usually yields higher customer loyalty. Businesses use various categorical data points to determine these similarities.

This intrigued me to take up the challenge of implementing such a recommendation system using what I have learned about similarity and ranking of elements in data.

Finding the right data was a challenge, as most product catalogs do not reveal ratings and reviews for an obvious reason: security. Making thousands of API calls to collate the data would have been a nightmare.

So, I decided to take a catalog of the products sold at Walmart. I found one such dataset for the year 2019. Since my objective is to model the recommendation system and not to connect to a live system, I decided to use this outdated data as my sample input.

Before we start extracting the dataset and processing it to model the recommendation system, let us revisit some of the core concepts of similarity metrics.

The codebase for this analysis can be found here:

https://github.com/nt27web/Similarity

Similarity metrics

Distances between data elements can be measured in various ways. The most widely used ones are Euclidean distance and cosine distance, popularly known as cosine similarity.

The diagram below illustrates both Euclidean distance and cosine similarity. ‘d’ represents the Euclidean distance and theta (θ) represents the cosine angle between two elements in a dataset.

Euclidean distance and Cosine similarity

Euclidean Distance

Euclidean distance is the length of the straight line between two points on a two-dimensional graph. It is calculated from the Cartesian coordinates of the two points using the Pythagorean theorem. This distance emphasizes dissimilarities more than similarities, so it is not the best model to use for my analysis.

Euclidean distance formula

A & B are two cartesian coordinates between which the distance is calculated and represented as ‘d’.

Cosine distance or Cosine Similarity

Cosine similarity formula

Cosine similarity is the measure of the cosine of the angle between two non-zero vectors (data elements with a direction). It is calculated as the dot product of the two vectors divided by the product of their lengths. The smaller the angle between them, the closer they are in terms of the cosine similarity metric. This focuses on similarity rather than distance, which works better for my analysis. For example, two products may be far apart under the Euclidean metric because of their prices, yet still be similar in terms of category and the zip codes where they are available to ship. So, I will use cosine similarity for my analysis.
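A minimal sketch of this contrast, using two made-up product vectors where one is simply a scaled-up (e.g. pricier) version of the other:

import numpy as np
import sklearn.metrics.pairwise as pw

# Two illustrative product vectors (made-up feature values); p2 points in the
# same direction as p1 but has ten times the magnitude.
p1 = np.array([[1.0, 2.0, 3.0]])
p2 = np.array([[10.0, 20.0, 30.0]])

# Euclidean distance is large because the magnitudes differ ...
print(pw.euclidean_distances(p1, p2))  # [[33.67...]]

# ... but cosine similarity is 1.0 because the vectors point the same way:
# cos(theta) = (p1 . p2) / (||p1|| * ||p2||)
print(pw.cosine_similarity(p1, p2))    # [[1.]]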

In this analysis, I am using Python and the following libraries with a CSV dataset:

pandas: for data manipulation and data sourcing.

numpy: for numeric operations (random number generator)

scipy: for sparse matrix operations.

sklearn: for finding the cosine similarity.

from scipy import sparse
import sklearn.metrics.pairwise as pw
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

Data Sourcing and preparation

The CSV dataset consists of the Walmart product list for 2019 and looks like the image below. It contains Unique ID, Product Name, Description, List Price, Brand, Postal Code, and Availability. It has about 30,000 records, which is sufficient for my analysis, but it needs a decent amount of data cleansing and manipulation, explained later in the article.

Walmart Product List in 2019

I extracted the data from the unprocessed CSV file using the below python code:

import pandas as pd

data = pd.read_csv('walmart_20191_data.csv')

The below columns were identified for the recommender system based on their relevance.

Columns: Product Name, List Price, Brand, Category

I split the dataset keeping only the above columns.

f_data = pd.DataFrame(data, columns=['Product Name', 'List Price', 'Brand', 'Category'])

Since the dataset does not have rating and review information, I added two columns: Rating and Total Reviews.

A random number generator was used to assign ratings (1–5) and review counts (0–100).

# Randomly assigned review counts (0-100) and ratings (1-5); np.random.randint's upper bound is exclusive
f_data['Total Reviews'] = np.random.randint(0, 101, size=len(f_data))
f_data['Rating'] = np.random.randint(1, 6, size=len(f_data))

I now have the dataset with the required columns. Let us filter it further:

# Select from f_data so the generated Rating and Total Reviews columns are retained
f_data = pd.DataFrame(f_data, columns=['Product Name', 'List Price', 'Category', 'Total Reviews', 'Rating'])

Total records so far: 30,000

The distribution of the ratings and list prices over categories is as follows:

Distribution of Ratings and List price

To keep low- or average-rated products out of the recommendations, I reduced the dataset to products whose rating is at or above the 60th percentile.

m = f_data['Rating'].quantile(0.6)
m_data = f_data[f_data['Rating'] >= m]

Total Records now: 15,006

To give the products a direction and magnitude, we need to vectorize the data elements. To do so, we define a pivot table, which is then converted into a sparse matrix. The sparse matrix vectorizes the data elements so that the differences in angle between the data points can be found. As the data points on which the similarity would be calculated, I chose the Rating and Category columns.

pivot_item_based = pd.pivot_table(m_data,
                                  index='Product Name',
                                  columns=['List Price'],
                                  values=['Rating', 'Category'])

Let’s check the pivot table:

print(pivot_item_based.head(10))

So, I need to fill the NaN values with zero and vectorize this dataset to find similarities.

Vectorize the data:

sparse_pivot = sparse.csr_matrix(pivot_item_based.fillna(0))

Now that the sparse matrix is ready, I can compute the cosine similarity/distance, which comes in the form of a 2-dimensional matrix with values between 0 and 1.

recommender = pw.cosine_similarity(sparse_pivot)
recommender_df = pd.DataFrame(recommender,
                              columns=pivot_item_based.index,
                              index=pivot_item_based.index)
print(recommender_df.head())

Our cosine similarity matrix is ready and looks like the table below, where the x-axis and y-axis represent the products and each number is their cosine similarity. Zero means no similarity and 1 means the highest degree of similarity; any value in between (0.5, 0.6, etc.) means some degree of similarity.
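To make the shape of this matrix concrete, here is a tiny sketch with three made-up items (not from the Walmart dataset); the real recommender_df is built the same way, just with thousands of products on each axis:

import pandas as pd
import sklearn.metrics.pairwise as pw
from scipy import sparse

# Three made-up product vectors with arbitrary feature values
toy = pd.DataFrame([[5.0, 0.0, 1.0],
                    [4.0, 0.0, 1.0],
                    [0.0, 3.0, 0.0]],
                   index=['Item A', 'Item B', 'Item C'])

toy_sim = pd.DataFrame(pw.cosine_similarity(sparse.csr_matrix(toy.values)),
                       index=toy.index, columns=toy.index)
print(toy_sim.round(2))
# Item A and Item B point in nearly the same direction (similarity ~1.0),
# while Item C shares no features with them (similarity 0.0).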

So, the recommender system model is ready. To test it, I will take one product as input and find similar products.

I tested with 3 input items (as query items); their respective results are below.

Pristine Blue Pristine Power Non-Chlorine Shock for Pools and Spas

I used the below code to pass the input to the model I created above.

## Product Rating Based Cosine Similarity
cosine_df = pd.DataFrame(
    recommender_df[
        'Pristine Blue Pristine Power Non-Chlorine Shock for Pools and Spas'].sort_values(ascending=False))
cosine_df.reset_index(level=0, inplace=True)
cosine_df.columns = ['Product Name', 'cosine_similarity']
print(cosine_df.head(10))

Result:

So, the similarity results returned by my recommendation system align with the category of the input/query item. This makes sense because I chose category as a similarity feature.

I will test with another product as a query item from a different category.

cosine_df = pd.DataFrame(
    recommender_df[
        'Tie-Me-Not Curly No-Tie Twister Shoelaces, 2 Pairs White'].sort_values(ascending=False))

cosine_df.reset_index(level=0, inplace=True)
cosine_df.columns = ['Product Name', 'cosine_similarity']
print(cosine_df.head(10))

Result:

Clearly, the similarity is again driven by category. In this case, sports items are selected as recommendations based on the input.

Let’s look at another item from a different category as input (query item).

cosine_df = pd.DataFrame(
    recommender_df[
        'Paragon Popcorn Butter Bags'].sort_values(ascending=False))

cosine_df.reset_index(level=0, inplace=True)
cosine_df.columns = ['Product Name', 'cosine_similarity']
print(cosine_df.head(10))

Result:

Here the results are kitchen and household items, matching the category of the input query item.
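The three lookups above repeat the same pattern, so a small helper function (hypothetical, not part of the original repository) could wrap it to query any product by name:

def recommend(product_name, top_n=10):
    """Return the top_n products most similar to product_name,
    based on the cosine similarity matrix built above."""
    scores = recommender_df[product_name].sort_values(ascending=False)
    result = scores.reset_index()
    result.columns = ['Product Name', 'cosine_similarity']
    return result.head(top_n)

# Example usage with one of the query items above:
print(recommend('Paragon Popcorn Butter Bags'))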

Conclusion:

The output shows the relevant products that can be recommended based on the category and rating of the products. As you can see, the above products are similar because they have similar ratings and belong to the same or a similar category of products in the dataset.

This analysis is particularly important in the eCommerce business because it helps businesses target customers based on their historical purchases and browsing behavior. Users are more likely to purchase a suggested product that is based on their browsing patterns. This also reduces the number of searches users must make to find a similar product they might want in the near future. Recommendation instills a sense of personalization and care, which ensures higher customer loyalty and greater conversion between adding items to the cart and checkout.

Limitations and further study:

My analysis was based on a representative dataset, but it is far from actual user ratings and product details. So, although the model is efficient enough to find similarities based on category and ratings, it lacks data related to user habits/behavior and purchase history, which is crucial in a recommender system such as this one.

This dataset is also based on randomly generated ratings for each product, so the ratings are uniformly distributed among products. That is the reason we could see only 4 similar products. An actual dataset will have a non-uniform distribution of ratings, with its own inherent biases, likely skewed towards a set of products (usually the best-sellers). So, the recommender system will find a bigger list to recommend from.
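As a possible next step, a skewed rating distribution could be simulated instead of the uniform one, for example with weighted sampling (a minimal sketch; the weights are made up, not taken from any real catalog):

import numpy as np

# Skew ratings towards 4 and 5 stars to mimic the bias real catalogs show
# towards best-sellers (the probabilities below are purely illustrative).
rng = np.random.default_rng(42)
f_data['Rating'] = rng.choice([1, 2, 3, 4, 5],
                              size=len(f_data),
                              p=[0.05, 0.10, 0.20, 0.35, 0.30])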
