Measurement Problems — 2
There are some measurement problems we may encounter when we want to sort our products in the most accurate and objective way for customers.
- Rating Products
- Sorting Products
- Sorting Reviews
Here we will look at the second of these topics, sorting products, which shapes how we evaluate the options in front of us when purchasing a product.
In the first part of this series, we covered rating products. If you want to know what affects us when buying a product and which methods we can use to calculate product ratings, I leave the link to the first article of the series here.
Let’s start!
- Sorting Products
The sorting problem can be encountered not only for products but also in many different situations.
Let’s say there is a job posting and a list of applicants; by which criteria should we sort the candidates?
In this case, for example, we can give different weights to the school score and the exam score and sort by the weighted total.
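To make the candidate example concrete, here is a minimal sketch in plain Python. The candidates, their scores, and the 40/60 weights are all made up for illustration; they are not from any real posting.

```python
# Hypothetical candidates; names, scores, and weights are illustrative only.
candidates = [
    {"name": "A", "school_score": 90, "exam_score": 60},
    {"name": "B", "school_score": 70, "exam_score": 95},
    {"name": "C", "school_score": 80, "exam_score": 80},
]

SCHOOL_WEIGHT = 40  # percent (assumed)
EXAM_WEIGHT = 60    # percent (assumed)


def candidate_score(candidate):
    """Weighted total of the two scores, with weights given in percent."""
    return (candidate["school_score"] * SCHOOL_WEIGHT +
            candidate["exam_score"] * EXAM_WEIGHT) / 100


# Highest weighted score first
ranked = sorted(candidates, key=candidate_score, reverse=True)
for candidate in ranked:
    print(candidate["name"], candidate_score(candidate))
```

With these assumed weights, candidate B comes first even though candidate A has the best school score. Changing the weights changes the order, which is exactly the decision we have to make for products too.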
By what criteria should the online courses offered on a website be sorted?
Which parameter should be considered here: the rating of the course, the number of comments made for the course, or the number of purchases of the course?
Let’s evaluate each of these parameters!
*Sorting by Rating
Install and Imports
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', 1000)
pd.set_option('display.expand_frame_repr', True)
pd.set_option('display.float_format', lambda x: '%.5f' % x)
Reading the Data
df = pd.read_csv(r'C:\product_sorting.csv')
df.head(5)
df['rating'].mean()
df.sort_values('rating', ascending=False).head(5)
We can sort by rating, but how accurate would that be?
If we sort this way, courses with more purchases and more comments can end up lower in the list, because the rating alone ignores how many people bought or commented on a course.
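A tiny made-up example shows the problem. The three courses below are hypothetical and are not taken from product_sorting.csv; the sketch uses plain Python so it runs without the dataset.

```python
# Hypothetical courses, invented for illustration only.
courses = [
    {"course_name": "Course X", "rating": 5.0, "purchase_count": 3, "comment_count": 1},
    {"course_name": "Course Y", "rating": 4.7, "purchase_count": 12000, "comment_count": 2500},
    {"course_name": "Course Z", "rating": 4.6, "purchase_count": 8000, "comment_count": 1900},
]

# Sort by rating only, highest first
by_rating = sorted(courses, key=lambda c: c["rating"], reverse=True)
for course in by_rating:
    print(course["course_name"], course["rating"], course["purchase_count"])
```

Course X takes first place with only three purchases, ahead of two courses that thousands of people bought and rated almost as highly.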
*Sorting by Comment Count or Purchase Count
df.sort_values('purchase_count', ascending=False).head(20)
df.sort_values('commment_count', ascending=False).head(20)
Sorting by counts alone is not enough either: whether people actually liked the course, that is, its rating, should also be taken into account.
*Sorting by Rating, Comment and Purchase
How can we combine these three parameters?
df['purchase_count_scaled'] = MinMaxScaler(feature_range=(1, 5)). \
    fit(df[['purchase_count']]). \
    transform(df[['purchase_count']])
df['comment_count_scaled'] = MinMaxScaler(feature_range=(1, 5)). \
    fit(df[['commment_count']]). \
    transform(df[['commment_count']])
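To see what scaling to the range 1 to 5 does, here is a minimal hand-written sketch of min-max scaling in plain Python. The function name `min_max_scale` and the sample counts are my own assumptions for illustration. The handling of the case where all values are equal is also just a choice made here.

```python
def min_max_scale(values, new_min=1.0, new_max=5.0):
    """Linearly rescale values so the smallest maps to new_min and the largest to new_max."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # No spread in the data: map everything to new_min (an assumption of this sketch).
        return [new_min for _ in values]
    span = old_max - old_min
    return [(v - old_min) / span * (new_max - new_min) + new_min for v in values]


purchase_counts = [10, 55, 100]  # hypothetical purchase counts
print(min_max_scale(purchase_counts))
```

The smallest count becomes 1, the largest becomes 5, and everything in between keeps its relative position. That puts the counts on the same range as a 1 to 5 star rating.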
After scaling, all three variables are on the same 1 to 5 scale, so we can combine them in a weighted average.
(df['comment_count_scaled'] * 32 / 100 +
df['purchase_count_scaled'] * 26 / 100 +
df['rating'] * 42 / 100)
Let’s do this calculation with a function!
def weighted_sorting_score(dataframe, w1=30, w2=28, w3=42):
    # w1, w2, w3 are percentage weights for comments, purchases, and rating
    return (dataframe['comment_count_scaled'] * w1 / 100 +
            dataframe['purchase_count_scaled'] * w2 / 100 +
            dataframe['rating'] * w3 / 100)
df['weighted_sorting_score'] = weighted_sorting_score(df)
df.sort_values('weighted_sorting_score', ascending=False).head(5)
In this way, we have seen the effects of all three parameters at the same time.
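To check the arithmetic by hand, here is a small standalone sketch of the same weighted sum for a single course. The function name `weighted_score` and the check that the weights add up to 100 are my additions, and the input values are invented.

```python
def weighted_score(comment_scaled, purchase_scaled, rating, w1=30, w2=28, w3=42):
    """Weighted sum of three values on the same 1-5 scale; weights are percentages."""
    if w1 + w2 + w3 != 100:
        # Added safeguard: otherwise the result would leave the 1-5 range.
        raise ValueError("weights must sum to 100")
    return (comment_scaled * w1 + purchase_scaled * w2 + rating * w3) / 100


# Hypothetical courses: a perfect rating with minimal activity,
# and a slightly lower rating with maximal activity.
quiet_course = weighted_score(comment_scaled=1.0, purchase_scaled=1.0, rating=5.0)
popular_course = weighted_score(comment_scaled=5.0, purchase_scaled=5.0, rating=4.5)
print(quiet_course, popular_course)
```

Because the weights sum to 100, the score stays between 1 and 5. Here the popular course with a 4.5 rating ends up well ahead of the quiet course with a perfect 5.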
We can also use statistical methods for ranking. One of them is the Bayesian Average Rating Score.
Statistical methods
*Bayesian Average Rating Score
With this method, we calculate the average over the score distribution.
We will make probabilistic calculations based on the values we already have.
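As I understand the method, the score can be written as follows. The symbols are my own notation: n_k is the number of ratings with k stars, K is the number of star levels (5 here), N is the total number of ratings, and z is the standard normal quantile for the chosen confidence level.

```latex
\[
\mu = \sum_{k=1}^{K} k \,\frac{n_k + 1}{N + K},
\qquad
\mu_2 = \sum_{k=1}^{K} k^2 \,\frac{n_k + 1}{N + K}
\]
\[
\text{score} = \mu - z \sqrt{\frac{\mu_2 - \mu^2}{N + K + 1}}
\]
```

The first term is a smoothed average rating, and the second term subtracts an uncertainty margin that shrinks as the number of ratings N grows.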
import math
import scipy.stats as st
df.head()
def bayesian_average_rating(n, confidence=0.95):
    # n: rating counts ordered from 1 point to 5 points
    if sum(n) == 0:
        return 0
    K = len(n)  # number of rating levels
    z = st.norm.ppf(1 - (1 - confidence) / 2)  # two-sided normal quantile
    N = sum(n)  # total number of ratings
    first_part = 0.0
    second_part = 0.0
    for k, n_k in enumerate(n):
        first_part += (k + 1) * (n_k + 1) / (N + K)
        second_part += (k + 1) * (k + 1) * (n_k + 1) / (N + K)
    score = first_part - z * math.sqrt((second_part - first_part * first_part) / (N + K + 1))
    return score
df['bar_score'] = df.apply(lambda x: bayesian_average_rating(x[['1_point',
                                                                '2_point',
                                                                '3_point',
                                                                '4_point',
                                                                '5_point']]), axis=1)
df.sort_values('bar_score', ascending=False).head(20)
df[df['course_name'].index.isin([5, 1])].sort_values('bar_score', ascending=False)
Courses with more reviews and purchases have fallen behind! This is because the sorting was based only on the score distribution of the courses.
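To see how the score reacts to the star counts alone, here is a standalone sketch on two invented rating distributions. It is an assumption-laden variant of the function above: it uses `statistics.NormalDist` from the standard library in place of `scipy.stats`, and the names `bar_score` and `raw_mean` are mine.

```python
import math
from statistics import NormalDist


def bar_score(counts, confidence=0.95):
    """counts[i] is the number of ratings with i + 1 stars."""
    N = sum(counts)
    if N == 0:
        return 0.0
    K = len(counts)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    mean = sum((k + 1) * (n + 1) / (N + K) for k, n in enumerate(counts))
    mean_sq = sum((k + 1) ** 2 * (n + 1) / (N + K) for k, n in enumerate(counts))
    return mean - z * math.sqrt((mean_sq - mean ** 2) / (N + K + 1))


def raw_mean(counts):
    """Plain average star rating, for comparison."""
    return sum((k + 1) * n for k, n in enumerate(counts)) / sum(counts)


few = [0, 0, 0, 0, 3]            # hypothetical: three 5-star ratings
many = [10, 10, 30, 150, 800]    # hypothetical: 1000 ratings, mostly 5 stars

print(raw_mean(few), bar_score(few))
print(raw_mean(many), bar_score(many))
```

By plain average the course with three ratings wins, but its BAR score is much lower because there is so little evidence behind it. Note that the function only receives the star counts; purchase and comment counts never enter the calculation.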
*Hybrid Sorting
This method combines the BAR (Bayesian Average Rating) Score with the other factors.
def hybrid_sorting_score(dataframe, bar_w=60, wss_w=40):
    # bar_w and wss_w are percentage weights for the two scores
    bar_score = dataframe.apply(lambda x: bayesian_average_rating(x[['1_point',
                                                                     '2_point',
                                                                     '3_point',
                                                                     '4_point',
                                                                     '5_point']]), axis=1)
    wss_score = weighted_sorting_score(dataframe)
    return bar_score * bar_w / 100 + wss_score * wss_w / 100
df['hybrid_sorting_score'] = hybrid_sorting_score(df)
df.sort_values('hybrid_sorting_score', ascending=False).head(5)
As we can see, sorting a product by a single parameter is not enough.
We have to find and implement hybrid solutions that fit our problem.
If you go to my GitHub page, you can see the whole project in a single script.
Conclusion
In this article, I used an example to explain the criteria by which we can sort online courses offered on a website.
In the next article, we will cover the issue of Sorting Reviews.
Best, Neslihan