Measurement Problems — 1

Mar 19, 2023

With this article, I am starting a new mini-series on measurement problems, which have an important place in the field of data science.

This series will cover three main topics:

Rating Products
Sorting Products
Sorting Reviews

These topics are useful in determining how we evaluate the products we are considering buying.

Let’s say we decide to buy a product. When we search for it on the internet, we see that there are many options.

So what are the things that will help us decide in such a situation?

Social Proof

Social proof is the tendency to copy other people’s behavior. When we decide to buy a product, we look at other people’s comments and ratings while choosing among the options, and these influence our purchase.

Of course, I can’t explain this from a psychological standpoint, but I think it is fair to say that choosing what everyone else approves of keeps us in our comfort zone.

Robert Cialdini introduced the term social proof in a book published in 1984.

This brings us to another concept:

The Wisdom of Crowds

This is the feeling that if everyone has bought a particular product among so many options, I should buy it too. Why not? Besides, I don’t want to think too much!

Measurement Problems

Here, the marketplace’s job is to sort products as accurately and objectively as possible for customers.

Several measurement problems come up while doing this.

In this first part of the series, we’ll cover the issue of rating products.

Rating Products

There are several methods we can use to calculate product ratings:

*Average

*Time-Based Weighted Average

*User-Based Weighted Average

*Weighted Rating

I will try to explain these methods with an example.

In this example, we will examine the ratings of an online course.

Install and Imports

import pandas as pd
import datetime as dt
import math
import scipy.stats as st
from sklearn.preprocessing import MinMaxScaler

pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.width', 500)
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.float_format', lambda x: '%.5f' % x)

Reading the Data

df = pd.read_csv(r'C:\course_reviews.csv')

df.head()
df.shape

Distribution of ‘Rating’

df['Rating'].value_counts()

Distribution of ‘Questions Asked’

df['Questions Asked'].value_counts()

Average rating by number of questions asked

df.groupby('Questions Asked').agg({'Questions Asked': 'count',
                                   'Rating': 'mean'})

*Average

df['Rating'].mean()

A plain average gives us a product rating, but it overlooks when each rating was given.

Ratings may have been trending up or down recently, and this should also be taken into account.

To capture this, we can use the Time-Based Weighted Average.

*Time-Based Weighted Average

Let’s convert the ‘Timestamp’ variable to a datetime type

df['Timestamp'] = pd.to_datetime(df['Timestamp'])

current_date = pd.to_datetime('2021-02-10 0:0:0')

We should choose this date to suit the dataset. It is treated as day zero, and the number of days since each comment was posted is measured from it.

df['days'] = (current_date - df['Timestamp']).dt.days
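To make the day count concrete, here is a minimal sketch on a few made-up rows. The toy timestamps and the name `reference_date` are assumptions for illustration and are not taken from course_reviews.csv.

```python
import pandas as pd

# Toy rows invented for illustration; not from course_reviews.csv.
toy = pd.DataFrame({
    "Timestamp": ["2021-02-05 07:45:55", "2021-01-01 12:00:00", "2020-08-10 00:00:00"],
    "Rating": [5.0, 4.0, 3.0],
})
toy["Timestamp"] = pd.to_datetime(toy["Timestamp"])

# Same reference point as in the article, treated as "day zero".
reference_date = pd.to_datetime("2021-02-10 00:00:00")

# Subtracting datetimes yields Timedelta values; .dt.days keeps whole days only,
# so a comment posted 4 days and 16 hours ago counts as 4 days.
toy["days"] = (reference_date - toy["Timestamp"]).dt.days
print(toy[["Timestamp", "days"]])
```

Because `.dt.days` drops the partial day, a comment counts as 30 days old until a full 31 days have passed.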

Comments in the last 30 days

df[df['days'] <= 30].count()

Average of the comments in the last 30 days

df.loc[df['days'] <= 30, 'Rating'].mean()

We can now reflect the effect of time in the calculation by giving different weights to the averages of these time intervals.

df.loc[df['days'] <= 30, 'Rating'].mean() * 28/100 + \
df.loc[(df['days'] > 30) & (df['days'] <= 90), 'Rating'].mean() * 26/100 + \
df.loc[(df['days'] > 90) & (df['days'] <= 180), 'Rating'].mean() * 24/100 + \
df.loc[(df['days'] > 180), 'Rating'].mean() * 22/100
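The arithmetic behind this expression can be checked by hand on a small example. The sketch below uses six made-up reviews (an assumption for illustration, not the real dataset) and the same 28/26/24/22 weights.

```python
import pandas as pd

# Toy data invented for illustration; not taken from course_reviews.csv.
toy = pd.DataFrame({
    "days":   [5, 20, 60, 120, 150, 300],
    "Rating": [5.0, 5.0, 4.0, 4.0, 5.0, 3.0],
})

# Mean rating inside each time interval, newest first.
m1 = toy.loc[toy["days"] <= 30, "Rating"].mean()                           # 5.0
m2 = toy.loc[(toy["days"] > 30) & (toy["days"] <= 90), "Rating"].mean()    # 4.0
m3 = toy.loc[(toy["days"] > 90) & (toy["days"] <= 180), "Rating"].mean()   # 4.5
m4 = toy.loc[toy["days"] > 180, "Rating"].mean()                           # 3.0

# Each interval mean is multiplied by its weight; the weights sum to 100.
weighted = m1 * 28 / 100 + m2 * 26 / 100 + m3 * 24 / 100 + m4 * 22 / 100
plain = toy["Rating"].mean()

print(round(plain, 2), round(weighted, 2))
```

Here the plain mean is about 4.33 and the weighted result is about 4.18. Note that the weights apply to interval means rather than to individual reviews, so an interval with only one review (the 300-day-old one here) still contributes its full 22%.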

Let’s wrap this calculation in a function

def time_based_weighted_average(dataframe, w1=28, w2=26, w3=24, w4=22):
    # Filter on the dataframe passed in, not on the global df
    return  dataframe.loc[dataframe['days'] <= 30, 'Rating'].mean() * w1/100 + \
            dataframe.loc[(dataframe['days'] > 30) & (dataframe['days'] <= 90), 'Rating'].mean() * w2/100 + \
            dataframe.loc[(dataframe['days'] > 90) & (dataframe['days'] <= 180), 'Rating'].mean() * w3/100 + \
            dataframe.loc[(dataframe['days'] > 180), 'Rating'].mean() * w4/100

time_based_weighted_average(df)
time_based_weighted_average(df, 30, 26, 22, 22)

In this way, we can change the weights as we want.
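If we also want to change the interval boundaries, one possible generalization is sketched below with `pd.cut`. The helper name `bucketed_weighted_average` and its parameters are hypothetical and not part of the original script. Renormalizing over the non-empty intervals is an assumption about how to treat an empty interval, where `.mean()` would otherwise return NaN and turn the whole sum into NaN.

```python
import pandas as pd

def bucketed_weighted_average(dataframe, column, bins, weights, target="Rating"):
    """Hypothetical helper: weight the mean of `target` per interval of `column`."""
    if len(weights) != len(bins) - 1:
        raise ValueError("need exactly one weight per interval")
    if abs(sum(weights) - 100) > 1e-9:
        raise ValueError("weights must sum to 100")

    # right=True makes each interval (lower, upper], i.e. `> lower` and `<= upper`.
    # labels=False returns the integer position of the interval for each row.
    codes = pd.cut(dataframe[column], bins=bins, right=True, labels=False)
    means = dataframe.groupby(codes)[target].mean()

    # Intervals with no rows are absent from `means`, so we divide by the
    # sum of the weights that were actually used (assumed behaviour).
    weighted_sum = 0.0
    weight_total = 0.0
    for code, mean in means.items():
        weighted_sum += mean * weights[int(code)]
        weight_total += weights[int(code)]
    return weighted_sum / weight_total

# Toy data invented for illustration.
toy = pd.DataFrame({"days": [5, 20, 60, 120, 150, 300],
                    "Rating": [5.0, 5.0, 4.0, 4.0, 5.0, 3.0]})
day_bins = [float("-inf"), 30, 90, 180, float("inf")]

full = bucketed_weighted_average(toy, "days", day_bins, [28, 26, 24, 22])

# Only the first and last intervals have reviews here.
sparse = bucketed_weighted_average(toy.iloc[[0, 1, 5]], "days", day_bins, [28, 26, 24, 22])
print(full, sparse)
```

With all four intervals populated, the helper gives the same number as the explicit four-term sum. Whether renormalizing is the right choice for an empty interval depends on the use case.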

*User-Based Weighted Average

The ratings of users who have completed different portions of the course should be given different weights:

df.groupby('Progress').agg({'Rating': 'mean'})

According to the results, there is a relationship between a user’s progress and the rating they give.
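Grouping by every distinct ‘Progress’ value can produce a long table. As a rough sketch, the progress values can first be binned into the same bands used for the weights, which makes the relationship easier to read. The toy rows and the column name `progress_band` are assumptions for illustration.

```python
import pandas as pd

# Toy data invented for illustration; not taken from course_reviews.csv.
toy = pd.DataFrame({
    "Progress": [1.0, 5.0, 30.0, 40.0, 60.0, 100.0, 100.0],
    "Rating":   [3.0, 4.0, 4.0, 5.0, 4.5, 5.0, 5.0],
})

# Bands: up to 10, 10-45, 45-75, above 75 (each band includes its upper edge).
bins = [0, 10, 45, 75, 100]
toy["progress_band"] = pd.cut(toy["Progress"], bins=bins, include_lowest=True)

# Number of reviews and mean rating inside each band.
summary = toy.groupby("progress_band", observed=True)["Rating"].agg(["count", "mean"])
print(summary)
```

Showing the count next to the mean helps judge how much to trust each band’s average.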

To calculate the weighted score based on how much of the course was watched:

df.loc[df['Progress'] <= 10, 'Rating'].mean() * 22/100 + \
    df.loc[(df['Progress'] > 10) & (df['Progress'] <= 45), 'Rating'].mean() * 24/100 + \
    df.loc[(df['Progress'] > 45) & (df['Progress'] <= 75), 'Rating'].mean() * 26/100 + \
    df.loc[(df['Progress'] > 75), 'Rating'].mean() * 28/100

Let’s wrap this calculation in a function

def user_based_weighted_average(dataframe, w1=22, w2=24, w3=26, w4=28):
    # Filter on the dataframe passed in, not on the global df
    return  dataframe.loc[dataframe['Progress'] <= 10, 'Rating'].mean() * w1/100 + \
            dataframe.loc[(dataframe['Progress'] > 10) & (dataframe['Progress'] <= 45), 'Rating'].mean() * w2/100 + \
            dataframe.loc[(dataframe['Progress'] > 45) & (dataframe['Progress'] <= 75), 'Rating'].mean() * w3/100 + \
            dataframe.loc[(dataframe['Progress'] > 75), 'Rating'].mean() * w4/100

user_based_weighted_average(df)
user_based_weighted_average(df, 20, 24, 26, 30)

This makes the product rating calculation more sensitive!

*Weighted Rating

We will combine Time-Based Weighted Average and User-Based Weighted Average calculations in a single function.

def course_weighted_rating(dataframe, time_w=50, user_w=50):
    return time_based_weighted_average(dataframe) * time_w/100 + user_based_weighted_average(dataframe) * user_w/100

course_weighted_rating(df)
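To see how the two parts combine, here is a self-contained sketch on six made-up reviews. The toy data is an assumption for illustration, and the functions are restated in compact form so the block runs on its own.

```python
import pandas as pd

def time_based_weighted_average(dataframe, w1=28, w2=26, w3=24, w4=22):
    d, r = dataframe["days"], dataframe["Rating"]
    return (r[d <= 30].mean() * w1 / 100
            + r[(d > 30) & (d <= 90)].mean() * w2 / 100
            + r[(d > 90) & (d <= 180)].mean() * w3 / 100
            + r[d > 180].mean() * w4 / 100)

def user_based_weighted_average(dataframe, w1=22, w2=24, w3=26, w4=28):
    p, r = dataframe["Progress"], dataframe["Rating"]
    return (r[p <= 10].mean() * w1 / 100
            + r[(p > 10) & (p <= 45)].mean() * w2 / 100
            + r[(p > 45) & (p <= 75)].mean() * w3 / 100
            + r[p > 75].mean() * w4 / 100)

def course_weighted_rating(dataframe, time_w=50, user_w=50):
    return (time_based_weighted_average(dataframe) * time_w / 100
            + user_based_weighted_average(dataframe) * user_w / 100)

# Toy data invented for illustration; not taken from course_reviews.csv.
toy = pd.DataFrame({
    "days":     [5, 20, 60, 120, 150, 300],
    "Progress": [100, 5, 80, 30, 60, 10],
    "Rating":   [5.0, 5.0, 4.0, 4.0, 5.0, 3.0],
})

time_score = time_based_weighted_average(toy)    # about 4.18
user_score = user_based_weighted_average(toy)    # about 4.40
balanced = course_weighted_rating(toy)           # 50/50 blend, about 4.29
time_heavy = course_weighted_rating(toy, time_w=70, user_w=30)
print(time_score, user_score, balanced, time_heavy)
```

Shifting `time_w` and `user_w` moves the final score between the two component scores, so the choice of these weights is a judgment about which signal matters more.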

You can find this whole project as a single script on my GitHub page.

Conclusion

In this article, I explained what helps us decide when we want to buy a product, and we looked at the measurement problems that arise when trying to sort products as accurately and objectively as possible.

In the next article, we will cover the issue of sorting products.

Convince and ye shall be convinced!

Best, Neslihan

Written by Neslihan Avsar