Measurement Problems Episode II: Rating & Sorting Products — Reviews

Cem ÖZÇELİK
8 min read · Jul 20, 2022

Hello again from the second episode of our measurement problems series. In our previous study, we focused on hypothesis testing and studied AB testing. In this article, we will work on the rating and sorting problems that we frequently encounter in the e-commerce industry.

Sorting and rating are two important levers in e-commerce, which has become a bigger part of our lives since the pandemic. They steer a user's purchasing decisions while shopping on a site and, used correctly, increase sales and profitability.

To make this more concrete, imagine that you want to buy a computer online. You open a shopping site and search for the computer you want. The search returns dozens of listings from different vendors. So how do you choose among dozens of options? By the product score? By reading the comments on the product? Or maybe you consider both together. And how much does it matter to you how many people have bought, reviewed and rated the product?

I want to draw your attention to one point. Think back to a situation like the one in the paragraph above. It is no coincidence that the highest-rated products come up first when you search, that the products with the most comments and reviews are at the top, and that even within a product's comments, the ones that influence you the most are listed first. All of these are deliberate, calculated applications of the concept of social proof, designed to guide your purchasing decisions.

The wisdom of crowds has an enormous impact on people’s daily behavior. From politics to shopping, people are led by a super hyper voodoo thing called the wisdom of crowds. Here is how to prepare it:

3 grams of psychology

2 grams of neuroscience

5 grams of statistics

1 appropriate computer and a sprinkle of calculation skills

Now that we’ve talked enough about what social proof is, let’s move on to the practical part.

Our dataset contains information about products in the Amazon.com database: product ratings, product reviews, and whether other users found those reviews helpful. We will get to know the dataset better as the study progresses.

As a first step, let’s import the necessary libraries.

With the dataset imported as well, the next step is to take a look at its structure.
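The import and loading step might look like the sketch below. It assumes pandas; the file name `amazon_review.csv` is hypothetical, and a few made-up inline rows stand in for the real file so that the snippet runs on its own.

```python
import io

import pandas as pd

# Show every column when printing wide frames.
pd.set_option("display.max_columns", None)

# Made-up sample rows standing in for the real file. With the actual data this
# would be something like pd.read_csv("amazon_review.csv") (assumed file name).
SAMPLE_CSV = """reviewerName,overall,day_diff,helpful_yes,total_vote
user_a,5.0,120,3,4
user_b,4.0,640,0,0
user_c,1.0,300,10,12
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))
print(df.head())
```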

Overview of the Amazon dataset

reviewerID: ID of the reviewing user

asin: Product ID

reviewerName: Name of the user who made the purchase

helpful: Helpfulness votes for the review

reviewText: Text of the review

overall: Rating given to the product

summary: Review summary

unixReviewTime: Time of the review (Unix timestamp)

reviewTime: Time of the review in raw format

day_diff: Number of days since the review

helpful_yes: Number of votes that found the review helpful

total_vote: Total number of votes the review received

After getting to know the variables, we take a look at the structure and size of the dataset.
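A quick structural check could be sketched as follows. The helper name `check_df` and the sample rows are illustrative assumptions, not the exact functions used in the original study.

```python
import pandas as pd

# Made-up rows for illustration only; the real dataset is much larger.
df = pd.DataFrame({
    "overall": [5.0, 4.0, 1.0],
    "day_diff": [120, 640, 300],
    "helpful_yes": [3, 0, 10],
    "total_vote": [4, 0, 12],
})


def check_df(dataframe: pd.DataFrame, head: int = 5) -> dict:
    """Collect the basic structure of a frame: size, dtypes, missing values."""
    summary = {
        "shape": dataframe.shape,
        "dtypes": dataframe.dtypes.astype(str).to_dict(),
        "missing": dataframe.isnull().sum().to_dict(),
    }
    print(dataframe.head(head))
    return summary


summary = check_df(df)
print(summary["shape"])
```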

The dataset after applying these two functions:

Image of the dataset
  • Note: The value_counts method was not used in the dataset summary, because the dataset contains the comments and ratings for a single product only.

The first way to evaluate the product is to take its average score, so we compute the mean of the overall variable.
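As a small illustration with made-up ratings (not values from the dataset), the plain average is a single call on the `overall` column:

```python
import pandas as pd

# Made-up ratings for illustration only.
df = pd.DataFrame({"overall": [5.0, 5.0, 4.0, 3.0, 1.0]})

# Plain average: every review counts the same, whatever its age or helpfulness.
average_rating = df["overall"].mean()
print(round(average_rating, 2))
```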

Depending on the business problem, the average score alone may be considered sufficient to evaluate the product. In general, however, it does not fully reflect reality.

As the second stage of the evaluation, we take time into account: we split the period between each review and the present day into segments and look at the ratings within each segment. This lets us separate the reviews of users who bought the product long ago from those of users who have only just bought it and have not yet had the chance to assess it properly.
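One way to sketch this time breakdown is to cut `day_diff` at its quartiles and combine the segment averages with weights. The function name, the sample rows and the weights (28%, 26%, 24%, 22%, favoring recent reviews) are assumptions for illustration and should be chosen to fit the business problem.

```python
import pandas as pd


def time_based_weighted_average(dataframe, weights=(0.28, 0.26, 0.24, 0.22)):
    """Weight the mean rating of each day_diff quartile, newest segment first.

    The weights are illustrative assumptions and should sum to 1.
    A segment with no reviews would produce NaN, so real data needs a check.
    """
    days = dataframe["day_diff"]
    q1, q2, q3 = days.quantile([0.25, 0.50, 0.75])
    segments = [
        days <= q1,                  # most recent reviews
        (days > q1) & (days <= q2),
        (days > q2) & (days <= q3),
        days > q3,                   # oldest reviews
    ]
    return sum(
        dataframe.loc[mask, "overall"].mean() * weight
        for mask, weight in zip(segments, weights)
    )


# Made-up rows for illustration only.
df = pd.DataFrame({
    "day_diff": [10, 20, 30, 40, 50, 60, 70, 80],
    "overall": [5, 5, 4, 4, 4, 3, 3, 2],
})

plain_average = df["overall"].mean()
weighted_average = time_based_weighted_average(df)
print(plain_average, round(weighted_average, 2))
```

In this toy data the older reviews are lower, so giving recent reviews more weight lifts the score slightly above the plain mean.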

There are no big differences between the time periods, but the overall variable decreases as the number of days increases. Judging by the days alone, the ratings given to the product fall as more time passes. The raw score of the product can then be found by starting from the average of the reviews older than 600 days (above q3).

We found the net score of the product by using time-based weighting. For other business problems, the score can be put on a more solid basis by weighting users according to that business problem.

Now that we have the product's rating, let’s sort its comments and evaluations. By making use of social proof, we can influence the purchasing decisions of users who browse the products on the site.

As a first step, let’s get the number of unhelpful votes for each comment. For this, we subtract the number of helpful votes (helpful_yes) from the total number of votes (total_vote).

As the second step, we drop the variables that will not be used, which makes the dataset easier to work with.
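These two steps can be sketched as below. The sample rows are made up, and the list of columns to keep is an assumption, since it depends on what the later analysis needs.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "reviewerName": ["user_a", "user_b", "user_c"],
    "overall": [5.0, 4.0, 1.0],
    "day_diff": [120, 640, 300],
    "helpful_yes": [3, 0, 10],
    "total_vote": [4, 0, 12],
})

# Votes that did not find the review helpful = all votes - helpful votes.
df["helpful_no"] = df["total_vote"] - df["helpful_yes"]

# Keep only the columns used for scoring (this selection is an assumption).
df = df[["overall", "day_diff", "helpful_yes", "helpful_no", "total_vote"]]
print(df)
```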

In this part, I will introduce you to the Wilson Lower Bound Score. So what is the Wilson Lower Bound score? To summarize, *it is a tool used for scoring products, comments or items that contain binary interactions.

*Note: Binary interaction refers to products or items that are rated with one of two outcomes, such as Like / Dislike, Up Rate / Down Rate, or Helpful / Not Helpful.

In terms of how it works, Wilson Lower Bound calculates a confidence interval for the **Bernoulli parameter p and takes the lower bound of that interval as the Wilson Lower Bound Score (WLB Score).

**Note: The Bernoulli distribution is the statistical distribution used to model the probability of a binary outcome.

Let’s explain how the WLB Score works with an example. Assume that the ***average score of the comments made on a product is 0.6. We then choose a confidence level, 95% here, and obtain a statement such as “with 95% confidence, the score for this product lies in the range 0.5–0.7”. Finally, we take the lower limit of this confidence interval, 0.5, as the Wilson Lower Bound Score.

***Note: We calculate the average score value of the product as Up Rate(Helpful Rate) / All Rate.
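The calculation can be sketched with the standard Wilson score interval formula. This version uses only the Python standard library (`statistics.NormalDist` for the z value) and is an illustrative sketch, not necessarily the exact function used in the original study.

```python
from math import sqrt
from statistics import NormalDist


def wilson_lower_bound(up: int, down: int, confidence: float = 0.95) -> float:
    """Lower bound of the Wilson score interval for the helpful rate up / (up + down)."""
    n = up + down
    if n == 0:
        return 0.0  # no votes, so no evidence either way
    # Two-sided z value for the chosen confidence level (about 1.96 for 95%).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    phat = up / n
    centre = phat + z * z / (2 * n)
    margin = z * sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)
    return (centre - margin) / (1 + z * z / n)


print(wilson_lower_bound(60, 40))   # helpful rate 0.6 from 100 votes
print(wilson_lower_bound(2, 0))     # helpful rate 1.0 from only 2 votes
print(wilson_lower_bound(100, 1))   # helpful rate 0.99 from 101 votes
```

With 60 helpful votes out of 100, the lower bound comes out at about 0.50, in line with the example above. Two helpful votes out of two score only about 0.34, while 100 out of 101 score about 0.95, so the bound rewards evidence from many votes.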

Let’s define the two other metrics used to score the comments made on the product and compute a score with each of them. In this way, we can interpret the ratings from different perspectives and sort the comments correctly.

First, let’s define the Up-Down Difference Score function. The Up-Down Diff Score is found by subtracting the number of negative ratings from the number of positive ratings.
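A minimal sketch of this score follows; the function name is an illustrative choice.

```python
def score_up_down_diff(up: int, down: int) -> int:
    """Difference between positive and negative votes."""
    return up - down


print(score_up_down_diff(100, 1))
print(score_up_down_diff(600, 400))
```

Note that the difference ignores proportions: 600 positive and 400 negative votes score 200, higher than 100 positive and 1 negative (99), even though the second comment has a much larger share of positive votes.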

Another metric is the Average Score. With this metric, we take the ratio of the Up Rate values, which we treat as positive ratings, to all ratings. That is, Up Rate / (Up Rate + Down Rate). However, this metric has a potential flaw. Let’s examine two cases:

In the first case, a comment receives 2 positive and 0 negative ratings, so its average is 2/(2+0) = 1.0. In the second case, a comment receives 100 positive ratings and 1 negative rating, so its average is 100/(100+1) = 0.99. When comments are ranked by this average, the comment with a value of 1.0 but far fewer ratings is placed above the comment with an average of 0.99. The social proof factor is therefore not taken into account.
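The two cases can be reproduced with a short sketch. The function name is illustrative, and returning 0.0 when there are no votes is an assumption made to avoid dividing by zero.

```python
def score_average_rating(up: int, down: int) -> float:
    """Share of positive votes: up / (up + down)."""
    total = up + down
    if total == 0:
        return 0.0  # assumption: comments without votes score zero
    return up / total


few_votes = score_average_rating(2, 0)      # 2 positive, 0 negative
many_votes = score_average_rating(100, 1)   # 100 positive, 1 negative
print(few_votes, round(many_votes, 2))
print(few_votes > many_votes)  # the comment with only 2 votes ranks first
```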

The final state of the dataset is as follows:

Final shape of the dataset

Now let’s evaluate our results and rank the top 20 comments by Wilson Lower Bound Score.
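The ranking step might be sketched as follows, with made-up vote counts and an assumed column name `wilson_lower_bound`. The Wilson function is repeated here in compact form so that the snippet runs on its own.

```python
from math import sqrt
from statistics import NormalDist

import pandas as pd


def wilson_lower_bound(up, down, confidence=0.95):
    """Lower bound of the Wilson score interval for up / (up + down)."""
    n = up + down
    if n == 0:
        return 0.0
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    phat = up / n
    margin = z * sqrt((phat * (1 - phat) + z * z / (4 * n)) / n)
    return (phat + z * z / (2 * n) - margin) / (1 + z * z / n)


# Made-up vote counts for illustration only.
df = pd.DataFrame({
    "helpful_yes": [2, 100, 0, 60],
    "helpful_no": [0, 1, 0, 40],
})

df["wilson_lower_bound"] = [
    wilson_lower_bound(up, down)
    for up, down in zip(df["helpful_yes"], df["helpful_no"])
]

# Highest score first; head(20) keeps at most the top 20 comments.
top_reviews = df.sort_values("wilson_lower_bound", ascending=False).head(20)
print(top_reviews)
```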

Sorting the comments made about the product

To summarize the sorting table:

  • The observations at the top have the highest total_vote values. We can interpret this as social proof being established for these comments, which is what places them at the top of the ranking.
  • Looking at the helpful_yes and helpful_no distributions of the observations at the top, we see that the helpful_yes values are high.
  • Looking at the day_diff variable, we see that observations with values such as 158 and 283 also appear at the top, which shows us the products with potential. In other words, it would not be a surprise to see them among the best sellers in the future.
  • Finally, looking at the overall variable, some observations are at the top of the ranking even though their overall score is 1.00. This is due to the social proof factor. In other words, being evaluated by a large audience has affected where they sit in this list.

We have come to the end of the second and final study of our series of measurement problems. See you in our next post :)

You can reach the github repository of this work from the link and you can see my other works by visiting my github profile.
