Predicting Kindle books’ reviews

Jesús Caballero Medrano
Mar 6, 2020


In 2018, Amazon hit the one-trillion-dollar milestone, doubling its value in a single year. That milestone made it more valuable than Walmart, Samsung, Netflix, and even Disney, all put together.

But why? Amazon, which controls 49% of eCommerce, processes a lot of data about its customers, and even about people who are not customers yet. In offline stores, you have to join some rewards club before your purchases can be tied to you. Online, activity analysis is much easier. The moment you enter the site, you become the perfect customer: every move, every click in the web store is an opportunity to learn from you; even the device you use and how much time you spend looking at a product!

Another strategy it uses to maximize profit is analyzing popular products for which customers don't care about the brand, such as knives and batteries. By processing return and review data, Amazon can choose which products to manufacture on its own, more cheaply. This is where reviews become important to the business.

In addition, Amazon loses money selling some of its own-branded products, like the Kindle, a series of e-readers designed to let users browse, buy, download, and read e-books, newspapers, magazines, and other digital media from the Kindle Store over wireless networking. But that is just another strategy, because it has been shown that Kindle owners are likely to spend more than people who don't own one. That's why Kindle books’ reviews are so important to the company.

I got a data set of Kindle books’ reviews from 2008 to 2018. Each written review comes together with:

  • Reviewer ID.
  • Verification of the purchase.
  • The Amazon Standard Identification Number (ASIN) of the Kindle book.
  • A short summary, commonly used as a title for the review.
  • The time when the review was made.
  • Any votes from other customers who agree with the reviewer’s post.
  • And a photo, if the reviewer decided to post one.

First, I looked at the proportion of scores. Scores go from 1 to 5, and the share of each one is:

  • Score 5: 60.53%
  • Score 4: 25.29%
  • Score 3: 8.93%
  • Score 2: 3.07%
  • Score 1: 2.16%

The score 5 is the most popular among reviews.

Because of the limitations of my computational resources, I had to take a random sample of 10% of the data, which contains 2 million reviews, verifying that the sample keeps the same proportion of scores.
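Roughly, that step looks like the sketch below in pandas. The file name and the `overall` column (the 1–5 star score) are assumptions about the public dump and may need adjusting:

```python
import pandas as pd

# Assumed file name and schema: "overall" holds the 1-5 star score.
reviews = pd.read_json("Kindle_Store.json", lines=True)

# Share of each score, in percent.
print(reviews["overall"].value_counts(normalize=True).mul(100).round(2))

# Stratified 10% sample: sampling within each score group keeps the
# original proportion of 1-5 star reviews intact.
sample = (
    reviews.groupby("overall", group_keys=False)
           .apply(lambda g: g.sample(frac=0.10, random_state=42))
)
```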

Next, I engineered some features: the season of the year in which the review was made, whether the reviewer is among the top 50 most active reviewers, the length of the summary and of the review text itself, and the subjectivity and polarity of both the summary and the review text.

What are subjectivity and polarity? They are part of the sentiment analysis that AI can apply to a text. Sentiment analysis relies on the importance of word order: “great” is positive, while “not great” is not. I used a tool developed by Tom De Smedt that assigns a value to each word.
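A rough sketch of how these features could be computed, continuing from the `sample` DataFrame above. I use TextBlob here, which wraps the Pattern lexicon De Smedt developed; the column names (`reviewText`, `summary`, `reviewerID`, `unixReviewTime`) are assumptions about the dump’s schema:

```python
import pandas as pd
from textblob import TextBlob  # polarity/subjectivity from the Pattern lexicon

# Lengths of the summary and of the review text.
sample["review_len"] = sample["reviewText"].fillna("").str.len()
sample["summary_len"] = sample["summary"].fillna("").str.len()

# Season of the year in which the review was written.
month = pd.to_datetime(sample["unixReviewTime"], unit="s").dt.month
season_map = {12: "winter", 1: "winter", 2: "winter",
              3: "spring", 4: "spring", 5: "spring",
              6: "summer", 7: "summer", 8: "summer",
              9: "fall", 10: "fall", 11: "fall"}
sample["season"] = month.map(season_map)

# Flag for the 50 most active reviewers.
top50 = sample["reviewerID"].value_counts().head(50).index
sample["top_reviewer"] = sample["reviewerID"].isin(top50)

# Polarity (-1 to 1) and subjectivity (0 to 1) of the review text;
# the same can be done for the summary column.
sentiment = sample["reviewText"].fillna("").map(lambda t: TextBlob(t).sentiment)
sample["polarity"] = sentiment.map(lambda s: s.polarity)
sample["subjectivity"] = sentiment.map(lambda s: s.subjectivity)
```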

I got an interesting plot of the correlation between subjectivity and polarity; it seems that more subjectivity implies more polarity.

Subjectivity vs Polarity
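To check that impression numerically, a quick look at the correlation between the two engineered columns (names as assumed above):

```python
import matplotlib.pyplot as plt

# Pearson correlation and a scatter plot of the two sentiment features.
print(sample[["subjectivity", "polarity"]].corr())
sample.plot.scatter(x="subjectivity", y="polarity", alpha=0.1)
plt.show()
```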

Another technique I used is splitting the review text so I can relate the words used to the score. But not all words are equally important. For example, the word “the” is the most used word in English, with a share of about 6%+; the second is “of” and the third is “and”. We cannot let this skew the prediction. So I used a technique called Term Frequency–Inverse Document Frequency (TF-IDF), which weights words by importance. But how is this importance calculated? A word’s importance increases with the number of times it is mentioned in a review, but it is offset by how many times it is mentioned across all the reviews with a given score.

Luckily, this method also pulls out the most common words in English that might not be relevant in context; these are known as stop words, and are usually prepositions, articles, and numbers. With that, our data set is ready to go.

Frequency of words in English.
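A minimal sketch of this step with scikit-learn; `stop_words="english"` handles the stop-word removal, and the `max_features` cap is my assumption to keep the matrix manageable:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF over the review text. English stop words ("the", "of", "and", ...)
# are dropped so they cannot dominate the prediction.
vectorizer = TfidfVectorizer(stop_words="english", max_features=20_000)
X_text = vectorizer.fit_transform(sample["reviewText"].fillna(""))
print(X_text.shape)  # (number of reviews, number of kept terms)
```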

I ran a RandomForestClassifier to predict the score of a batch of reviews, and I got some interesting results. The first thing I would like to point out is how hard it is to get a good model. I tried several other models, and all of them got better at predicting 5 when the review is indeed a 5, but for the other scores the results were not as good.
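A sketch of that training step under the same assumptions, with the TF-IDF matrix stacked next to a few of the engineered features; the per-class report is what makes the imbalance visible:

```python
from scipy.sparse import hstack
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Stack the sparse TF-IDF matrix with a few numeric features.
X_extra = sample[["review_len", "summary_len", "polarity", "subjectivity"]].values
X = hstack([X_text, X_extra]).tocsr()
y = sample["overall"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

model = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
model.fit(X_train, y_train)

# Per-class precision and recall: 5-star reviews are predicted well,
# the minority scores much less so.
print(classification_report(y_test, model.predict(X_test)))
```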


Why? One possible cause I thought of: we, as customers, are more likely to draw on a narrow, specific “bag of words” when we are happy with a purchase than when we are not. For example, if I am satisfied with the product I purchased, I hit the 5-star button and write: Great, Good, Excellent. But when it turns into a bad experience, we want to let the world know about it, emphasizing our disappointment with more distinct words. This thought is supported by the next plot, which shows the mean weight of the words across all reviews.

Most relevant words in the reviews.

Another thing I would like to highlight about this model is its most important features: the polarity of the review text and the length of the review, followed by the subjectivity of the review text. A disclaimer must be made: due to the limitations of my resources, I had to drop what I called very uncommon words, those with less than 0.01% of the weight in the score. That reduced my set of words significantly, from about twenty thousand to about two hundred. I would really like the opportunity to run the model with all the variables at least, and, if possible, with all the reviews.
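One possible way to inspect those importances, and to apply the kind of pruning described above, continuing the sketch (the 0.01% threshold maps to 0.0001 of the total importance; names are assumptions):

```python
import numpy as np
import pandas as pd

# TF-IDF terms come first in X, followed by the engineered columns.
feature_names = np.concatenate([
    vectorizer.get_feature_names_out(),
    ["review_len", "summary_len", "polarity", "subjectivity"],
])
importances = pd.Series(model.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False).head(50))

# Drop "uncommon" word features carrying less than 0.01% of the weight.
kept = importances[importances > 0.0001]
print(f"kept {len(kept)} of {len(importances)} features")
```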

In conclusion, this is just a first step towards predicting review scores. The most important takeaway might be: get a good computer or server to get through it first!

Some other visualizations I got from the data that are worth sharing:

Number of words in the review across years.
Number of votes a review received, sorted by its verification status.
Top 50 features for the RandomForestClassifier.

Data set credits: Jianmo Ni, Jiacheng Li, Julian McAuley
Empirical Methods in Natural Language Processing (EMNLP), 2019

https://nijianmo.github.io/amazon/
