What Makes for a Popular Book Review on Goodreads.com

Michael Burnam-Fink
MBF-data-science
Oct 25, 2018 · 7 min read

I’ve done a lot of cool stuff: Earned a doctorate, set foot on all seven continents, skied down Philipe’s (Not my footage. I’m strongly opposed to GoProing my own epic yard sales). I could go on, but one of the things that I’m proudest of is having over 250 people like my review of Jonathan Haidt’s The Righteous Mind on Goodreads. I read 127 books per year on average, and since 2011 I’ve reviewed them all on Goodreads. But aside from The Righteous Mind, only a handful of my reviews have broken through. For my first independent Metis data science project, I decided to take a crack at understanding the factors that explain the popularity of book reviews on Goodreads.

There’s a business case here, beyond the dopamine spikes I get when I see that strangers on the internet think I have something interesting to say. Goodreads is a social network for book lovers. Reviews are the site’s primary content. Engaging reviews cause people to stick around; good news for browsers, reviewers, and Goodreads as a whole. And the more people use Goodreads, learning about books they might like to read and books they should avoid, the higher the odds that they’ll buy books from Goodreads’ corporate parent, Amazon.com. Figuring out what makes a review people like matters.

Data Collection and Modeling

A Goodreads review looks like this. The “likes” in the bottom right corner are our target variable. We can scrape several features from a review, the associated user page, and the book page: the full text of the review, word count, number of images, how well the reviewer liked the book, how many times the book has been rated by everyone on Goodreads, its average rating, and how many friends and followers the user has.
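To make the feature set concrete, here is a hypothetical record of what one scraped review might look like; the field names and values are illustrative, not Goodreads’ actual schema:

```python
# One scraped review as a plain dict; all names/values are made up
review_record = {
    "review_text": "A sweeping argument about moral psychology...",
    "word_count": 842,          # length of the review
    "n_images": 1,              # images embedded in the review
    "reviewer_rating": 5,       # stars the reviewer gave the book
    "book_ratings_count": 120_000,  # times rated by everyone on Goodreads
    "book_avg_rating": 4.1,     # the book's average rating
    "friends": 150,             # reviewer's friend count
    "followers": 75,            # reviewer's follower count
    "likes": 253,               # the target variable
}
```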

Goodreads reviews have a standard URL format plus a number between 2 and 2.5 billion, so at first I tried random scrapes. It turns out that most review pages on Goodreads are blank ‘to-read’ stubs, and most reviews have zero likes.
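The blind-scraping approach can be sketched in a few lines; the URL prefix and the ID range here are assumptions based on the pattern described above:

```python
import random

# Assumed review-page URL pattern; IDs are sampled blindly from the
# 2-to-2.5-billion range mentioned above.
BASE_URL = "https://www.goodreads.com/review/show/"

def random_review_urls(n, low=2, high=2_500_000_000, seed=None):
    """Return n candidate review URLs with random numeric IDs."""
    rng = random.Random(seed)
    return [BASE_URL + str(rng.randint(low, high)) for _ in range(n)]
```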

To get one valid data point, I’d have to check 1,000 random numbers, which at 5 seconds per scrape would take forever to accumulate reasonable amounts of data. Fortunately, Goodreads also organizes books into lists, which are sorted by popular books and popular book reviews. With a combination of BeautifulSoup and Selenium, I was able to scrape 75,000 book reviews, for a total of 200 MB of data. Fellow Metis data scientist Aaron Frederick pointed me in the direction of multiprocessing, which turned out to be both very useful — it gave me a 4x speed increase on my data collection — and technically sweet.
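The parsing half of a scrape boils down to pulling numbers out of page markup. Here is a toy sketch of that step using only the standard library (the real project used BeautifulSoup, and the HTML fragment and class name here are invented, not Goodreads’ actual markup):

```python
import re

# A stand-in for a scraped review page; "likeItContainer" is hypothetical.
sample_html = '<span class="likeItContainer">253 likes</span>'

def extract_like_count(html):
    """Pull the integer like count out of a review-page fragment."""
    match = re.search(r'(\d+)\s+likes', html)
    return int(match.group(1)) if match else 0
```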

Python multiprocessing includes a class called Pool. You instantiate a Pool, and then call map on it, passing it a function (review_scraper) and list of whatever the function uses as input (review_urls), and it outputs a list of what the function returns.

import multiprocessing as multi

pool = multi.Pool()
reviews = pool.map(review_scraper, review_urls)
pool.terminate()

Wow! Multiprocessing in four lines of code, including the import. Scraping is a very important skill for data scientists, but it’s rarely elegant. Mostly, it involves digging through web pages with inspect to figure out exactly how the designers identified the information you’re after (though if you want to see the code, it is up on github).

With data in hand, it’s time to run initial models. The learning objectives for the first project are centered around linear regression, which is not a particularly sophisticated model, but provides a very useful baseline. Linear regression is fast to calculate, easy to interpret, and has variants like Lasso and Ridge which protect against overfitting. The initial model is pretty bad, with an R² of 0.256 and a Mean Squared Error of 5,883. MSE is a key metric for linear regression: the average squared distance between the modeled points and the actual points. Taking its square root gives the error in plain English, how much you’re wrong by. In this case, the model guesses wrong by about 76 likes on average.
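As a minimal sketch of what the baseline is doing under the hood, here is closed-form ordinary least squares for a single feature in pure Python (the real model used StatsModels over several features):

```python
def fit_ols(xs, ys):
    """Closed-form ordinary least squares for one feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line; its square root is the
    average number of likes the model is wrong by."""
    return sum((slope * x + intercept - y) ** 2
               for x, y in zip(xs, ys)) / len(ys)
```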

Very skewed distributions of features

Looking at the distributions for the target (likes, in blue in the upper left) and several features, we can see that these distributions are very right skewed. And this should not be surprising, because Goodreads is a social networking site, which means that its data are generated by preferential attachment mechanisms and therefore show scale-free behavior and heavy-tailed distributions. Popular things, like book reviews, books, and Goodreads users, get more exposure than things in the great mass, and so become more popular over time. Melanie Mitchell at the Santa Fe Institute explains complexity and scale-free networks in a TED talk, an online course, and a really good book, Complexity: A Guided Tour.

These distributions are not strictly scale-free, failing to fit a power law under Anderson-Darling tests (a common result, with a technical reply), but the distributions are close enough that I thought it worthwhile to log-transform several features. The resulting distributions were much better.
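The log transform itself is one line; using log1p rather than a bare log keeps zero counts (reviews with 0 likes, users with 0 followers) well-defined:

```python
import math

def log_transform(values):
    """Compress a right-skewed count feature; log1p(0) == 0, so zero
    counts stay defined instead of blowing up to -infinity."""
    return [math.log1p(v) for v in values]
```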

Dropping some highly collinear features also improved the model. It turned out that number of friends (as opposed to followers), number of paragraphs (as opposed to word count), and the rating itself added very little.
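Collinear pairs like friends-vs-followers show up as correlation coefficients near 1. A minimal Pearson correlation in pure Python (pandas’ `DataFrame.corr` does the same thing across all columns at once):

```python
def pearson(xs, ys):
    """Pearson correlation between two feature columns; values near
    +/-1 flag candidates for dropping as collinear."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)
```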

Correlation Heatmap

Final R² is 0.467, and Mean Squared Error is 34.

Performance measures from StatsModel

This is solid. The numbers mean that the model captures 46.7% of the variability in the data, and that on average our guess for how many likes a book review will get is off by only about six, the square root of the MSE. Values for skew, kurtosis, and condition number were also reasonable. The residuals plot showed some trends in the data, typical when dealing with counts, which must be positive, but given that this is a messy human-driven system, the model performs quite well.

I checked to see if the model could be further improved in sklearn with Lasso, Ridge, Polynomial Features, and ElasticNet, and found that it couldn’t be. The features dropped by Lasso and Ridge were necessary for good predictions. Polynomial Features gave a tiny increase in metrics at the expense of model parsimony, and Polynomial Features + ElasticNet performed poorly. The simple linear regression is best.

Interpretation and Recommendations

Looking at the correlations and coefficients for the six remaining features, I can draw some clear conclusions about how to write popular book reviews.

  1. Read books that lots of other people are reading.
  2. Get more followers and friends on Goodreads.
  3. Write longer reviews.
  4. Use images.

I’m not going to switch my tastes towards more mass market books. Though as an aside, the top review on Goodreads is Patrick Rothfuss reviewing The Name of the Wind (guy is very good at the internet), and the next five are incendiary takes on Twilight, Harry Potter, The Hunger Games, and The Fault in Our Stars. Being funny and hateful about a popular thing is a reliable route to internet stardom. I can’t magic up more friends, though if you like book reviews, please friend me on Goodreads. It’ll be worth it.

“Michael writes the most cogent, incisive, helpful, and accurate book reviews. I get more out of them than reading the book in the first place.”

—Eric Kennedy, York University

I’m already on the high end for word count, so writing more might be counter-productive. One thing I can do, and have been doing, is start using more images. A picture is worth a thousand words, and while I refuse to go full tumblr-reaction gif, at least one image per post is well worth it.

Of course, book reviews are about content more than raw numbers, so a likely next step is to do natural language processing to see what kind of words and sentences are associated with popular reviews. NLP is part of Metis Project 4, which is still several weeks in the future from this post.

Now get reading!

Data Scientist, PhD, Science Policy, Futurism, Airpower Enthusiast