Finding Books That Should Be Loved But Aren’t Yet, Using Goodreads Ratings
Lots of people have read The Girl With The Dragon Tattoo and its sequels. The series is an international bestseller that has spawned Swedish and American film adaptations, plus continuation novels by David Lagercrantz.
Now who’s read another Nordic Noir novel?
Publishing is just one of many areas dominated by power-law phenomena. A few books become blockbuster successes; there is a shrinking mid-list of regularly profitable authors and a long tail of obscure books. I can’t predict or make blockbusters (I’m a data scientist, not an oracle), but what if we could identify books that would be well-liked if people actually read them? Those books would be reasonable targets for marketing campaigns aimed at moving them up the ladder.
This problem falls under the general categories of collaborative filtering and recommender systems. In plain English, if we know how different members of the population like different books, we can say things like “someone who likes Neuromancer, Snow Crash, and Altered Carbon would also like Hardwired by Walter Jon Williams.” Similarly, we can say that the group of people who like those books probably has little overlap with the people who love Pride and Prejudice.
Mathematically, you can represent this as a matrix of items by users, where the value in row i, column j is the rating user j gave item i. This matrix is both large and sparse. The task is to fill in all the white space with predicted ratings.
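To make that concrete, here is a toy sketch of turning rating triples into a sparse items-by-users matrix; the tiny dataframe and its column names are invented for illustration, not taken from the project.

import pandas as pd
from scipy.sparse import coo_matrix

# Toy example: four ratings from three users on three books
df_ratings = pd.DataFrame({
    "user":   [0, 0, 1, 2],
    "item":   [0, 1, 1, 2],
    "rating": [5, 4, 3, 5],
})
R = coo_matrix((df_ratings["rating"], (df_ratings["item"], df_ratings["user"])))
print(R.shape)      # (items, users)
print(R.toarray())  # mostly zeros: the white space we want to fill in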
This is a special application of Singular Value Decomposition, a common tool for dimensionality reduction. Any large matrix i x u can be written as the product of two smaller matrices, i x f and f x u, plus an error term. f is a hyperparameter: the number of factors, each interpreted as some quality that applies to both the items and the users. Snow Crash is cyberpunk, satire, action, digressive, and has little romance. Michael likes cyberpunk, satire, and digressions, and doesn’t much care for romance. Good match.
Specifically, each value in the big matrix above can be written as the mean rating across the entire matrix, plus bias terms for the user and the item (how generally positive each is), plus the dot product of the factor vectors in the corresponding row of i x f and column of f x u.
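In code, the prediction for one book-reader pair looks like the toy sketch below. Every number here is made up to illustrate the formula; nothing comes from the fitted model.

import numpy as np

mu = 3.9                            # mean rating across the whole matrix
b_i = 0.3                           # item bias: this book rates above average
b_u = -0.1                          # user bias: this reader is a slightly harsh grader
q_i = np.array([0.6, 0.4, -0.5])    # item factors (say: cyberpunk, satire, romance)
p_u = np.array([0.5, 0.3, -0.4])    # user factors: how much the reader likes each quality
r_hat = mu + b_i + b_u + q_i @ p_u  # predicted rating
print(round(r_hat, 2))              # 4.72: a good match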
And if we compare the predicted ratings to the actual ratings wherever there is data, we can define a loss function, which can then be minimized by gradient descent on the factor vectors and bias terms, with a regularization parameter lambda.
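Spelled out as a sketch (with hypothetical parameter arrays, not the project’s actual code), that loss is the squared error over the observed ratings plus an L2 penalty on everything being learned; lam stands in for lambda, which is a reserved word in Python.

import numpy as np

def loss(rated, mu, b_items, b_users, Q, P, lam):
    """rated: iterable of (item index, user index, rating) for the known cells."""
    total = 0.0
    for i, u, r in rated:
        r_hat = mu + b_items[i] + b_users[u] + Q[i] @ P[u]  # prediction as above
        total += (r - r_hat) ** 2                           # squared error on known ratings
        total += lam * (b_items[i]**2 + b_users[u]**2 + Q[i] @ Q[i] + P[u] @ P[u])  # regularization
    return total

Gradient descent then nudges each bias and factor vector in the direction that reduces this total.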
I used a small platoon of scrapers running on AWS to collect 1.2 million data points: over 370,000 books rated by 4,000 users. This is roughly 15 times as many data points as my last project, on analyzing what makes for a popular book review. My scrape simply was not running fast enough to meet the project deadline, but for $12 I was able to set up a half-dozen EC2 instances and finish in a week a scrape that would otherwise have taken a couple of months. On a personal note, as someone who appreciates brute-force approaches, I like having this hammer in my toolbox!
The Surprise package for Python provides an sklearn-compatible API that handles recommendation systems very elegantly. Just feed it a dataframe where each row is (user, item, rating), and a few lines of code are enough to create a recommendation system:
import time
from surprise import SVD, accuracy
from surprise.model_selection import KFold

# data is a surprise Dataset, e.g. Dataset.load_from_df(df[["user", "item", "rating"]], Reader(rating_scale=(1, 5)))
algo3 = SVD(lr_all=0.005, reg_all=0.01, n_factors=40)
kf = KFold(n_splits=3)
for trainset, testset in kf.split(data):
    t0 = time.time()
    algo3.fit(trainset)
    predictions = algo3.test(testset)
    accuracy.rmse(predictions, verbose=True)  # RMSE on the held-out fold
    t1 = time.time()
    print(t1 - t0)  # seconds to fit and score this fold
In fact, Surprise makes it so easy that I decided just building a recommendation system wasn’t interesting enough to call a project. Further, improving recommendation systems is hard: a decade ago, Netflix offered a $1 million prize for a 10% improvement in root mean squared error. Goodreads already has a very nice recommendation system, and since they have all the data, I can’t improve on their recommendations for any specific user.
So instead of recommending books to users, I turned the model around and asked, for each book, what the predicted ratings across all users look like. The predicted rating distribution for each book is a normal distribution (actually normal, I checked), which is specified by a mean and standard deviation. The first version of this code ran for several hours, but reworking it to vectorize the calculation rather than using nested for loops brought that down to about a minute.
import numpy as np
import scipy.stats

# bu, pu: arrays of user biases and user factor vectors from the fitted model
means, stdevs, normals = [], [], []
for book in df_items.itertuples():
    bi = book[2]                 # this book's bias term
    qi = np.array(book[3:])      # this book's factor vector
    factors = qi @ pu.T          # dot against every user's factor vector at once
    ratings = bi + bu + factors  # predicted ratings for all users (global mean omitted as a constant offset)
    means.append(np.mean(ratings))
    stdevs.append(np.std(ratings))
    normals.append(scipy.stats.normaltest(ratings))
If you pick a threshold for book quality, the area of that distribution above the threshold represents the proportion of the population that would like the book.
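In scipy terms, that area is just the normal distribution’s survival function; the mean, standard deviation, and threshold below are made up for illustration.

from scipy.stats import norm

mean, stdev, threshold = 3.6, 0.5, 4.0                   # made-up values for one book
proportion = norm.sf(threshold, loc=mean, scale=stdev)   # area above the threshold
print(f"{proportion:.1%} of readers are predicted to rate this book {threshold} or higher")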
I’ve taken everything above and created a Bokeh dashboard (this may take some time to load, be gentle; and if it is down, email me). You can adjust the sliders to set the range of how many times a book has been rated on Goodreads, the proportion of readers the book should appeal to, and the quality percentile. Color indicates the mean of the distribution. Mousing over a point will display the title of the book and some other information.
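For anyone curious about the mechanics, here is a stripped-down sketch of the pattern such a dashboard uses, run under bokeh serve. The file name and column names are placeholders, not the real app.

import pandas as pd
from bokeh.io import curdoc
from bokeh.layouts import column
from bokeh.models import ColumnDataSource, HoverTool, Slider
from bokeh.plotting import figure

df = pd.read_csv("book_predictions.csv")  # hypothetical per-book results: title, n_ratings, mean
source = ColumnDataSource(df)

p = figure(x_axis_label="times rated on Goodreads", y_axis_label="mean predicted rating")
p.scatter(x="n_ratings", y="mean", source=source)
p.add_tools(HoverTool(tooltips=[("title", "@title"), ("mean", "@mean")]))

slider = Slider(start=5, end=1000, step=5, value=5, title="Minimum times rated")

def update(attr, old, new):
    # Re-filter the displayed sample whenever the slider moves
    source.data = ColumnDataSource.from_df(df[df["n_ratings"] >= slider.value])

slider.on_change("value", update)
curdoc().add_root(column(slider, p))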
Behind the scenes, the Bokeh dashboard displays only a sample of 5,000 points in order to run reasonably quickly. The SVD model only has results for the approximately 30,000 books that have been rated by five or more users; pruning those noisier points halved my RMSE. Books are identified by an MD5 hash of author+title, since ISBNs are more granular than this project actually needs. I’ve included the first five factors as options in the dashboard, but I haven’t been able to figure out whether they correspond to anything that makes sense in the human universe, like genres. Finally, I’d like to thank William Koehrsen, whose series on interactive Bokeh was invaluable.
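As an illustration of that identifier (the exact normalization and separator are details I’m hand-waving here), hashing author plus title looks something like:

import hashlib

def book_id(author, title):
    # Collapse case and whitespace so editions of the same book hash the same way
    key = f"{author.strip().lower()}|{title.strip().lower()}"
    return hashlib.md5(key.encode("utf-8")).hexdigest()

print(book_id("Neal Stephenson", "Snow Crash"))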
Code for this project can be found in the goodreads_rec_engine repository.