An introduction to BetterReads
How to use machine learning & NLP to rapidly summarize book reviews
This post is an introduction to BetterReads, the interactive book review summarization app I built in the spring of 2020. But more generally, it is an explanation of how natural language processing and unsupervised machine learning techniques can be used to distill any set of reviews down into their most important insights.
You can cut to the chase and check the app out for yourself here, watch a video walkthrough and explanation below, view the full code for the app on GitHub, or read on to learn more about how it works.
I love books. Like many book lovers, I always have a tall stack of books I’m currently reading, and an even taller stack of books I’m thinking about reading. Still, it’s often difficult to know what to read next, and I often find myself wishing that I could get a better sense of what a book is like before opening it up.
This shouldn’t be so difficult. Thanks to the Internet and literary social networks like GoodReads and LibraryThing, there are more book reviews out there than ever before, written by actual readers reviewing the books they’ve just read. But the abundance of information is also the problem: there are so many reviews that no one person can possibly sift through them all.
We can do better. With the tools of data science and natural language processing, we can harness the potential of all this information and boil thousands of reviews down to their most valuable insights, all in a matter of seconds. And that’s precisely what BetterReads does, rapidly extracting the most commonly expressed opinions across all of a book’s reviews.
The core technology behind BetterReads is not limited to book reviews, and could be used to summarize any collection of textual data, whether those be product reviews, restaurant reviews, or even survey responses. But there are a few reasons that make reader reviews from GoodReads a particularly good data source for this kind of task.
First, because GoodReads is a social network for readers, its members tend to be well read and, correspondingly, their reviews tend to be well written. This provides us with a particularly rich source of textual data to work with.
Second, because GoodReads is a noncommercial platform, its review pages are not overrun with fake reviews simply trying to inflate the book’s rating. This means that GoodReads reviews, in addition to being high quality, are also a high fidelity source of textual data.
The datasets that the BetterReads app runs on come from two sources: the UCSD Book Graph dataset, which contains over 15 million full-text reviews scraped in 2017, and a self-made web-scraping script, which can collect the reviews from any GoodReads URL. This means that, in principle, we can use our algorithm to summarize the reviews of any book on GoodReads.
The approach: Extractive text summarization
There are many ways one might go about identifying the most commonly expressed opinions across a book’s reviews. You could look at the most frequently occurring words or word pairs, or make a word cloud, or do some topic modelling. BetterReads is different: It approaches the task as an exercise in text summarization.
Text summarization is a technique in natural language processing that does exactly what its name suggests. It is typically used to create synopses of medium-length texts such as news articles or academic papers. What’s distinctive about BetterReads is that it uses this same technique to summarize the “text” that consists of all of a book’s reviews.
Text summarization comes in two flavours: extractive and abstractive. Extractive text summarization generates a summary from the original source text’s own sentences, identifying the most important sentences in the source text and dropping everything else. Abstractive text summarization generates a novel summary, using the source text as a guide but not copying any of the original sentences exactly.
BetterReads uses extractive text summarization. The process consists of five basic steps.
First, we collect all of the book’s reviews together into a single document, or dataset. This complete set of reviews is the “text” that the algorithm is tasked with summarizing.
Second, we split the reviews up into their individual sentences. We do this because each review expresses not just one opinion but several, and we want to find the most commonly expressed opinions across all of the book’s reviews (regardless of what other opinions may be expressed in those reviews). We therefore assume that each of a review’s sentences corresponds to one of the opinions expressed in that review.
(The next three steps are where the real data science comes in. I will get into the details of these steps in the next section, but first let me provide a high-level overview.)
Third, we encode each sentence according to its meaning, or the opinion it expresses. Sentences with similar meanings get encoded similarly, and sentences with different meanings get encoded differently. This is a bit like assigning each sentence a distinct colour, with semantically similar sentences being assigned similar colours. (This is just a metaphor! The actual math is much more complicated, as we’ll soon see.)
Fourth, we find the most commonly expressed opinions in our dataset by looking for the most frequently occurring encodings. To continue the previous metaphor, this is a bit like looking for which colours are most well represented across all of our encoded sentences. We can choose to find whatever number of distinct opinions we like.
Lastly, we select one or more sentences for each opinion by identifying the most representative sentences in each opinion group. This is a bit like finding the strongest or purest version of the colour in each colour group.
These sentences are the algorithm’s outputs. Seeing all these sentences together gives us a quick picture of the most commonly expressed opinions within the full set of reviews.
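To make the first two steps concrete, here is a minimal sketch of pooling reviews and splitting them into sentences. The function names and the regex-based splitting rule are my own illustrative assumptions, not the code the app actually uses; a dedicated sentence tokenizer would handle abbreviations and other edge cases better.

```python
import re


def split_into_sentences(review: str) -> list[str]:
    """Naively split one review wherever ., ! or ? is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", review.strip())
    return [part for part in parts if part]


def reviews_to_sentences(reviews: list[str]) -> list[str]:
    """Pool every review's sentences into one flat list (the "text" to summarize)."""
    sentences = []
    for review in reviews:
        sentences.extend(split_into_sentences(review))
    return sentences


# Two made-up reviews, purely for illustration.
reviews = [
    "The prose is beautiful. The plot drags in the middle!",
    "I loved the characters. Was the ending rushed? Maybe a little.",
]
sentences = reviews_to_sentences(reviews)
for sentence in sentences:
    print(sentence)
```

Each element of `sentences` is then treated as one opinion, regardless of which review it came from.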
The data science: Embeddings, clustering, & filtering
But how does BetterReads work at a deeper level? The complicated steps are the last three mentioned above: encoding the sentences according to the opinion they express, finding the most commonly expressed opinions in the dataset, and selecting each opinion’s most representative sentence or sentences. Let’s take these tasks one by one.
To encode the sentences we use the Universal Sentence Encoder, made freely available by Google through TensorFlow Hub. The Universal Sentence Encoder embeds any string of words into a 512-dimensional vector. In other words, it assigns each sentence a location in 512-dimensional space. Furthermore, the Universal Sentence Encoder is designed in such a way that sentences with similar meanings get placed in similar locations.
Thus, if we encode all of our sentences in this way, we can represent all of our sentences as points in 512-dimensional space, with semantically similar sentences placed in similar regions. This looks sort of like what you see in the visualization above. (This visualization is, of course, a two-dimensional reduction of our 512-dimensional space, but you get the picture.)
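As a rough illustration of what “similar locations” means, the sketch below compares toy vectors with cosine similarity. The three-dimensional vectors are invented stand-ins for real 512-dimensional embeddings, and `embed_sentences` shows one assumed way of calling the encoder through TensorFlow Hub (the version 4 model URL is my assumption; the app may load a different version). That function is defined but not called here, since it needs the `tensorflow_hub` package and a model download.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: near 1.0 means similar direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def embed_sentences(sentences):
    """Assumed usage: embed a list of strings with the Universal Sentence Encoder.

    Requires the third-party tensorflow_hub package; imported lazily so the
    rest of this sketch runs without it.
    """
    import tensorflow_hub as hub

    encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
    return encoder(sentences).numpy()  # array of shape (len(sentences), 512)


# Hypothetical 3-dimensional "embeddings", standing in for 512-dimensional ones.
toy_embeddings = {
    "The writing is gorgeous.": [0.9, 0.1, 0.0],
    "Beautifully written.": [0.8, 0.2, 0.1],
    "The ending felt rushed.": [0.1, 0.2, 0.9],
}

similar = cosine_similarity(
    toy_embeddings["The writing is gorgeous."],
    toy_embeddings["Beautifully written."],
)
different = cosine_similarity(
    toy_embeddings["The writing is gorgeous."],
    toy_embeddings["The ending felt rushed."],
)
print(f"similar pair: {similar:.2f}, different pair: {different:.2f}")
```

The two sentences praising the writing point in nearly the same direction, while the sentence about the ending points elsewhere, which is the property the later steps rely on.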
So the good news is that we now have all our data in numerical form. The bad news is that our data looks like one giant undifferentiated clump. To find the most commonly expressed opinions in this cloud of points, we need to locate the highest density regions in the space, or the regions where the greatest number of points are clustered close together. We accomplish this using k-means clustering, which provides a quick and easy way to approximate these regions. Given a set of vectors and a chosen number k, it divides the data points into k clusters, each organized around its own centre, and we treat these clusters as our k highest density regions.
Here we’ve divided our previous vector space into six clusters, each represented by its own colour. And we can already see some differentiation between these clusters, though only four of them are clearly distinguished.
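For readers who want to see the mechanics, here is a bare-bones sketch of k-means (Lloyd's algorithm) in plain Python, run on made-up two-dimensional points. It is an illustration under simplifying assumptions, with hand-picked starting centres and a fixed number of iterations; in practice one would more likely reach for a library implementation such as scikit-learn's `KMeans`, and I am not claiming this is the code the app runs.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def k_means(points, initial_centres, iterations=20):
    """Alternate between assigning points to their nearest centre and
    moving each centre to the mean of its assigned points."""
    centres = [list(centre) for centre in initial_centres]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the nearest centre for every point.
        labels = [
            min(range(len(centres)), key=lambda j: squared_distance(p, centres[j]))
            for p in points
        ]
        # Update step: recompute each centre (empty clusters keep their old centre).
        for j in range(len(centres)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centres[j] = mean_point(members)
    return labels, centres


# Two obvious clumps of toy points, standing in for sentence embeddings.
points = [
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
    [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],
]
labels, centres = k_means(points, initial_centres=[points[0], points[3]])
print(labels)
print(centres)
```

Here k is set by the number of starting centres; choosing k corresponds to choosing how many distinct opinions we want the summary to surface.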
Now we don’t actually care about all of the points in each cluster. Remember, in the end we only want to identify the most representative sentences in each cluster. This is accomplished simply by finding the points that are closest to the centre of each cluster, since this is the “anchor” of the cluster, where its density is at its highest. (Technically, this is done by calculating the inner product between the cluster centre and all the points in the cluster, and looking for the points whose inner product is the highest, but the precise mathematical details need not concern us here.) Thus we ignore all the points at the peripheries of each cluster, and look only at the points around each cluster centre.
And voilà! Six distinct colour clusters can now be seen. (And remember, this visualization is a two-dimensional reduction; the clusters are even more distinct in the true 512-dimensional space.)
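The selection step described above can be sketched as follows. The sentences, vectors, cluster labels, and centres are all invented for illustration, and the function name `most_representative` is hypothetical. The sketch also assumes the sentence vectors have roughly unit length, since that is what makes a high inner product with the centre a reasonable proxy for being close to it.

```python
def inner_product(u, v):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def most_representative(sentences, vectors, labels, centres, top_n=1):
    """For each cluster, return the top_n sentences whose vectors have the
    highest inner product with that cluster's centre."""
    summary = {}
    for j, centre in enumerate(centres):
        scored = [
            (inner_product(vector, centre), sentence)
            for sentence, vector, label in zip(sentences, vectors, labels)
            if label == j
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        summary[j] = [sentence for _, sentence in scored[:top_n]]
    return summary


# Made-up unit-length 2-d vectors standing in for 512-d sentence embeddings.
sentences = [
    "Gorgeous prose.",
    "The writing is beautiful.",
    "Lovely sentences, though slow.",
    "The ending felt rushed.",
    "It wraps up too quickly.",
    "A hasty finale, but fine prose.",
]
vectors = [
    [1.0, 0.0], [0.96, 0.28], [0.8, 0.6],
    [0.0, 1.0], [0.28, 0.96], [0.6, 0.8],
]
labels = [0, 0, 0, 1, 1, 1]                 # assumed output of a clustering step
centres = [[0.92, 0.2933], [0.2933, 0.92]]  # approximate mean of each cluster

summary = most_representative(sentences, vectors, labels, centres)
for cluster, picks in summary.items():
    print(cluster, picks)
```

Raising `top_n` returns several sentences per opinion group instead of one, which matches the idea of keeping only the points nearest each centre and discarding the periphery.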
Now that you know how the BetterReads app works, don’t forget to try it out for yourself, at bit.ly/betterreads!
Special thanks to Mengting Wan and Julian McAuley of UCSD, for making the Book Graph dataset freely available online; see their papers “Item Recommendation on Monotonic Behavior Chains” and “Fine-Grained Spoiler Detection from Large-Scale Review Corpora”.
Kushal Chauhan’s blog post on “Unsupervised Text Summarization using Sentence Embeddings” was a huge help and a big inspiration early on.
The BetterReads app was made as part of my capstone for the Data Science Diploma Program at BrainStation. Many many thanks to my educators: Myles Harrison, Adam Thorstenstein, Govind Suresh, Patrick Min, and Daria Aza. Thanks also to all my fantastic classmates!