Improving Topic Identification with k-means Clustering

Reputation Datascience
Reputation.com Datascience Blog
Apr 14, 2018

It is pretty obvious nowadays that reviews are powerful. We can tell what people care about from what they write in their reviews, and we can see what reviewers like or dislike by analyzing the sentiment around the most-discussed topics. One issue that arises in topic modeling is that people use different words for the same thing, so a single topic gets split and topics overlap. Two reviews can discuss the same topic in different wording: one person writes “cafe” and another writes “restaurant,” but they may be talking about the same thing.

There are various techniques that can help cluster words together to avoid splitting a single topic. This post focuses on GloVe, an unsupervised learning algorithm for obtaining vector representations of words, developed by Jeffrey Pennington, Richard Socher, and Christopher D. Manning at Stanford. We use Stanford’s pre-trained word vectors to detect words that are similar to each other.

Let’s look at an example of how GloVe can improve topic identification in a certain collection (corpus) of reviews.

Goal

The goal of this analysis is to better identify when a review is talking about a particular topic even if it uses different phrasing. Specifically, we are going to look at reviews in the restaurant/bar industry.

Motivation

We can take a look at the most frequent words in a corpus of reviews. After doing simple tokenizing, here are the top 20 words and their raw frequency in 4,842 restaurant reviews:
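The counting step itself can be sketched in a few lines. This is only an illustration: the post does not say which tokenizer was used, so the lowercase regex tokenizer, the function names, and the sample reviews below are all assumptions, and no stop-word handling is included.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def top_words(reviews, n=20):
    """Return the n most frequent tokens across all reviews with raw counts."""
    counts = Counter()
    for review in reviews:
        counts.update(tokenize(review))
    return counts.most_common(n)

# Hypothetical sample reviews, for illustration only (not the post's data).
sample_reviews = [
    "Great beer and friendly staff.",
    "The staff was slow but the beer was cold.",
    "Nice place for a drink.",
]
top = top_words(sample_reviews, n=4)
```

On a real corpus, `top_words(reviews, n=20)` would give the kind of top-20 list described above.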

Now, there is already some overlap in these words. For example, beer and drink belong to the same topic but are phrased differently. How can we build a vocabulary with less overlap between words?

Methodology (k-means clustering)

Since there is some overlap, it would be beneficial to group similar words so that the generated topics are more coherent. A clustering technique is helpful here. Specifically, we used k-means clustering, which partitions the data points (in this case, our words) into k clusters by minimizing the distance between each word and the centroid of its cluster. We used Euclidean distance and chose k from n, the number of words to cluster, and a reduction factor, f, where k = n*f.
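To make the procedure concrete, here is a minimal sketch of k-means (plain Lloyd's algorithm) with k derived from n and f. It is not the code used for this analysis: the function names, the rounding of n*f to an integer, the random initialization, and the toy 2-D points are assumptions for illustration.

```python
import math
import random

def choose_k(n, f):
    """Number of clusters from n data points and reduction factor f.
    Rounding to an integer (and a floor of 1) is an assumption."""
    return max(1, round(n * f))

def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    rng = random.Random(seed)
    # Initialize centroids with k distinct points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy 2-D points standing in for word vectors (hypothetical data).
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
k = choose_k(len(points), 0.5)
labels = kmeans(points, k)
```

In practice a library implementation would likely be a better choice for thousands of 50-dimensional vectors; the sketch only shows the assignment and update steps.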

Now how do we get the distance between words? Luckily for us, we aren’t the first people who have wanted to analyze the distances between words. As mentioned above, we have a pre-trained word vector for each word in the corpus. This vector can be used as the word's coordinates in Euclidean space. We used 50-dimensional vectors for simplicity and speed.
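As a rough sketch of this step: the pre-trained GloVe files are plain text, with one word per line followed by its space-separated vector components. The parser below reads that format and computes a Euclidean distance. The function name and the made-up 3-dimensional toy vectors are assumptions for illustration; real vectors would have 50 components.

```python
import math

def load_vectors(lines, vocabulary=None):
    """Parse GloVe-style text lines ("word v1 v2 ...") into {word: [floats]}.
    If vocabulary is given, keep only the words it contains."""
    vectors = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if vocabulary is None or word in vocabulary:
            vectors[word] = [float(v) for v in values]
    return vectors

# Made-up 3-dimensional vectors in the same text format (not real GloVe values).
toy_lines = [
    "cafe 0.9 0.1 0.0",
    "restaurant 1.0 0.0 0.0",
    "husband 0.0 1.0 1.0",
]
vectors = load_vectors(toy_lines)

# Euclidean distance between two words' vectors.
d_close = math.dist(vectors["cafe"], vectors["restaurant"])
d_far = math.dist(vectors["cafe"], vectors["husband"])
```

With a downloaded file, the same function could be called on an open file handle, for example `load_vectors(open(path, encoding="utf-8"), vocabulary=set(words))`, where `path` and `words` are placeholders for the vector file and the corpus vocabulary.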

Additionally, we can reduce the number of meaningless clusters by computing the average Euclidean distance within each cluster. If this average distance is greater than the average across all clusters that have at least two words, we throw out the cluster. Of our 2,704 clusters, 786 had at least two words. Across these 786 clusters, the average Euclidean distance was 1.755. The 344 clusters whose average distance exceeded 1.755 were discarded, leaving 442 multi-word clusters. Below are the clusters that were meaningless with their corresponding average Euclidean distance.

The following shows good clusters that were produced and their corresponding average Euclidean distance:
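The discard rule from this section can be sketched as follows. One assumption is made loudly here: the post does not say whether the "average Euclidean distance within each cluster" is measured between members and the centroid or between pairs of members, so this sketch uses the mean distance to the centroid. The function names and toy clusters are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def average_spread(vectors):
    """Mean Euclidean distance from each member to the cluster centroid
    (an assumed reading of "average distance within a cluster")."""
    c = centroid(vectors)
    return sum(math.dist(v, c) for v in vectors) / len(vectors)

def split_clusters(clusters):
    """clusters: {cluster_id: {word: vector}}.
    Returns (kept_ids, discarded_ids, threshold) for multi-word clusters."""
    spreads = {
        cid: average_spread(list(members.values()))
        for cid, members in clusters.items()
        if len(members) >= 2  # single-word clusters are not scored
    }
    if not spreads:
        return set(), set(), None
    # Threshold: mean spread over all clusters with at least two words.
    threshold = sum(spreads.values()) / len(spreads)
    kept = {cid for cid, s in spreads.items() if s <= threshold}
    return kept, set(spreads) - kept, threshold

# Toy 2-D clusters (hypothetical vectors, not GloVe values).
clusters = {
    "a": {"beer": [0.0, 0.0], "drink": [0.0, 2.0]},  # tight cluster
    "b": {"wait": [0.0, 0.0], "lunch": [0.0, 6.0]},  # loose cluster
    "c": {"place": [5.0, 5.0]},                      # single word
}
kept, discarded, threshold = split_clusters(clusters)
```

Here the tight cluster is kept and the loose one is discarded, mirroring the 1.755 cutoff used above.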

Results

After we remove the bad clusters and keep the good clusters, we can replace the words in each good cluster with the most frequent word in that cluster. For example, say we have the cluster [beer, drink, drinks]. The other words in this cluster, beer and drinks, are replaced by the most frequently mentioned word, drink. This allows us to see the most frequent topics instead of individual words. Our new most frequent word groups are as follows:
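The replacement step can be sketched like this, using the [beer, drink, drinks] cluster from the example. The function names and the token list are hypothetical, chosen only so that drink is the most frequent member.

```python
from collections import Counter

def build_replacements(clusters, counts):
    """Map every word in a kept cluster to that cluster's most frequent word."""
    mapping = {}
    for words in clusters:
        representative = max(words, key=lambda w: counts[w])
        for w in words:
            mapping[w] = representative
    return mapping

def merged_counts(tokens, mapping):
    """Count tokens after replacing clustered words with their representative."""
    return Counter(mapping.get(t, t) for t in tokens)

# Hypothetical token stream; "drink" is the most frequent cluster member.
tokens = ["beer", "drink", "drink", "drinks", "staff", "drink", "beer"]
counts = Counter(tokens)
mapping = build_replacements([["beer", "drink", "drinks"]], counts)
merged = merged_counts(tokens, mapping)
```

Words that belong to no kept cluster, such as staff here, pass through unchanged, so their counts stay the same while clustered words pool their counts.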

Analysis

There are a few important things to note from these results:

  • Drink and beer get clustered together, which increases their combined frequency and ranking
  • Wait, ask, waitress, husband, and lunch are also ranked higher because of their associated clusters
  • Place, time, staff, and order all have decreased rankings because they did not cluster with any other words

Each of these words can be considered a topic that includes everything within its cluster. Using k-means clustering to group these words takes into account the different ways people can say things (and not just synonyms). Grouping words together makes it easier to detect when people are talking about a certain topic, and it allows topics that can be expressed in many different ways to surface. However, word clustering is not perfect and should not be the only tool used to classify topics. We are continuing to work on this problem in order to find better ways to group similar customer feedback and draw insights for our clients.

Author: Sara Mahar

Originally published at tech.reputation.com on April 14, 2018.
