Aspect-Based Opinion Mining (NLP with Python)

Peter Min
9 min read · Jun 6, 2018

If you’ve heard of Uber or Amazon, you may be one of 40 million (Uber) or over 310 million (Amazon) active users of their platforms. At the end of the day, these businesses exist to provide a service, which means that communication and the relationship with customers are crucial to their success. One of the challenges, however, has been the sheer scale of their growing user bases, and the large amount of data becoming available in the form of natural language. This data can be direct feedback from customers on their platforms (i.e. customer reviews or complaints) or even posts on social media platforms such as Twitter, where people regularly tweet their sentiments about businesses (over 250,000 tweets related to Amazon can be collected in a single day!).

The real challenge, then, is for businesses to parse and organize this amount of data into more digestible and actionable insights. Although I’ve mentioned larger businesses, this challenge is not limited to them alone. Even small business owners listed on Yelp, without a dedicated team of data analysts to parse and analyze customer feedback, would benefit greatly from a robust and automated pipeline that can reliably sort that feedback into categories of interest. The way I tackled this problem was through an approach called Aspect-Based Opinion Mining, which I will demonstrate using Yelp restaurant reviews. While my particular implementation is by no means perfect, I hope it provides some insight into how to build an NLP pipeline that derives meaningful insights from a large volume of reviews.

Note: If you are only interested in the potential applications of this approach, feel free to skip to the end and take a look at the resulting charts.

Tools & Dataset Used:

  • spaCy (tokenization, sentence boundary detection, dependency parser, etc.)
  • Scikit-Learn & Scikit-Multilearn (Label Powerset, Multinomial Naive Bayes, Multilabel Binarizer, SGD classifier, Count Vectorizer & Tf-Idf, etc.)
  • Word2Vec & vectors pre-trained on Google’s News dataset
  • Neural Coref v2.0 — a pre-trained neural network model for recognizing and replacing pronouns
  • Annotated Restaurant Reviews (SemEval-2014) — restaurant reviews that were manually labeled into categories
  • Yelp Dataset (kaggle)
  • Opinion lexicon (Minqing Hu and Bing Liu)

Since Python is my go-to language, all of the tools and libraries I used are available for Python. If you’re more comfortable with other languages, there are many other approaches you can take, e.g. using Stanford CoreNLP, which normally runs on Java, instead of spaCy, which runs on Python.

Methods

It’s important to understand that in NLP, granularity matters. By granularity, I simply mean the amount of text (or the level of text structure) in the individual datapoint being analyzed or classified. For example, let’s assume you’re trying to classify a single Yelp restaurant review into one of five aspects: food, service, price, ambience, or simply anecdotal/miscellaneous. You could label the entire review and say that it mentions both food and price. At this level, you would have the most context to make an accurate prediction, but it would require extra steps to find out which particular sentence or word refers to a specific aspect. Also, a review that mostly talks about food with one brief mention of price would be categorized in the same group as one that mostly talks about price with very little about food. At the individual word level, you have the most specificity; maybe a person was dissatisfied particularly with the music, which would fall under ambience. However, you may lose the surrounding context needed to derive an accurate sentiment for that aspect.

One loses context at higher granularity and specificity at lower granularity.

Given the advantages & disadvantages of the different levels of granularity, I opted for a bit of a hybrid approach to get the specificity of what I will refer to as aspect-terms (i.e. music, types of food, etc.) as well as the correct sentiment around each broader aspect category. At the sentence level, I classify each sentence using Naive Bayes, and at the word level, I match words based on the similarity of their word vectors in a word embedding.

It’s probably easiest to understand the process through an example. Let’s take a look at a simple review for a restaurant left by a customer.

I first replace the pronouns in the text using a pre-trained neural coreference model; in this particular review, this step is not really relevant. I then segment the chunk of text into sentences and analyze it sentence by sentence. The first step for a given sentence is to tag it with an aspect using a multi-label Naive Bayes model fit on the annotated restaurant reviews training set (which achieved an 86.6% out-of-sample accuracy on the test data). (*Note: multi-label refers to the fact that each datapoint can have one or multiple classes; there are several ways to approach this problem, but I used a Label Powerset transformation.)
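To make the classification step concrete, here is a minimal sketch of the Label Powerset idea using scikit-learn alone (my pipeline uses scikit-multilearn’s LabelPowerset wrapper, but the transformation is easy to spell out by hand): each unique *set* of labels is mapped to a single class for an ordinary Multinomial Naive Bayes classifier. The training sentences and labels below are invented stand-ins for the SemEval-2014 annotated data.

```python
# Label Powerset sketch: each unique label *set* becomes one class
# for a standard single-label Multinomial Naive Bayes model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for the SemEval-2014 annotated restaurant reviews.
train_sentences = [
    "The pasta was delicious and well seasoned.",
    "Our waiter was rude and forgot our order.",
    "Great food but the service was painfully slow.",
    "Way too expensive for such small portions.",
    "Cozy lighting and nice background music.",
]
train_labels = [
    {"food"},
    {"service"},
    {"food", "service"},   # multi-label: one sentence, two aspects
    {"price"},
    {"ambience"},
]

# The powerset transformation: map each distinct label set to a class id.
powerset = sorted({frozenset(s) for s in train_labels}, key=sorted)
set_to_id = {s: i for i, s in enumerate(powerset)}
y = [set_to_id[frozenset(s)] for s in train_labels]

clf = make_pipeline(TfidfVectorizer(), MultinomialNB())
clf.fit(train_sentences, y)

def predict_aspects(sentence):
    """Return the predicted set of aspect labels for one sentence."""
    class_id = clf.predict([sentence])[0]
    return set(powerset[class_id])
```

Decoding a prediction is just the reverse lookup: the predicted class id maps back to its label set, so a single classifier call can yield multiple aspects at once.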

As one can tell, the first sentence is merely anecdotal and provides no useful value to the business, so the ML-NB classifier categorized it as anecdotal/miscellaneous. The second sentence, shown below, was classified as both service and ambience.

The next step is to identify opinion words by cross-referencing the opinion lexicon of negative and positive words. Once one is found, spaCy’s dependency parser can identify the other words linked to that particular opinion word. This allows you to extract the aspect term, as you will see in the image below. Then, you just need to define a set of rules to assign the correct sentiment score to the opinion word (e.g. flipping the sign of the sentiment when negation words are present), and attach that sentiment score to the aspect term it refers to.
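The rule logic can be sketched in plain Python. To keep the sketch self-contained (a real run needs a downloaded spaCy model), it operates on a hand-annotated parse of “The waiter was not friendly” that mimics spaCy’s token attributes, with a tiny stand-in for the Hu & Liu lexicon; the dependency labels and rules here are deliberately simplified.

```python
# Rule-based sentiment assignment over a dependency parse.
# Tokens are hand-annotated (text, dependency label, index of head
# token) to mimic a parser's output without needing a spaCy model.
POSITIVE = {"friendly", "delicious", "great"}   # tiny stand-in for the
NEGATIVE = {"rude", "loud", "bland"}            # Hu & Liu opinion lexicon

# Simplified parse of "The waiter was not friendly":
# the opinion word "friendly" is the head; "waiter" is its nsubj
# and "not" its negation.
parsed = [
    {"text": "The",      "dep": "det",   "head": 1},
    {"text": "waiter",   "dep": "nsubj", "head": 4},
    {"text": "was",      "dep": "cop",   "head": 4},
    {"text": "not",      "dep": "neg",   "head": 4},
    {"text": "friendly", "dep": "root",  "head": 4},
]

def extract_sentiments(tokens):
    """Map each aspect term to a sentiment score (+1 or -1)."""
    scores = {}
    for i, tok in enumerate(tokens):
        word = tok["text"].lower()
        if word in POSITIVE or word in NEGATIVE:
            score = 1 if word in POSITIVE else -1
            # Flip the sign when a negation word modifies the opinion word.
            if any(t["dep"] == "neg" and t["head"] == i for t in tokens):
                score = -score
            # The nominal subject attached to the opinion word is the
            # aspect term that receives the score.
            for t in tokens:
                if t["dep"] == "nsubj" and t["head"] == i:
                    scores[t["text"].lower()] = score
    return scores

extract_sentiments(parsed)  # → {'waiter': -1}
```

In the real pipeline, spaCy supplies the parse and the rule set is larger, but the shape of the logic is the same: find an opinion word, adjust its polarity for negation, then walk the dependency links to the aspect term.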

By this point, we have a sentiment score for each of the extracted aspect terms: {waiter: -1, music: -1}. The next and final step is to assign these scored aspect terms to the broader aspect categories [food, service, price, ambience, anecdote/misc.]. I do this with the hybrid approach I briefly mentioned above. I first try to assign based on the similarity of the aspect term to the aspect category with word2vec’s n_similarity, using a word embedding pre-trained on the Google News dataset. I set a fairly high threshold for the similarity value, because the model needs to be quite confident given that no context from the surrounding words is taken into account. If the similarity between the two words in the vector space fails to meet the threshold, the algorithm falls back on the category of the entire sentence that was previously assigned by the ML-NB classifier (more details on this decision can be found in my GitHub).
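Here is a toy version of this threshold-and-fallback logic, with made-up 3-dimensional vectors standing in for the 300-dimensional Google News embeddings (the real pipeline queries those through gensim’s n_similarity); the threshold value is illustrative only.

```python
import math

# Made-up 3-d "embeddings"; the real pipeline uses 300-d word2vec
# vectors pre-trained on Google News, queried via gensim.
VECTORS = {
    "waiter":   [0.9, 0.1, 0.0],
    "music":    [0.0, 0.2, 0.9],
    "vibe":     [0.4, 0.4, 0.4],   # deliberately close to no category
    "food":     [0.1, 0.9, 0.1],
    "service":  [0.8, 0.2, 0.1],
    "ambience": [0.1, 0.1, 0.8],
}
SIM_THRESHOLD = 0.8  # illustrative; the real threshold was tuned by hand

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def assign_category(aspect_term, sentence_category,
                    categories=("food", "service", "ambience")):
    """Assign an aspect term to its closest category; fall back on the
    sentence-level ML-NB label when no similarity is confident enough."""
    best = max(categories, key=lambda c: cosine(VECTORS[aspect_term], VECTORS[c]))
    if cosine(VECTORS[aspect_term], VECTORS[best]) >= SIM_THRESHOLD:
        return best
    return sentence_category

assign_category("waiter", "anecdote/misc")  # → 'service'
assign_category("vibe", "anecdote/misc")    # below threshold → falls back
```

The key design point is the fallback: the word-level match is only trusted when it is unambiguous, and otherwise the decision reverts to the classifier that did see the full sentence context.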

Applications for Businesses

Now, what are some potential insights we can get from all of this? I ran this pipeline on a restaurant listed on Yelp with ~1,600 reviews, which took about 20 minutes on my EC2 instance (there’s clearly some room for optimization). From the resulting dictionaries, I used Plot.ly to generate some simple plots. What you see below is the percentage of positive versus negative sentiment for each of the aspect categories at large. This should give the business owner a sense of which aspects of the business customers are satisfied with, and which ones need improvement. For reference, this particular restaurant was rated 4.0 on Yelp.
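As a sketch of the aggregation behind a chart like this (the per-review sentiment data below is invented, and the Plot.ly call itself is omitted):

```python
from collections import Counter

# Hypothetical per-review output of the pipeline: each entry maps an
# aspect category to the sentiment scores of its aspect terms.
review_outputs = [
    {"service": [-1, -1], "ambience": [-1]},
    {"food": [1, 1], "service": [1]},
    {"food": [1], "price": [-1]},
]

def sentiment_shares(outputs):
    """% positive vs. negative mentions for each aspect category."""
    pos, neg = Counter(), Counter()
    for review in outputs:
        for category, scores in review.items():
            for s in scores:
                (pos if s > 0 else neg)[category] += 1
    shares = {}
    for category in set(pos) | set(neg):
        total = pos[category] + neg[category]
        shares[category] = {"pos": 100 * pos[category] / total,
                            "neg": 100 * neg[category] / total}
    return shares
```

Feeding each category’s `pos`/`neg` shares into a stacked bar chart reproduces the kind of plot shown below.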

% of positive & negative sentiments for each aspect

Of course, this only tells you the ratio and not the volume, so you probably also want to find out which aspects people are most opinionated about. As expected for a restaurant, people seem to be mostly concerned with food.

Total # (positive + negative) sentiment in each aspect category

The broad aspect categories are useful for assessing the performance at large, but if we’re looking for actionable insights, we probably want to dig a little deeper. Since we parsed the specific aspect terms using spaCy’s dependency parser, we have the dictionary available for further analysis. For example, we can simply look at the aspect terms with the most positive or most negative sentiment, and pick out a few that are meaningful (as there’s a fair amount of noise as well).

Now we have actionable insights that the business can use to address the concerns of customers who were dissatisfied with certain aspects of the restaurant. Maybe it’s time to change the music that’s been playing, or stop serving those pickles!

Applications for Customers

What about for the customers? Let’s say we’re trying to decide between two nearby restaurants with very similar ratings and numbers of reviews. If you’re like me, you probably look at the rating, some pictures, and then skim through a couple of the top or most recent reviews to get a sense of people’s sentiment about the restaurant. But what if we could generate an unbiased comparison, over all of the restaurants’ reviews, in each of the aspect categories we care about?

Radar plot of two restaurants of similar type and ratings. The metric used was #pos/#neg sentiment scores.

By generating a simple radar plot for each of the aspect categories, one can see that, as expected, customers of the one-dollar-sign ($) restaurant are relatively more satisfied with the price, given that it’s cheaper than the second restaurant. However, the general sentiment in the other categories seems to favor the second, two-dollar-sign ($$) restaurant. Given this more comprehensive view of review sentiment in each aspect category, customers can make smarter decisions more quickly, without having to do much work.
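The radar metric itself is just the ratio of positive to negative mention counts per category; with invented counts for two hypothetical restaurants:

```python
# The radar-plot metric: #pos / #neg sentiment scores per category.
# The (positive, negative) counts below are invented for illustration.
restaurant_a = {"food": (80, 20), "service": (50, 25),
                "price": (60, 15), "ambience": (30, 30)}
restaurant_b = {"food": (120, 30), "service": (90, 18),
                "price": (40, 40), "ambience": (55, 11)}

def pos_neg_ratio(counts):
    """One ratio per aspect category, ready to plot on a radar axis."""
    return {cat: pos / neg for cat, (pos, neg) in counts.items()}
```

Each restaurant’s ratios then become one trace on the radar chart, with the aspect categories as the axes.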

Discussion/Future Work

I laid out a preliminary framework for a potential approach to aspect-based opinion mining, but there are a ton of improvements to be made. In fact, I wanted to create a Flask app to demonstrate the exception cases where my sentiment analysis fails to assign the sentiment value to the correct dependency word, but I simply didn’t get to finish writing the code within the timeframe (I will post a link to the app as soon as I host it, if there’s enough interest). Despite the potential pitfalls, the algorithm performs fairly well on simple test cases, and I am excited for this type of approach to be utilized more often by businesses in the future, as it takes minimal resources to run the pipeline once a robust framework has been built. It should be fairly easy to optimize and tweak the pipeline in ways that suit the specific needs of the user.
