We at Hopper have always appreciated the importance of collecting and analyzing user behavior data, and we have a lot of it. Users set up 4.5 million “Watches” per month, telling Hopper which trip they’d ultimately like to book and trusting the app to monitor prices for them. Setting up a “Watch” opts users into notifications on price changes, flash sales, and alternative deals. We send around 1.5 million push notifications per day, and the average Hopper user launches the app once every six days. All this amounts to billions of data points on user behavior per month. It is important to note that we never distribute personal information.
We use all of this data in many ways — from improving our forecasts to building recommendation systems that help travelers find the perfect flight (in fact, more than 20% of our sales are driven by these recommendations: trips that users never asked for but the app knew to suggest) — all to better understand how we can make Hopper more valuable to our users.
In a perfect world, we would interview each Hopper user and ask them specifically what they like about the app, and what they’d like us to change. But with over 40M users around the world, collecting this type of explicit feedback would take a while. Instead, we need to infer users’ successes and pain points from behavioral data — which acts as a form of “implicit feedback.”
Although we learn a lot about our users from behavioral data, deciphering how users feel from implicit feedback will always be more difficult than simply letting them tell us themselves. That’s why whenever we have the chance to utilize explicit feedback, we make the most of it.
One such source of explicit feedback is app reviews: hundreds of thousands of users have given Hopper a star rating, often accompanied by a free-form text review describing what they do or don’t like about the app.
There’s no limit to the ways users can deliver feedback in a text-based review. That freedom makes this natural-language dataset very powerful, but also difficult to work with: it’s far less structured than tabular data.
One way to learn from review data is to read the reviews manually. Although this yields valuable anecdotes, the sheer volume of reviews arriving each day makes it impossible to aggregate high-level insights into users’ satisfaction levels, qualms, and requests by reading alone. To extract these insights, we must employ Natural Language Processing techniques designed to surface patterns in large text datasets.
Let’s break down how we use a technique called Sentiment Analysis to learn from unstructured data in our app reviews, and how we use these learnings to make Hopper users’ lives better when planning travel.
How “Happy” is this word? Dictionary-based sentiment analysis on reviews
“Sentiment Analysis” is the automatic process of extracting the attitude of an author towards their subject matter from written or spoken language. In our context, we want to understand a reviewer’s opinion — either positive or negative — about Hopper from their review.
Unlike other natural language datasets, such as Tweets or Reddit comments, app reviews come “pre-labeled” with a summary of the author’s sentiment — namely the star rating. Presumably, authors who leave a five-star rating have a more positive outlook towards Hopper than those who leave a one-star review. We monitor our star ratings to make sure that overall sentiment towards Hopper is improving over time.
While star ratings provide a summary of how users feel about their experience, they don’t tell us why users feel the way they do. What about the app do they like and want to see more of? What frustrates them about the app, and what would they like to see changed?
To address these questions, we need to dig into the reviews themselves and try to understand what led the user to give the rating they did. If we assume that the overall sentiment of a review is composed of the sentiment of each word in that review, then the positive/negative words in a review give us a clue into what led the user towards their overall sentiment — the star rating.
To retrieve the sentiment of each word, we can use a hand-labeled dictionary, which assigns a “sentiment score” to each word. One popular dictionary is the AFINN Lexicon, which labels each word in the lexicon with a sentiment score between -5 (most negative) and 5 (most positive).
For example, consider two reviews with one- and five-star ratings, respectively. Positive words (AFINN score greater than 0) are highlighted in green, while negative words (score less than 0) are highlighted in red:
As expected, users who leave reviews with poor ratings are more likely to use negative words in their reviews, while users who leave positive reviews use more positive words. This supports the idea that a review’s overall sentiment can be approximated as the sum of the sentiment of each constituent word.
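To make this concrete, here’s a minimal sketch of dictionary-based scoring in Python. The lexicon below is a tiny hand-picked stand-in for AFINN (the scores are illustrative, not the exact lexicon values), and `review_score` is a hypothetical helper name:

```python
# Minimal sketch of dictionary-based sentiment scoring. The scores
# below are illustrative stand-ins, not the exact AFINN values.
import re

SENTIMENT = {
    "love": 3, "great": 3, "easy": 2, "intuitive": 2, "helpful": 2,
    "expensive": -2, "convoluted": -2, "useless": -2, "worst": -3,
}

def review_score(text: str) -> int:
    """Approximate a review's sentiment as the sum of its words' scores."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(SENTIMENT.get(w, 0) for w in words)

print(review_score("Love this app, so easy to use"))      # 3 + 2 = 5
print(review_score("Worst app, flights were expensive"))  # -3 + -2 = -5
```

In practice the full AFINN lexicon (a few thousand entries) would replace the toy dictionary, and tokenization would need to handle negation and punctuation more carefully.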
Looking at the two example reviews above, one can see how identifying strongly negative and positive words in a review tells us why the user feels negatively or positively about Hopper. In the first example, the words expensive and convoluted hint that this user was not impressed with the prices and flight options they saw. In the second, the words easy and intuitive show that the user was pleased by a smooth experience on Hopper.
By investigating the commonly used words with positive and negative sentiment, we can better understand which aspects of Hopper people appreciate and features we should continue to build out, and where we need to improve the most.
Positive reviews: What we should do more of
Focusing on reviews with four- or five-star ratings, we can look at the most common words with non-zero sentiment scores (words carrying either positive or negative sentiment, rather than neutral ones).
All of these words are of positive sentiment, and we can quickly get a feeling for why positive reviewers like Hopper. They love how easy it is to find the best, cheap flights, and find Hopper helpful and accurate.
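Counting these words is a straightforward aggregation. Here is a sketch under the same toy-lexicon assumption; the reviews and scores are hypothetical examples, not real Hopper data:

```python
# Count sentiment-bearing words in positive (4-5 star) reviews.
# Both the reviews and the lexicon subset are hypothetical examples.
import re
from collections import Counter

SENTIMENT = {"easy": 2, "love": 3, "great": 3, "cheap": 1, "helpful": 2, "bad": -3}

reviews = [
    (5, "Love how easy it is to find cheap flights"),
    (4, "Great app, very helpful and easy to use"),
    (2, "Bad experience, prices went up"),
]

positive_words = Counter(
    w
    for stars, text in reviews
    if stars >= 4
    for w in re.findall(r"[a-z]+", text.lower())
    if SENTIMENT.get(w, 0) != 0  # keep only non-neutral words
)

print(positive_words.most_common(3))
```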
Although word frequencies give us an intuition for common themes in positive reviews, we can’t deduce the context in which these words are used. We want to know what users love, what they find easy, and so on.
To understand the context in which words are used, we can use a visualization called a Co-occurrence Graph. This visualization shows words as nodes, and connects words that are frequently used in the same review with edges. Edges “pull” words together, so that words that are spatially close to each other in the graph are either frequently used together or share words that they are frequently used with, and thus are likely to be semantically similar.
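The underlying data for such a graph is just a table of co-occurrence counts. Here is a stdlib-only sketch of how those counts could be built (the reviews are hypothetical):

```python
# Build word co-occurrence counts: an edge links two words that appear
# in the same review; its weight is the number of reviews they share.
# The reviews below are hypothetical.
import re
from collections import Counter
from itertools import combinations

reviews = [
    "watch flights and get notifications on the best prices",
    "notifications helped me save money on flights",
    "great prices, saved money",
]

edges = Counter()
for text in reviews:
    words = sorted(set(re.findall(r"[a-z]+", text.lower())))
    for pair in combinations(words, 2):  # every unordered word pair
        edges[pair] += 1

print(edges[("flights", "notifications")])  # 2 (co-occur in two reviews)
```

A graph library such as networkx could then turn these weighted pairs into nodes and edges, and a force-directed layout (e.g. `spring_layout`) produces the pulling-together effect described above.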
If we study this graph, interesting stories begin to arise from the connections. For example, consider the following subgraph of highly connected nodes:
This paints a clear picture of what users who leave positive reviews enjoy about Hopper: They like how by watching trips they receive notifications about the best prices, which helps them save time and money. This view helps to validate that our users find value in our Watch feature and push notifications, and that continued work on our forecasting and notification algorithms is worthwhile in our users’ eyes.
Negative reviews: We still have our work cut out for us
Looking at the common words in negative reviews (three stars or fewer) paints a more nuanced picture. While words with negative sentiment, such as expensive and useless, are much more common, the sentiment of commonly occurring words such as cheaper and better is context-dependent.
Words like “cheaper” and “better” are called comparatives: they express a comparison between two or more entities. This raises the question: what are our users comparing?
Perhaps unsurprisingly, our users are comparing our flights to those of our competitors; common comparators such as “better” and “cheaper” frequently co-occur in our reviews with the names of our competitors. The chart below shows that reviews that use a comparator (red) are much more likely to mention one of our competitors than reviews that don’t use a comparator (blue).
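A sketch of how such a comparison could be computed. The review texts and the competitor list are hypothetical placeholders, not real data:

```python
# Flag each review for (a) use of a comparative and (b) mention of a
# competitor, then compare competitor-mention rates between the two
# groups. Reviews and the competitor list are hypothetical placeholders.
import re

COMPARATORS = {"better", "cheaper"}
COMPETITORS = {"othertravelapp"}  # placeholder name, not a real competitor

reviews = [
    "found cheaper flights on othertravelapp",
    "othertravelapp had better prices",
    "great app, easy to use",
    "prices were cheaper last week",
]

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def competitor_rate(group):
    """Fraction of reviews in `group` that mention a competitor."""
    if not group:
        return 0.0
    return sum(bool(tokens(r) & COMPETITORS) for r in group) / len(group)

with_cmp = [r for r in reviews if tokens(r) & COMPARATORS]
without_cmp = [r for r in reviews if not tokens(r) & COMPARATORS]

print(competitor_rate(with_cmp), competitor_rate(without_cmp))
```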
Furthermore, bad reviews that use comparators are much more likely to mention competitors than good reviews that use comparators. This suggests that finding “better” or “cheaper” flights elsewhere is a very negative experience for Hopper users and motivates them to leave negative reviews.
There’s a lesson in this finding. Our users — whose price sensitivity plays a large role in why they use Hopper in the first place — feel betrayed when they find better prices elsewhere. As such, no improvement in our design, alternative trip recommendations, or marketing will make a difference if our prices are not better than our competitors’. This taught us that we need to focus our efforts on ensuring our prices are competitive, every time.
Using these findings to inform product decisions
Through this sentiment analysis and other analyses, we’ve validated that watching trips and receiving price alerts is a great way to help users save time and money when making travel plans. We also learned, however, that finding a lower fare or a better flight on another platform leads to a disproportionately negative experience, and that we must strive to offer competitive prices in all cases.
Empowered by these insights, we prioritized projects that expose the best and cheapest flights to our users. We continuously test and improve our price forecasting algorithms, so that users who watch flights are more likely to save big bucks. We also work hard on our app design to ensure that users understand any restrictions that apply to the ticket that they’re purchasing.
And the impact speaks for itself. In 2018, we increased the average dollars saved per watch substantially, which correlates with a larger proportion of users leaving five-star reviews, and fewer reviewers complaining that fares are cheaper on other sites.
Explicit Feedback: The Next Frontier
By using simple text-processing techniques, we can extract rich and actionable insights from our app reviews data.
This attests to the broader power of explicit feedback: removing the uncertainty about user intent that is inherent in implicit feedback yields much richer learnings.
It’s a challenge to collect and process explicit feedback in a mobile application in a way that feels natural and helpful to users. At Hopper, we are working on several projects that aim to do just that, whether an in-app survey whose responses inform our recommendation systems, or monthly customer-satisfaction questions delivered through push notifications and used to compute user health metrics. We want to let our users provide feedback that makes a real, visible impact on Hopper’s behavior.
But that’s the topic of a future post…
We’re on the lookout for smart problem solvers to join our team. Interested? Check out Hopper’s current openings.