Using Natural Language Processing to Classify and Analyze User Feedback

Illustrations by Mengdi Zhang

Hello! My name is Jayant Madugula and I’m currently a senior at Columbia University studying Computer Science. This was my second summer interning with the iOS team at Jet. During my first summer at Jet, I saw just how important customer feedback was to design and development. As a result, Jet collects a large amount of text-based feedback. My project this summer was to build a system that could automatically categorize and analyze this large (and constantly expanding) corpus of text data. The resulting pipeline could then be used to streamline existing workflows that categorize customer feedback, analyze trends in it, and respond to it.

The project had two main parts. The first was to build an aggregate analysis tool that augments existing analysis using feature extraction (described below) and frequency analysis. For the second part, I built a classification system that makes a series of yes/no decisions on whether a review belongs to various categories. Examples of these categories include shipping, pricing, and app experience. Both parts of this project rely on the same basic process: data collection to preprocessing to feature extraction. The second part of the project appends a classification step.

Tools

For both parts, I used a series of natural language processing (NLP) techniques and machine learning models. I used Python 3 along with a number of libraries focused on NLP and machine learning to implement the project. I mainly depended on spaCy, textacy, scikit-learn, pyLDAvis, and pandas.

spaCy and textacy are high-performance NLP libraries that helped me with preprocessing, feature extraction, and classification. I also used scikit-learn for classification. pyLDAvis offered an easy way to visualize topic models of the data, which helped me understand which clusters of words were most important. Finally, pandas allowed me to efficiently import, manipulate, and save data.

Data

The project began with gathering data. Since I was an iOS intern, the Jet iOS app’s reviews were a natural first source. However, we didn’t have access to enough reviews, forcing me to look for additional data.

Net Promoter Score (NPS) reviews turned out to be exactly what I needed. Customers who submit NPS feedback are asked how likely they are to recommend Jet on a scale from 1 to 10 and are presented with an open text field for a full review. This format is very similar to iOS reviews, allowing me to use similar approaches on both datasets. The NPS reviews were manually tagged as well, which greatly helped with training the classification models.

Of course, there were important differences between the iOS App Store reviews and the NPS data. Customers heavily focused their NPS feedback on the pricing, shipping, and delivery experiences. Understandably, iOS reviews discussed the app experience more frequently. Also, while we send a survey asking for NPS responses, our iOS app doesn’t prompt users for reviews. Therefore, app reviews are left by customers who actively go to the Jet App Store page. This means we don’t know where these customers are in the shopping or app experience.

Now that I had enough data, the next step was finding actionable information within the unstructured text.

Preprocessing

The goal of preprocessing is to standardize the data in order to improve future analysis. One of the first trends I noticed in both the NPS and iOS data was a combination of casual and formal language. While some reviewers left long reviews with proper grammar, spelling, and sentence structure, many other reviewers wrote in far more conversational language.

Contractions were one of the most important differences between casual and formal reviews. They directly affected the aggregate analysis system, since it relied on the frequency of various tokens. For example, “don’t” and “do not” would be counted separately despite having identical meanings, potentially skewing results. To standardize contractions, I downloaded a table of contractions from Wikipedia and mapped each one to its expanded form (step 1 in illustration). I settled on this approach through experimentation as well as existing information about similar NLP tasks.
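A minimal sketch of contraction expansion, assuming a small hand-written mapping in place of the Wikipedia table. The dictionary contents and the function name expand_contractions are illustrative, not taken from the original project.

```python
import re

# Hypothetical, hand-picked subset of a contraction table (assumption);
# the project described here used a fuller table taken from Wikipedia.
CONTRACTIONS = {
    "don't": "do not",
    "didn't": "did not",
    "can't": "cannot",
    "won't": "will not",
    "i'm": "i am",
}

# One alternation pattern, longest keys first; \b keeps matches on word boundaries.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(CONTRACTIONS, key=len, reverse=True))
    + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text):
    """Replace each known contraction with its expanded (lowercase) form."""
    # Normalize the curly apostrophe so "don’t" and "don't" hit the same key.
    text = text.replace("\u2019", "'")
    return _PATTERN.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


print(expand_contractions("I don\u2019t know why it didn't ship"))
# I do not know why it did not ship
```

Ambiguous contractions (for example "it's", which can mean "it is" or "it has") are left out of this toy table on purpose; a real table needs a policy for them.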

I also performed more standard preprocessing steps: lowercasing (step 2), removing stop words, and lemmatization. Lowercasing removes distinctions that come only from where a word sits in a sentence. For instance, “Shipping” (as the first word in a sentence) and “shipping” (anywhere else in a sentence) should, for our purposes, be treated as the same token. This has the side benefit of bridging the gap between formal and casual language as well, since many casual reviews were written in all upper or all lower case. Stop words are words that appear very often without contributing much to the meaning of a sentence, potentially skewing training or general analysis. A few common examples of stop words include “the”, “and”, and “I”. I used standard stop word lists from NLTK and spaCy’s en_core_web_lg model.
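As a rough illustration of lowercasing and stop word removal, here is a sketch that uses a tiny hand-written stop word set and whitespace tokenization. Both are simplifying assumptions; the NLTK and spaCy lists mentioned above are far longer, and a real tokenizer handles punctuation more carefully.

```python
# Tiny illustrative stop word set (assumption), standing in for the
# NLTK and spaCy lists used in the project.
STOP_WORDS = {"the", "and", "i", "a", "was", "is", "it", "to"}


def lowercase_and_filter(text):
    """Lowercase, split on whitespace, strip edge punctuation, drop stop words."""
    tokens = [t.strip(".,!?;:\"'()") for t in text.lower().split()]
    return [t for t in tokens if t and t not in STOP_WORDS]


print(lowercase_and_filter("Shipping was fast and the shipping box was intact."))
# ['shipping', 'fast', 'shipping', 'box', 'intact']
```

Note that "Shipping" and "shipping" end up as the same token, which is the point of the lowercasing step.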

Finally, lemmatization (step 3) allows us to treat different tenses of the same word as an identical token, while respecting the word’s part of speech. Thus, “shipping” and “shipped” would be treated as the same token. The value of this approach turned out to be immense, as different tenses would often be used in different contexts by our customers. For instance, in a sample of 5,000 NPS reviews, reviews that contained the word “shipping” had a significantly higher average score than reviews that contained the word “shipped”. Furthermore, lemmatization helps distinguish between examples such as “they ship packages in two days” and “I am really happy with the toy ship I bought”.
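To make the idea concrete, here is a toy, dictionary-based lemmatizer keyed on (word, part of speech). It is only a sketch: the lookup table and the part-of-speech tags are written by hand for illustration, whereas a library such as spaCy predicts both from context.

```python
# Toy lookup table keyed by (lowercased word, part of speech). Entries are
# illustrative assumptions, not output from any real lemmatizer.
LEMMAS = {
    ("shipping", "VERB"): "ship",
    ("shipped", "VERB"): "ship",
    ("ships", "VERB"): "ship",
    ("ships", "NOUN"): "ship",
    ("packages", "NOUN"): "package",
}


def lemmatize(tagged_tokens):
    """Map (word, pos) pairs to (lemma, pos); unknown words pass through lowercased."""
    return [(LEMMAS.get((w.lower(), pos), w.lower()), pos) for w, pos in tagged_tokens]


# Hand-tagged examples: the verb and the noun share the lemma "ship",
# but keeping the tag next to the lemma keeps the two senses apart.
print(lemmatize([("They", "PRON"), ("ship", "VERB"), ("packages", "NOUN")]))
print(lemmatize([("toy", "NOUN"), ("ship", "NOUN")]))
```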

The final step was removing personal or highly specific text from the corpus. This included email addresses, locations, numbers, and emoji. While an email address is useful for Jet Heads communicating with customers, it does not help an automated system looking at hundreds of thousands of reviews. However, the fact that a customer left their email in a review may have a bearing on what that review contains, so we replace these specific examples with a generic token (e.g., lizzy@jet.com becomes *email*).
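One way to do this kind of replacement is with regular expressions. The sketch below covers only emails and numbers; the patterns are deliberately simple, and the *number* placeholder is an assumed name (only *email* appears in the text above).

```python
import re

# Deliberately simple patterns (assumptions); real email and number formats
# are messier than this.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")


def mask_specifics(text):
    """Replace emails and numbers with generic placeholder tokens."""
    # Emails go first so digits inside an address are not masked separately.
    text = EMAIL_RE.sub("*email*", text)
    text = NUMBER_RE.sub("*number*", text)
    return text


print(mask_specifics("Email lizzy@jet.com about order 12345"))
# Email *email* about order *number*
```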

Feature Extraction

Once we have preprocessed the text, the next step is feature extraction. Feature extraction is the process of pulling relevant information (“features”) out of data. We can look at the features we get from this step and run frequency analysis on them, greatly helping with our aggregate analysis. For classification, extracting more features means we can input additional features in our training and testing data, hopefully resulting in better classification models.

For this project, I focused on the following features:

  1. ngrams (1, 2, & 3)
  2. noun chunks
  3. subject-verb-object (SVO) triples
  4. key terms, via the SGRank algorithm

ngrams can be thought of as n consecutive words. Thus, a unigram (1-gram) is simply a single word, a bigram is a pair of words, etc. Noun chunks are phrases centered on a noun with surrounding descriptors attached. SVO triples capture who did what to what: the subject, verb, and object of a clause. Key terms are terms in a given document that are, according to the SGRank algorithm, likely to be important to the meaning of that document.
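For the simplest of these features, a short sketch of ngram extraction and frequency counting over already-tokenized reviews. The sample tokens are made up for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Made-up, already-preprocessed reviews (assumption).
reviews = [
    ["fast", "shipping"],
    ["fast", "shipping", "great", "price"],
]

# Frequency analysis: count bigrams across the whole corpus.
bigram_counts = Counter(bg for tokens in reviews for bg in ngrams(tokens, 2))
print(bigram_counts.most_common(1))
# [(('fast', 'shipping'), 2)]
```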

Another algorithm, called most_discriminating_terms, returns the terms that are most likely to indicate whether or not a review should be given a particular label. We can use this with preprocessed text, but also with the extracted terms above for any given review. This is helpful in determining which terms are most used to discuss various parts of Jet's business (e.g., within shipping, do customers discuss speed, packaging, or price?). It also let me create positive and negative filters for each label.
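To give a feel for what "discriminating" means here, the sketch below ranks terms by the difference in document frequency between reviews with and without a label. This is a simplified stand-in chosen for illustration, not the scoring that most_discriminating_terms actually uses, and the function name and sample data are hypothetical.

```python
from collections import Counter


def discriminating_terms(docs, labels, top_n=2):
    """Rank terms by (share of labeled docs containing the term) minus
    (share of unlabeled docs containing it). Simplified stand-in only."""
    pos_docs = [set(d) for d, y in zip(docs, labels) if y]
    neg_docs = [set(d) for d, y in zip(docs, labels) if not y]
    pos_df = Counter(t for d in pos_docs for t in d)
    neg_df = Counter(t for d in neg_docs for t in d)
    vocab = set(pos_df) | set(neg_df)

    def diff(term):
        return pos_df[term] / max(len(pos_docs), 1) - neg_df[term] / max(len(neg_docs), 1)

    # Ties are broken alphabetically so the output is deterministic.
    positive = sorted(vocab, key=lambda t: (-diff(t), t))[:top_n]
    negative = sorted(vocab, key=lambda t: (diff(t), t))[:top_n]
    return positive, negative


# Made-up tokenized reviews; True means "tagged as shipping" (assumption).
docs = [["shipping", "fast"], ["shipping", "late"], ["great", "app"], ["app", "crash"]]
labels = [True, True, False, False]
print(discriminating_terms(docs, labels))
# (['shipping', 'fast'], ['app', 'crash'])
```

The two returned lists play the role of a candidate positive and negative term list for the label.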

Filtering

The idea for filters came from an observation that many of the topics discussed in our reviews are expressed using a relatively small set of words.

For any given label, a positive filter is a list of words that, when they appear in a review, strongly indicate the review has to do with that label. A negative filter is the exact opposite: a list of words that strongly indicate the review does not have to do with that label. For instance, a positive filter for the shipping label could contain “shipping” and “was delivered”, while the negative filter could contain “great app” and “item selection”.

As with feature extraction, the constructed filters are themselves useful to the aggregate analysis system. Knowing which words and phrases strongly indicate a label allows us to summarize how customers are discussing various parts of the Jet experience. For example, under the umbrella of shipping, if packaging and delivery speed are ranked highly in the positive filter, we would know these topics are particularly important to customers.

The goal of filtering in the classification system was to identify “obvious” cases, positive or negative. Based on the overlap between a filter and a review, a score is assigned. This is done for the positive and negative filter for each label. From here, there were two options:

  1. Set a positive and negative threshold. Reviews with scores either above the positive threshold or below the negative threshold are given the appropriate label. All other reviews are classified by a machine learning model.
  2. Count the score as a feature, then send all reviews to a machine learning model for classification.
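The first option can be sketched as follows. The scoring formula (share of a review's tokens found in the filter), the threshold values, and the names filter_score and route are all assumptions made for illustration; the exact formula used in the project is not given here. Filters are also restricted to single words to keep the example short.

```python
def filter_score(tokens, filter_terms):
    """Share of a review's tokens that appear in the filter (assumed formula)."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in filter_terms)
    return hits / len(tokens)


def route(tokens, pos_filter, neg_filter, pos_threshold=0.3, neg_threshold=0.3):
    """Label the 'obvious' cases directly; defer everything else to a model."""
    pos = filter_score(tokens, pos_filter)
    neg = filter_score(tokens, neg_filter)
    if pos >= pos_threshold and neg < neg_threshold:
        return "label"
    if neg >= neg_threshold and pos < pos_threshold:
        return "no_label"
    return "model"  # ambiguous: hand off to the machine learning classifier


# Hypothetical single-word filters for the shipping label.
POS_FILTER = {"shipping", "delivered"}
NEG_FILTER = {"app", "selection"}

print(route(["shipping", "delivered", "late"], POS_FILTER, NEG_FILTER))  # label
print(route(["great", "app"], POS_FILTER, NEG_FILTER))                   # no_label
print(route(["love", "jet"], POS_FILTER, NEG_FILTER))                    # model
```

For the second option, the two values returned by filter_score would simply be appended to each review's feature vector instead of being compared against thresholds.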

In investigating how the filters performed, I found both the positive and negative filters were quite successful at identifying “true positive” results. A “true positive” result for a positive filter correctly applies its label to a review (based on a known training label), while a “true positive” result for a negative filter correctly does not apply the label to a review. “True negatives” correctly do the opposite. “False positives” and “false negatives” are when the respective actions are applied incorrectly.
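These four outcomes can be counted with a few lines of code. In the sketch below, fired says whether a filter triggered on each review and relevant says whether it should have: for a positive filter that means the review carries the label, for a negative filter that it does not. The sample data is made up, and precision, recall, and F1 are computed with their standard definitions.

```python
def confusion_counts(fired, relevant):
    """Return (tp, fp, fn, tn) for a filter's decisions against known labels."""
    pairs = list(zip(fired, relevant))
    tp = sum(1 for f, r in pairs if f and r)
    fp = sum(1 for f, r in pairs if f and not r)
    fn = sum(1 for f, r in pairs if not f and r)
    tn = sum(1 for f, r in pairs if not f and not r)
    return tp, fp, fn, tn


def precision_recall_f1(tp, fp, fn):
    """Standard definitions; each value is 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up decisions of a positive shipping filter on five reviews.
fired = [True, True, False, False, True]
has_label = [True, False, True, False, True]

tp, fp, fn, tn = confusion_counts(fired, has_label)
print(tp, fp, fn, tn)  # 2 1 1 1
print(precision_recall_f1(tp, fp, fn))

# For a negative filter, invert the labels before counting.
neg_relevant = [not y for y in has_label]
```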

Results from filtering for shipping reviews

In the graphs above, we see that both the positive and negative filters have very few “false positive” results as the threshold increases. This suggests that “positive” results can be trusted. The high number of “false negatives”, however, means that we cannot fully trust “negative” results from either filter. The threshold values are based on the overlap between each review and the corresponding filter along with the length of the review. These results support the idea of using filters to find the “obvious” cases (“positive” results), while sending the more complex reviews to the machine learning classifiers.

Classification — Training and Testing Models

The classification step for my project was relatively straightforward. spaCy provides a TextCategorizer, which implements a Convolutional Neural Network (CNN) behind the scenes. The TextCategorizer was quite simple to implement and immediately gave solid results. Training on a sample of ~5k reviews, I got the following preliminary results (each model trained for 10 iterations):

  1. “customer-experience” label: precision value of 0.844 and F-score of 0.896.
  2. “pricing” label: precision value of 0.958 and F-score of 0.969.
  3. “shipping” label: precision value of 0.905 and F-score of 0.897.

I also tried scikit-learn's Logistic Regression and SVM models. These models did not outperform spaCy's TextCategorizer, though they still performed quite well.

Next Steps

Unfortunately, my summer ended before I could experiment with more combinations of filtering, features, and models, but the early results were promising. The results posted above are based on preliminary training runs on small subsets of the total available data, which suggests performance may improve further when the full datasets are used. From the tests I ran, I found it interesting that model and filter performance changed noticeably for different labels, suggesting a one-size-fits-all approach is not ideal in this case. Instead, a pipeline composed of unique binary classifiers for each label could be the best option.

Further testing will help us better understand the underlying data and how best to parse and classify it. For instance, whether preprocessing is necessary before training different types of models, and whether filtering truly improves the overall pipeline’s results, are still open questions. As an example, CNNs generally cope with unpreprocessed text better than other classifiers do. For filtering, trying different thresholding approaches may be another way to improve results. The current thresholding formula penalizes longer reviews that discuss multiple topics, even though length has no bearing on whether a review belongs to a particular label.

Overall, I had a blast putting together a full pipeline from data collection to classification. Automatic classification will help us better categorize and track specific areas of feedback over time, while the aggregate analysis pipeline will allow us to better understand customer feedback and prioritize features to improve the shopping experience at Jet! I had an amazing time at Jet this summer and I want to thank the iOS and Research teams for helping me throughout the internship!