Keywords Extraction with Ngram and Modified Skip-gram based on spaCy

Irene Yang
Reputation.com Datascience Blog
7 min read · Mar 18, 2019

With the spread of social media, online reviews have become one of the most important sources for understanding what people think about a company’s reputation. Keyword extraction is one of the most common techniques for summarizing information from text. In some cases, keywords are not single words but phrases, or even phrases whose words are not directly adjacent to each other. For instance, “The doctors there are good” and “They have good doctors” can both be interpreted as “good doctor”.

In this case, ngram and skip-gram generators can help us produce candidate phrases. Moreover, a modified skip-gram based on the spaCy dependency tree is a powerful tool for extracting those skip-phrases. (spaCy is a free, open-source library for Natural Language Processing in Python.) Finally, by comparing frequency and pointwise mutual information scores, we can extract key phrases.

Goal

We aim to extract meaningful multi-gram keywords using both an ngram and a modified skip-gram generator and compare their performance. To be more specific, we only consider phrases with two words. Our case study in this blog focuses on the healthcare industry.

Methodology

Overall, the entire process has three parts: basic preprocessing, which removes non-ASCII tokens and splits the corpus to the sentence level; the gram generators; and the scorer. The raw corpus is preprocessed and passed to the different generators. The final output is multi-gram keywords with their scores.

Fig. 1. Flowchart of the entire method
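As a rough illustration of the preprocessing step, the sketch below drops non-ASCII tokens and splits the text into sentences. The function name and the naive regex-based sentence splitter are ours; in practice spaCy’s own sentence segmenter could be used instead.

```python
import re

def preprocess(corpus):
    # Drop any token that contains non-ASCII characters.
    ascii_tokens = [t for t in corpus.split() if t.isascii()]
    text = " ".join(ascii_tokens)
    # Naive split on sentence-final punctuation followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

print(preprocess("They make me feel like family. Great café staff!"))
```

Here the token “café” is removed because of its non-ASCII character, and the remaining text is returned as a list of sentences.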

Gram generators

We created two different generators: ngram generator, and modified skip-gram generator. The detailed steps for these two generators are explained in the following flowcharts.

  • Ngram: A preprocessor is applied to clean and standardize the text before the ngram generator. This preprocessor removes stop words, punctuation, numbers, and symbols, and lemmatizes the words. After getting the clean sentence-level corpus, the n-grams are all possible sequences of n adjacent words. For instance, if the input sentence is “They make me feel like family.”, the bigram pool will be (“they”, “make”), (“make”, “me”), (“me”, “feel”), (“feel”, “like”), (“like”, “family”).
Fig. 2. Ngram generator flowchart
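The adjacent-window step above can be sketched in a few lines; the function name is ours, and the input is assumed to already be a cleaned token list.

```python
def ngrams(tokens, n=2):
    # All adjacent n-word windows over the token list.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sent = ["they", "make", "me", "feel", "like", "family"]
print(ngrams(sent))
# [('they', 'make'), ('make', 'me'), ('me', 'feel'), ('feel', 'like'), ('like', 'family')]
```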
  • Modified Skip-gram: The modified skip-gram generator is based on the spaCy dependency tree. Instead of preprocessing the text as for the ngram, we apply a post-gram selection based on syntactic dependencies to remove stop words, numbers, punctuation, etc. The sentence-level corpus is parsed directly into a tree structure so that the original sentence structure is retained. Each token may have several children, whereas each child has only a single parent (head). Lemmatization is applied to each (child, parent) skip-gram pair before the post-gram selection.
Fig. 3. Skip-gram generator flowchart

Using the “They make me feel like family.” example, the tree for this sentence is displayed in Fig. 4. If two words are connected by an arrow, there is a dependency between them: for each parent (head), the arrows point to its child/children. In this case, the skip-gram pool will be (“they”, “make”), (“feel”, “make”), (“.”, “make”), (“me”, “feel”), (“like”, “feel”), (“family”, “like”).

Fig. 4. Dependency tree example

After collecting all skip-grams in the pool, the post-gram selection removes the pairs whose syntactic dependencies involve stop words, punctuation, symbols, and some designated less meaningful grams. Table 1 below shows an example of post-gram selection for “They make me feel like family.”. The annotation under each arrow represents the relationship in the (child, parent) pair. Pairs with relationships such as “punct”, “ccomp”, “prep”, and “pobj” are removed from the gram pool.
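To make the selection step concrete, the sketch below filters (child, parent) pairs by dependency label. To stay self-contained it uses a hand-written parse of the example sentence in the shape spaCy would produce — in practice these triples would come from something like `[(t.lemma_, t.dep_, t.head.lemma_) for t in nlp(sent)]`; the function name and the exact blocked-label set are our assumptions.

```python
# Dependency labels the post-gram selection filters out (a subset
# of the labels named in the post; the full method blocks more).
BLOCKED_DEPS = {"punct", "ccomp", "prep", "pobj"}

# Hand-written (child lemma, dep label, head lemma) triples for
# "They make me feel like family." — a stand-in for a spaCy parse.
PARSE = [
    ("they",   "nsubj", "make"),
    ("make",   "ROOT",  "make"),
    ("me",     "nsubj", "feel"),
    ("feel",   "ccomp", "make"),
    ("like",   "prep",  "feel"),
    ("family", "pobj",  "like"),
    (".",      "punct", "make"),
]

def skip_grams(parse):
    # One (child, parent) pair per non-root token, then drop pairs
    # whose dependency label is in the blocked set.
    return [(child, head) for child, dep, head in parse
            if dep != "ROOT" and dep not in BLOCKED_DEPS]

print(skip_grams(PARSE))  # [('they', 'make'), ('me', 'feel')]
```

Note how the post-selection discards exactly the “feel make”, “like feel”, and “family like” style pairs the post calls less meaningful.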

Scorer

We calculated three scores of each gram: frequency, pointwise mutual information (pmi), and weighted pointwise mutual information (wpmi).

  • Frequency: the number of occurrences of a specific phrase divided by the number of all phrases.
  • The pointwise mutual information (pmi): pmi(w1, w2) = log( p(w1, w2) / (p(w1) · p(w2)) )

Intuitively, this is the probability that two words appear together as a phrase divided by the probability that each word appears independently. The higher the score, the more likely it is that these two words appear as a phrase.

  • The weighted pointwise mutual information (wpmi): frequency × pmi

Grams with higher scores are more important and meaningful. The wpmi is considered the most important metric for keyword extraction because it takes both the frequency and the pmi into consideration.
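A minimal sketch of the three scores over a list of gram pairs from either generator; the function name is ours, and unigram probabilities are estimated from the words appearing in the gram pool itself.

```python
import math
from collections import Counter

def score_grams(grams):
    # grams: list of (w1, w2) tuples produced by a gram generator.
    gram_counts = Counter(grams)
    word_counts = Counter(w for g in grams for w in g)
    n_grams = len(grams)
    n_words = sum(word_counts.values())
    scores = {}
    for gram, count in gram_counts.items():
        freq = count / n_grams                       # relative frequency
        p_w1 = word_counts[gram[0]] / n_words
        p_w2 = word_counts[gram[1]] / n_words
        pmi = math.log(freq / (p_w1 * p_w2))         # log p(w1,w2) / (p(w1)p(w2))
        scores[gram] = {"freq": freq, "pmi": pmi, "wpmi": freq * pmi}
    return scores

scores = score_grams([("good", "doctor"), ("good", "doctor"), ("wait", "time")])
```

Because wpmi multiplies pmi by frequency, the twice-seen “good doctor” outranks the rarer “wait time” even though the latter’s pmi is higher.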

Case Study

We sampled 2,000 Google comments from 2018 for the same healthcare entity in Colorado (CO) and California (CA), applied both the bigram and skip-gram processors to the sample, and compared their performance. In addition, we applied our optimal method to the comments in CO and CA separately and compared the differences between the two locations.

For the word cloud, the grams are sized by the normalized “wpmi” and colored by the normalized average rating, where green represents positive and black represents negative.

Comparison between bigram and modified skip-gram

We compared the performance of the ngram and the modified skip-gram. The skip-gram has two advantages over the ngram. One is that the order of words in a sentence doesn’t matter, since the syntactic dependencies are predefined. For instance, “The doctor is good” and “They have a good doctor” will both be interpreted as “good doctor” by the skip-gram, but they would become two different phrases, “doctor good” and “good doctor”, under the bigram technique. The other advantage is that the post-selection process allows us to filter out some meaningless phrases, like “make feel”, “feel like”, etc. Thus, the skip-gram allows more meaningful phrases to stand out. Also, to improve the performance of the skip-gram, we merged keywords with the same meaning (e.g., “good doctor” and “excellent doctor”) when implementing it.

  • Tables: Lists of top 10 keywords sorted by weighted pointwise mutual information (wpmi) and their other scores.
  • Wordcloud:
Fig. 5. Word cloud of keywords using Bigram
Fig. 6. Word cloud of keywords using Skip-gram with clustering

These keywords summarize what customers cared about. The colors in the word cloud show the average ratings that correspond to these keywords. According to Table 2, Fig. 5, and Fig. 6, both the bigram and the skip-gram can extract keywords from the comments, like “emergency room”, “urgent care”, and “customer service”. Moreover, customers’ reviews about these keywords are usually negative. However, comparing the results from the bigram and the modified skip-gram, we can see that the skip-gram allows more useful keywords to stand out with a higher ranking. Within the top 10 keywords generated by the bigram, there are some meaningless keywords like “make sure”, “feel like”, and “make feel”. The keywords from the skip-gram, instead, are all meaningful and give more insight into “wait time”, “nurse doctor”, and the customers’ experiences. To conclude, the modified skip-gram performs better and draws more insights from the text.

Operational Insights from modified skip-gram

Reputation.com helps companies manage their online reputation and find actionable operational insights from customer reviews, surveys, and social media. We monitor the feedback our clients receive from consumers and provide insights about the topics people care most about. We help businesses identify strengths, weaknesses, and sentiment broken down by topic, by location, and over time. The modified skip-gram enables us to better understand what customers value and how they feel about their experiences. To be more specific, we implemented the modified skip-gram on customers’ reviews from Colorado and California in the healthcare industry and compared the differences.

Fig. 7. Word cloud of keywords of CO using modified skip-gram with clustering
Fig. 8. Word cloud of keywords of CA using modified skip-gram with clustering

Based on Fig. 7 and Fig. 8, we found that the sampled customer reviews in CO are more positive than those in CA. Overall, customers from both states cared about “customer service”, “urgent care”, “wait time”, “x ray”, etc. However, the reviews about customer service and x ray in CO are more positive than those in CA.

We can get more insights into these two locations separately. For the business in CO (word cloud shown in Fig. 7), customers felt good about the customer service, doctors, flu shots, and friendly nurses. However, the most frequent complaints were about urgent care, appointment scheduling, and wait time. In CA (word cloud shown in Fig. 8), the urgent care, wait time, emergency room, and customer service topics drove the negative customer reviews. Companies should pay more attention to these areas to improve their customer experiences. As for the strengths, the doctors and nurses did a good job and left positive impressions.

With the modified skip-gram, we can draw more meaningful insights from customers’ feedback and figure out what customers care about and what should be improved to enhance service quality. Reputation.com helps companies understand their business better: we not only manage online reputation across multiple locations and categories but also identify where a company should focus to improve the customer experience.

Many of our customers use the word cloud as a tool to uncover issues and perform root cause analysis. By understanding the problem areas and locations that need improvement, companies can take action to implement better processes, train employees, and prioritize investments to improve customer experience. Those improvements increase customer happiness which translates into a better online reputation, more customers, and ultimately, better returns.

References

spaCy
