Keyphrase extraction using sentence embeddings (unsupervised learning)
Our use case was to generate key phrases (bi-grams or tri-grams) from reviews instead of single-word topics. A single word does not give a holistic view of what is being said about a product in the market, whereas a phrase helps us understand whether something is being talked about in a positive or a negative sense. I started with topic modeling using n-grams and also tried generating high-frequency words with TF-IDF and RAKE, but neither produced phrases that actually captured the frequently discussed themes.
Therefore, I decided to build the pipeline from scratch instead of using a library, and used embeddings as discussed in this paper.
Here I have discussed our whole approach in detail. Hope it helps you.
I did only a limited amount of preprocessing:
- removing all punctuation and symbols.
- converting all the reviews to lower case.
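The two preprocessing steps above can be sketched in a few lines (the function name is mine, not from the original pipeline):

```python
import re

def preprocess(review: str) -> str:
    # lowercase, replace punctuation/symbols with spaces, collapse whitespace
    review = review.lower()
    review = re.sub(r"[^a-z0-9\s]", " ", review)
    return " ".join(review.split())

print(preprocess("Great app!! Love the UI :)"))  # great app love the ui
```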
Then I created a list of all the candidate keyphrases using simple rule-based matching. I used spaCy's en_core_web_lg model to get part-of-speech tags for each word, and extracted phrases consisting of zero or more adjectives followed by one or more nouns or proper nouns.
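One way to sketch that rule-based matcher is to run the pattern ADJ* (NOUN|PROPN)+ over (word, POS-tag) pairs; in practice the pairs would come from spaCy, e.g. `[(t.text, t.pos_) for t in nlp(review)]`. The function and the example tags below are illustrative, not the original code:

```python
def candidate_phrases(tagged):
    """Match ADJ* (NOUN|PROPN)+ over a list of (word, pos) pairs."""
    phrases, current, has_noun = [], [], False
    for word, pos in tagged:
        if pos in ("NOUN", "PROPN"):
            current.append(word)
            has_noun = True
        elif pos == "ADJ":
            if has_noun:                       # previous phrase complete; flush it
                phrases.append(" ".join(current))
                current, has_noun = [], False
            current.append(word)
        else:                                  # any other POS ends the phrase
            if has_noun:
                phrases.append(" ".join(current))
            current, has_noun = [], False
    if has_noun:                               # flush a phrase ending the sentence
        phrases.append(" ".join(current))
    return phrases

tags = [("the", "DET"), ("great", "ADJ"), ("user", "NOUN"),
        ("interface", "NOUN"), ("crashes", "VERB"), ("often", "ADV")]
print(candidate_phrases(tags))  # ['great user interface']
```

Note that a trailing adjective with no following noun (e.g. just "nice") produces no candidate, matching the "one or more nouns" requirement.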
Now that I had a list of all the candidate key phrases, I needed to select the ones that best represented what the reviews talked about most frequently. The major difficulty here was synonyms: words like "good" and "great" mean the same thing but won't be selected as the same keyphrase, adding redundancy to the selected phrases. I solved this by calculating cosine similarity and MMR in the vector space. I used Sent2Vec to convert sentences to vectors. Sent2Vec is an extension of Word2vec and can conveniently represent arbitrary-length English sentences as fixed-length vectors. It reflects semantic relatedness between sentences when standard similarity measures are used on the corresponding vectors.
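The standard similarity measure referred to here is cosine similarity; a minimal NumPy version:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 = same direction)."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity([1, 1, 0], [1, 0, 0]))  # ≈ 0.707
```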
I pretrained the sent2vec model on a combination of the wiki bigrams data (wiki_bigrams.bin) and bigrams from our review corpus. This helped the model learn not only the formal Wikipedia language style but also the informal style in which reviews are written. All the reviews combined form one document, which is then placed in the same vector space as the wiki bigrams data.
Then I compared the cosine similarity of each candidate keyphrase to the document vector, one by one. If the cosine similarity was high, I checked the phrase's cosine similarity with the keyphrases that had already been selected. If a phrase was highly similar to the document vector and not close to any of the already-selected phrases, I kept it; if it was close to any of them, I discarded it. This technique is called Maximal Marginal Relevance (MMR).
If the similarity between two phrases was very high, I kept only the first phrase and discarded all the phrases that were similar to it.
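The selection procedure above can be sketched as greedy MMR. The trade-off weight `lam`, the `top_n` cutoff, and the toy vectors below are my own illustrative choices; the real inputs would be Sent2Vec embeddings of the document and of each candidate phrase:

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr_select(doc_vec, phrase_vecs, phrases, top_n=5, lam=0.7):
    """Greedily pick phrases: reward similarity to the document,
    penalize similarity to phrases already selected."""
    doc_vec = np.asarray(doc_vec, dtype=float)
    vecs = [np.asarray(v, dtype=float) for v in phrase_vecs]
    selected, candidates = [], list(range(len(phrases)))
    while candidates and len(selected) < top_n:
        def score(i):
            relevance = cos(vecs[i], doc_vec)
            redundancy = max((cos(vecs[i], vecs[j]) for j in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return [phrases[i] for i in selected]

doc = [1, 1, 0]
phrases = ["fast delivery", "quick shipping", "bad quality"]
vecs = [[1, 1, 0], [0.9, 1, 0], [0, 0, 1]]
print(mmr_select(doc, vecs, phrases, top_n=2, lam=0.3))
# ['fast delivery', 'bad quality']  (the near-duplicate phrase is skipped)
```

With a low `lam` the redundancy penalty dominates, so "quick shipping" (nearly identical to the already-selected "fast delivery") loses to the dissimilar "bad quality".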
Now if you want to link the reviews back to the key phrases, you can follow the same approach as earlier, with the roles swapped:
- Replace the single document of app reviews with the given key-phrase
- Replace the list of candidate key phrases with the list of reviews
- Apply the same technique as before, but without MMR, since we do want redundancy here.
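The steps above amount to a plain ranking of reviews by cosine similarity to the key-phrase vector, with no MMR step. A minimal sketch (function name and toy vectors are mine; real inputs would be Sent2Vec embeddings):

```python
import numpy as np

def rank_reviews(phrase_vec, review_vecs, reviews, top_n=3):
    """Return the reviews closest to a key-phrase vector.
    No MMR: similar (redundant) reviews are welcome here."""
    p = np.asarray(phrase_vec, dtype=float)
    def cos(v):
        v = np.asarray(v, dtype=float)
        return float(v @ p / (np.linalg.norm(v) * np.linalg.norm(p)))
    order = sorted(range(len(reviews)), key=lambda i: cos(review_vecs[i]), reverse=True)
    return [reviews[i] for i in order[:top_n]]

print(rank_reviews([1, 0], [[1, 0], [0, 1], [1, 1]], ["r1", "r2", "r3"], top_n=2))
# ['r1', 'r3']
```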