Review-based Search Engine using Universal Sentence Encoder and PEGASUS

Jung-a Kim
5 min read · Sep 17, 2020

With the rise of the e-commerce market, reviews are becoming more influential on consumers’ decisions. According to BrightLocal’s Local Consumer Review Survey in 2014, 88% of consumers trust online reviews as much as personal recommendations, and on average, consumers read about 10 reviews before making a purchase.

Aligned with this trend, I built a review-based search engine that retrieves the products most positively relevant to the user’s query and surfaces, from reviews, each product’s most relevant pros and cons that consumers would want to check before making a purchase.

The dataset used for building this search engine is the public Amazon reviews dataset for electronics between 1999 and 2015, which contains about 3 million reviews and 160k unique products.

Workflow of building the review-based search engine

The engine requires three models. First, a sentiment analysis model: a deep neural network trained on sentence embeddings. Second, a similarity matrix to compare reviews against a query. Lastly, a summarization model to condense reviews that are too long.

I used Flask to build an app that shows what the engine actually looks like on the user’s end.

So, how does it find the products that match the key qualities users are looking for? It relies on a metric called the ‘Positive Similarity score’, which measures how relevant a product is to the query based on its reviews. The Positive Similarity score combines three scores from the aforementioned models. First, a sentiment score is computed for each sentence in a review. It is a binary classification score, where a 5-star rating is treated as positive sentiment and anything lower as negative.

The sentiment analysis model is a deep neural network adapted from this article with a bit of fine-tuning. The model achieved 82% accuracy on an unseen test set.
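The linked article contains the actual training code; as a rough sketch of the idea (assuming 512-dimensional USE sentence embeddings and binary labels, with scikit-learn’s MLPClassifier standing in for the Keras network and random arrays standing in for real data):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Stand-ins for USE sentence embeddings (512-dim) and binary labels:
# 1 = positive (5-star review), 0 = negative (anything else).
X_train = rng.normal(size=(200, 512))
y_train = rng.integers(0, 2, size=200)

# Small feed-forward network on top of the frozen sentence embeddings.
clf = MLPClassifier(hidden_layer_sizes=(64, 16), max_iter=200, random_state=0)
clf.fit(X_train, y_train)

# Per-sentence sentiment scores are the predicted probability of the positive class.
sentiment_scores = clf.predict_proba(rng.normal(size=(5, 512)))[:, 1]
print(sentiment_scores.shape)
```

In the real engine, each review sentence gets such a score, which then feeds into the Positive Similarity computation.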

Second, two types of similarity scores between the products and the query are computed, and the minimum of the two is used, to stay conservative about the similarity. Let me explain what this score is about.

The similarity score is the average, over query tokens, of the maximum similarity between each token and the review sentences, so that similarity with any part of the query is captured. While reviews are tokenized only into sentences, a query is tokenized in two ways: into sentences and into words. The reason is that some queries carry distinct keywords while others rely on context. For example, if a query is “I want a backlit, wireless, quiet keyboard,” then the bag of words ‘backlit’, ‘wireless’, ‘quiet’ explains most of what the query means. On the other hand, if a query is “I want a keyboard that has a backlight and works remotely with soft keystrokes,” then words like ‘works’ and ‘soft’ become ambiguous without their context.
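The scoring above can be sketched in a few lines of NumPy. This is a minimal illustration, not the engine’s actual code: the embeddings are random placeholders, and the function names are mine.

```python
import numpy as np

def cosine_sim_matrix(A, B):
    """Pairwise cosine similarity between rows of A and rows of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def similarity_score(query_embs, review_embs):
    """Average over query tokens of the max similarity to any review sentence."""
    sims = cosine_sim_matrix(query_embs, review_embs)  # (n_query_tokens, n_sents)
    return sims.max(axis=1).mean()

# Placeholder embeddings: the query tokenized two ways, the review into sentences.
rng = np.random.default_rng(0)
query_word_embs = rng.normal(size=(3, 512))   # e.g. 'backlit', 'wireless', 'quiet'
query_sent_embs = rng.normal(size=(1, 512))   # the whole query as one sentence
review_sent_embs = rng.normal(size=(8, 512))  # the review's sentences

# Take the minimum of the two scores to stay conservative.
score = min(similarity_score(query_word_embs, review_sent_embs),
            similarity_score(query_sent_embs, review_sent_embs))
print(float(score))
```

Taking the row-wise max before averaging is what lets a single matching review sentence satisfy one part of the query.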

Lastly, a user can select keyword boxes that appear in the app to give those words more weight than the others, which will be shown later in the app demo.

Figure 1: Sentence similarity scores using embeddings from the universal sentence encoder. (Image Source from the paper: Universal Sentence Encoder, Daniel Cer and others, 2018, arXiv:1803.11175)

Before being fed into the models, the reviews and the query need to be converted into high-dimensional embeddings. Among many encoders, the Universal Sentence Encoder seemed to fit the purpose of this engine best.

The Universal Sentence Encoder (USE) is a transformer-based sentence encoder introduced by the Google Research team in 2018. A sentence embedding is the element-wise sum of context-aware word representations, as in BERT, divided by the square root of the sentence length so that longer sentences gain no advantage.
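That pooling step is simple enough to write out directly. A minimal sketch with placeholder word vectors (the real encoder produces the word representations with a transformer; here they are random):

```python
import numpy as np

def use_style_pool(word_vectors):
    """Element-wise sum of word representations, divided by sqrt(sentence length)."""
    word_vectors = np.asarray(word_vectors)
    return word_vectors.sum(axis=0) / np.sqrt(len(word_vectors))

rng = np.random.default_rng(0)
short_sent = rng.normal(size=(3, 512))   # a 3-word sentence
long_sent = rng.normal(size=(30, 512))   # a 30-word sentence

# Without the sqrt scaling, the 30-word sum would dwarf the 3-word one;
# the scaling keeps embedding magnitudes comparable across sentence lengths.
print(np.linalg.norm(use_style_pool(short_sent)))
print(np.linalg.norm(use_style_pool(long_sent)))
```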

At this point, you may wonder: why not just use BERT, which also produces context-aware embeddings?

BERT is trained to predict whether one sentence follows another and to fill in masked words. For example, “How much does this cost?” and “Is this expensive?” are similar to each other, but their similarity score under BERT embeddings is quite low, since they are unlikely to follow one another in the same document.

On the other hand, USE is trained to identify the similarity between pairs of sentences. It is evaluated on semantic textual similarity benchmarks, where its scores are compared against human judgments using Pearson correlation.

Now, let’s take a look at the search engine app, which is much more intuitive than an explanation in words.

Review-based search engine app demo (templates by Colorlib)

In the demo, I typed ‘noiseless, wireless keyboard’, and some of the retrieved reviews are summarized, as indicated by the three leading dots. In the second example, ‘21 inch light cheap monitor’, the summary says “a baby monitor”, while the original text says, in exact words, “infant monitors” and “I can see the baby”. This is ‘abstractive summarization’, which does not merely copy sentences into the summary.

The summarization model used here is PEGASUS (Pre-training with Extracted Gap-sentences for Abstractive Summarization), a transformer encoder-decoder model developed by Google.

Figure 2: The base architecture of PEGASUS (Image Source from the paper: PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization, 2019, arXiv:1912.08777)

During pre-training, PEGASUS applies a BERT-style masking objective at the sentence level: important sentences are masked out of the input document, and the decoder learns to generate those masked sentences, concatenated, as a pseudo-summary.
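The gap-sentence masking that builds these pseudo-summaries can be sketched in plain Python. In the paper, a sentence’s importance is its ROUGE-1 F1 against the rest of the document; here the importance values are passed in precomputed, and the mask token and example sentences are illustrative only:

```python
def gap_sentence_mask(sentences, importance, mask_ratio=0.3, mask_token="<mask_1>"):
    """Mask the most important sentences; return (masked input, pseudo-summary)."""
    n_mask = max(1, int(len(sentences) * mask_ratio))
    # Indices of the n_mask highest-importance sentences, kept in document order.
    masked = sorted(sorted(range(len(sentences)),
                           key=lambda i: importance[i], reverse=True)[:n_mask])
    encoder_input = [mask_token if i in masked else s
                     for i, s in enumerate(sentences)]
    target = " ".join(sentences[i] for i in masked)  # the pseudo-summary
    return encoder_input, target

sentences = ["The keyboard feels sturdy.", "The keys are quiet.",
             "Battery lasts a month.", "Shipping was slow."]
importance = [0.9, 0.4, 0.6, 0.1]  # precomputed, e.g. ROUGE-1 F1 vs. the rest
inp, target = gap_sentence_mask(sentences, importance, mask_ratio=0.25)
print(inp)     # first sentence replaced by the mask token
print(target)  # 'The keyboard feels sturdy.'
```

The decoder is then trained to reconstruct `target` from `inp`, which is what makes the pre-training objective resemble summarization.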

PEGASUS achieved state-of-the-art performance on 12 downstream datasets in terms of ROUGE-1 F1. The datasets were mainly news articles (including CNN and BBC), scientific papers from arXiv, and Reddit TIFU stories. In my opinion, the subreddit stories resembled the tone of Amazon reviews most closely among those 12 datasets, so I chose PEGASUS trained on 120K Reddit TIFU stories.

My main finding is that the engine mostly returns products relevant to users. But the time complexity of the Positive Similarity score, which combines three matrices, is quite high despite vectorization.

The biggest caveat of this engine is that it cannot infer a product’s category from its title using USE sentence embeddings alone. I could have tried transfer learning, fine-tuning these embeddings on product titles to detect categories.

PEGASUS is an amazing summarizer, but at the same time, it takes up to 2 minutes to summarize one long text. If a distilled version of PEGASUS existed, I would definitely try it out!

If you want to see the code, click this link!
