May 14 · 11 min read

Section 1 Background Story

In the world of e-commerce, customers leave thousands of reviews for the products they buy. These reviews contain rich information that could be very useful to prospective customers. However, popular products can attract many reviews (more than 1,000) covering different aspects of the product, and it is inefficient for customers to manually find what they are really interested in. If we could summarise those reviews and put meaningful tags (e.g. hashtags) in front of them, customers could easily read what they want to know by clicking different tags. This would improve the user experience and enhance user engagement.

Product tag extraction is already becoming a trend in the world of e-commerce. The world’s largest e-commerce companies like Amazon and Taobao already have this feature in place.

Review Tags on Amazon
Review Tags on Taobao

Section 2 Literature Review

Generally, there are two main ways to extract keywords/tags from product reviews: supervised learning and unsupervised extraction. In our project, we choose an unsupervised method, because supervised learning requires large amounts of labeled data and is difficult to scale across different product categories and websites. Unsupervised extraction is more scalable and more robust to different product types.

Section 2.1 Frequency-based Method

One of the earliest studies seeking to extract information from customer reviews is "Mining and Summarizing Customer Reviews" [1], where the authors define frequent feature patterns to extract people's opinions. The extracted candidates are simply ranked by frequency. Later research (e.g. [2][3]) combines this original idea with NLP techniques such as part-of-speech tagging and dependency parsing; these are collectively known as frequency-based methods.

Section 2.2 Graph-based Method

The graph-based approach, originally proposed by Rada Mihalcea and Paul Tarau, provides another way to solve the keyword extraction problem. They call their method TextRank [4], which roughly follows the classical PageRank algorithm. In TextRank, words are defined as vertices, and an edge links two words if they appear within the same context window. The importance of each word is obtained by running the PageRank algorithm until convergence. Most graph-based approaches follow this skeleton but differ in their definitions of vertices, edges, and weights [5][6][7].

Section 2.3 Neural Topic Model

There are also approaches that use topic modeling algorithms to extract the key aspects of customer reviews. The topic-model-based approach offers an additional benefit, since it tries to learn a popular topic rather than a popular word. Latent Dirichlet Allocation (LDA) and its variants have become popular unsupervised approaches for aspect extraction [8][9]. Neural topic models have also gained much attention with the development of deep learning, learning better topic representations with attention and CNN structures [10]. Despite promising results, topic-based models usually suffer from a major deficit: they require humans to manually assign a name to each topic, and those names are usually subjective and biased. This prevents the models from being used in large industrial settings.

Section 3 Methodology Overview

We follow a hybrid approach, with the model framework as shown below. In general, there are three main steps: candidate selection, candidate ranking, and tag selection. We also add two new features for internal analysis: aspect sentiment analysis and tag hierarchy. A screenshot demo is provided below; it is built upon our internal Perception platform, which we use to visualise our machine learning/deep learning projects.

Process of the product review tagging
Live demo of tagging system

Section 4 Detailed Implementation

Finally, we enter the most exciting part. This section will walk through major building blocks of the algorithm pipeline, including candidate selection, candidate ranking, and tag selection.

Section 4.1 Candidate Selection

Our candidate selection is mainly based on dependency parsing. A lemmatisation tool aligns words such as "mix", "mixed" and "mixes" to the single lemma "mix". The core of the algorithm extracts candidate patterns using predefined dependency-relation rules. A dependency parser analyses the grammatical structure of a sentence, establishing relationships between "head" words and the words that modify them. We use spaCy to obtain the dependency parse; a visualisation is shown below. There are around 50 dependency relations between words, and the predefined rules are based on domain experience and linguistic knowledge.

An example of the dependency parsing tree

For example, when the review "I use wrist wrap for gym exercise." is fed to the dependency parser, the following tree-structured relation is obtained. We then apply the predefined rules to find the valid bigrams "use wrist", "wrist wrap" and "gym exercise", which are extracted as candidates. For unigrams, we only pick nouns, verbs, and adjectives as candidates. While iterating over all reviews, each candidate's position within the sentence is tracked and its number of occurrences is counted. The output of this phase is the set of candidates, together with their occurrence counts and positions across all reviews.
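To make the rule-based extraction concrete, here is a minimal self-contained sketch. In the real pipeline spaCy produces the parse; to keep the example runnable without a model download, the parse of the sentence above is hard-coded, and the rule set (`BIGRAM_RULES`, `UNIGRAM_POS`) is an illustrative stand-in for the predefined dependency-relation rules, not the production rule set.

```python
# Each token: (lemma, pos, head_index, dependency_relation).
# This mimics spaCy's parse of "I use wrist wrap for gym exercise."
PARSE = [
    ("i",        "PRON", 1, "nsubj"),
    ("use",      "VERB", 1, "ROOT"),
    ("wrist",    "NOUN", 3, "compound"),
    ("wrap",     "NOUN", 1, "dobj"),
    ("for",      "ADP",  1, "prep"),
    ("gym",      "NOUN", 6, "compound"),
    ("exercise", "NOUN", 4, "pobj"),
]

# Illustrative rules: relations that license a (head, dependent) bigram,
# and POS tags accepted as unigram candidates.
BIGRAM_RULES = {"compound", "dobj", "amod"}
UNIGRAM_POS = {"NOUN", "VERB", "ADJ"}

def extract_candidates(parse):
    """Return (unigram, bigram) candidates from one parsed sentence."""
    unigrams, bigrams = [], []
    for i, (lemma, pos, head, dep) in enumerate(parse):
        if pos in UNIGRAM_POS:
            unigrams.append(lemma)
        if dep in BIGRAM_RULES:
            head_lemma = parse[head][0]
            # Order the bigram by sentence position.
            pair = (lemma, head_lemma) if i < head else (head_lemma, lemma)
            bigrams.append(" ".join(pair))
    return unigrams, bigrams

unigrams, bigrams = extract_candidates(PARSE)
print(unigrams)
print(bigrams)
```

In the full pipeline this loop runs over every sentence of every review, accumulating occurrence counts and positions per candidate.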

Section 4.2 Candidate Ranking in 5 dimensions

Candidate ranking is the core part of the review tagging algorithm. We want to rank the more important tags higher and display them on the websites. "Importance" is quite a subjective concept, and in our setting we break it down into five different dimensions.

1 — Informativeness

Informativeness means how representative these tags are of this product and this group of people. TF-IDF is a straightforward metric to measure informativeness, since it reflects the number of customers that are interested (term frequency) as well as the importance of the term to this particular product (inverse document frequency). However, this approach views each word in isolation without considering relations between words, and the performance of plain IDF degrades when the corpus spans a wide range of domains. Instead, we propose a new informativeness ranking strategy called dependency relation based TextRank Domain Relevance Score (dr-TRDR), which is covered in the following sections.
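As a point of reference, here is a minimal sketch of the TF-IDF baseline described above. The corpus and tokenisation are toy assumptions: each "document" stands for the concatenated, lemmatised reviews of one product.

```python
import math

# Toy corpus: product id -> lemmatised review tokens (illustrative only).
docs = {
    "wrist_wrap": ["wrist", "wrap", "support", "gym", "wrap", "comfortable"],
    "vitamin_c":  ["vitamin", "taste", "good", "vitamin", "supplement"],
    "yoga_mat":   ["mat", "grip", "gym", "comfortable", "thick"],
}

def tfidf(term, doc_id, docs):
    """TF-IDF of a term for one product's pooled reviews."""
    tf = docs[doc_id].count(term) / len(docs[doc_id])
    df = sum(1 for tokens in docs.values() if term in tokens)
    idf = math.log(len(docs) / df)
    return tf * idf

# "wrap" is specific to the wrist-wrap product, so it outranks
# "gym", which also appears in the yoga-mat reviews.
print(round(tfidf("wrap", "wrist_wrap", docs), 3))
print(round(tfidf("gym", "wrist_wrap", docs), 3))
```

This is exactly the behaviour the paragraph criticises: each word is scored in isolation, with no notion of phrases or word relations.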

2 — Phrase-ness

Phrase-ness measures how likely a bigram is to be a valid phrase. For example, "make feel" might have a high informativeness score, since people frequently write things like "this vitamin makes me feel good", but it is not a good standalone tag for people to understand. We measure phrase-ness by co-occurrence and positive pointwise mutual information (PPMI) calculated on a large corpus. If a bigram's co-occurrence and PPMI fall below a threshold, the tag receives a zero phrase-ness score. The notion of phrase-ness is also incorporated into our dr-TRDR algorithm.

3 — Semantic-ness

If a tag is important, tags with similar meanings are likely to be important as well. Under this assumption, we incorporate the semantic similarity between tags into our dr-TRDR algorithm.

4 — Diversity

Ideally, we want the displayed tags to cover a wide range of topics rather than cluster around a single aspect. A classical approach called Maximal Marginal Relevance (MMR) is applied in the tag selection stage, which is discussed later.

5 — Coverage

Ideally, we want our displayed tags to cover a wide range of reviews rather than cluster in a small proportion of reviews. Since these tags are built for users to explore other customers’ opinions, the impact of the tags will shrink if they only cover 10% of all the reviews. It is considered together with diversity in the tag selection stage.

Overall, in this stage, each tag is ranked by dr-TRDR with the following formula.

Section 4.3 Dependency Relation based TextRank (dr-TR)

Unsupervised keyphrase extraction is a popular area in academia, and most solutions are graph-based iterative approaches originating from the TextRank paper. Graph-based ranking algorithms are essentially a way of deciding the importance of a vertex within a graph, based on global information recursively drawn from the entire graph. The underlying idea is that of "voting" or "recommendation": when one vertex links to another, it casts a vote for that vertex. The more votes a vertex receives, the more important it is. Moreover, the importance of the vertex casting a vote determines how much that vote is worth, and this is also taken into account by the ranking model. Hence, the score of a vertex is determined both by the votes cast for it and by the scores of the vertices casting those votes.

Graph-based ranking approaches consider the intrinsic structure of the text instead of treating it as a simple aggregation of terms, and are therefore able to capture richer information when determining important concepts.

There are dozens of variations of this algorithm in terms of how the vertices are defined and how the edge weights are measured, but most follow the same structure. Here we propose a new variant called dependency relation based TextRank (dr-TR). Words with the same lemma are modeled as vertices in the graph, and an edge links two vertices if they are connected by one of the dependency relations predefined in the candidate selection phase. For edge weights, the classical TextRank algorithm mainly uses co-occurrence counts; we additionally take phrase-ness and semantic-ness into account, since when a word "votes" for its neighbours, it should vote more for neighbours that are more similar to itself or that form a phrase with it. The formula is defined as follows:

The formula of dr-TR

freq(·) denotes how often a word occurs across all reviews of a product, and PPMI is calculated on an external corpus. The attraction score defines how strongly two words attract each other, by analogy with the formula for gravity. We iterate through all reviews to set up the vertices, edges, and weights; an illustrative example is shown below. Once the graph is constructed, the classical PageRank algorithm is run for 1,000 iterations, and the final score of each vertex is its TextRank score.
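The iteration itself is standard weighted PageRank; here is a minimal self-contained sketch over a toy word graph. The edge weights below are invented stand-ins for the combined co-occurrence / phrase-ness / semantic-ness weight described above.

```python
def pagerank(weights, d=0.85, iterations=1000):
    """Weighted PageRank over an undirected graph.

    weights: dict mapping an (u, v) vertex pair to a symmetric edge weight.
    Each vertex distributes its score to neighbours in proportion to
    the weight of the connecting edge.
    """
    nodes = {u for edge in weights for u in edge}
    neighbours = {u: {} for u in nodes}
    for (u, v), w in weights.items():
        neighbours[u][v] = w
        neighbours[v][u] = w
    score = {u: 1.0 for u in nodes}
    for _ in range(iterations):
        new = {}
        for u in nodes:
            incoming = sum(
                score[v] * neighbours[v][u] / sum(neighbours[v].values())
                for v in neighbours[u]
            )
            new[u] = (1 - d) + d * incoming
        score = new
    return score

# Toy graph: "wrist wrap" forms a strong phrase, so its edge is heavy.
edges = {
    ("wrist", "wrap"): 5.0,
    ("wrap", "support"): 2.0,
    ("gym", "exercise"): 3.0,
    ("wrap", "gym"): 1.0,
}
scores = pagerank(edges)
print(sorted(scores, key=scores.get, reverse=True))
```

Because "wrap" sits at the centre of several weighted edges, it accumulates the most votes, which is exactly the behaviour dr-TR relies on to surface important candidate words.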

More formally, the iteration process is defined as follows, where d is the damping factor (usually 0.85).

The formula of the TextRank iteration
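For reference, the weighted iteration in the original TextRank paper [4], which the update above follows, has the form:

```latex
WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)
```

where $w_{ji}$ is the weight of the edge from $V_j$ to $V_i$, and $In(V_i)$ and $Out(V_j)$ are the sets of incoming and outgoing neighbours.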

Section 4.4 Domain Relevance Score (DR)

While dr-TR models tag candidates' popularity locally, the domain relevance score models their relevance globally across domains [11]. We collect a corpus of customer reviews from different domains, including Sports, Beauty, and Nutrition, and calculate a domain relevance score for each domain; the detailed formula is given below. For each product, we first obtain its domain category, then look up the associated score for every tag candidate. This is a more flexible measurement than IDF, since it allows a word to have different scores in different domains: a word like "tablet" clearly has different importance in the "medicine" domain and the "movie" domain.

The formula of Relative domain relevance

w_tj is a TF-IDF-like weight of candidate t in document j, and N is the number of documents in domain D. R(t, D) combines two measures to reflect the salience of a candidate in D. The first part reflects how frequently the term is mentioned in a particular document, where W_j denotes the number of words in document j. The second part quantifies how significantly the term is mentioned across all documents in D.
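The exact formula lives in the figure above, so the sketch below is only one plausible instantiation of the two parts described: an average within-document frequency term multiplied by a cross-document spread term. The normalisations are assumptions, not the production definition.

```python
# Hedged sketch of a domain-relevance style score. The combination of
# a frequency part and a spread part follows the prose description;
# the concrete normalisations here are illustrative.

def domain_relevance(term, domain_docs):
    """domain_docs: list of token lists, one per document in domain D."""
    N = len(domain_docs)
    # First part: within-document frequency (count / W_j), averaged over D.
    freq_part = sum(doc.count(term) / len(doc) for doc in domain_docs) / N
    # Second part: how widely the term spreads across documents in D.
    df = sum(1 for doc in domain_docs if term in doc)
    spread_part = df / N
    return freq_part * spread_part

# Toy Sports-domain reviews: "gym" is pervasive, "tablet" incidental.
sports = [
    ["wrist", "wrap", "gym", "support"],
    ["gym", "exercise", "wrap"],
    ["tablet", "holder", "gym"],
]
print(domain_relevance("gym", sports))
print(domain_relevance("tablet", sports))
```

Scored against a Nutrition-domain corpus instead, "tablet" would overtake "gym", which is the per-domain flexibility the paragraph contrasts with a single global IDF.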

All of the ranking calculations above are based on unigrams; for bigrams, we average the scores of the component unigrams. This can cause a problem where a very important word has too much influence on the final result. For instance, if "vitamin" has a very high score, vitamin-related tags like "good vitamin", "vitamin supplement" and "vitamin taken" will take all the top positions. In the tag selection stage, we therefore try to avoid semantic overlap between selected tags.

Section 4.5 Tag Selection

Finally, we come to the tag selection phase. In the literature, people usually select the top-n tags in ranking order. This is acceptable if the extracted tags do not share too much semantic information, which is usually not true in an industrial setting with a large volume of reviews. We explicitly increase coverage and diversity among the selected keyphrases by introducing an embedding-based maximal marginal relevance (MMR), which combines the concepts of relevance and diversity in a controllable way. We show how to adapt MMR to keyphrase extraction, combining keyphrase informativeness with dissimilarity to already-selected keyphrases.

The original MMR from information retrieval and text summarisation is based on the set of all initially retrieved documents, R, for a given input query Q, and on an initially empty set S representing documents that are selected as good answers for Q. S is iteratively populated by computing MMR as described in the formula, where Di and Dj are retrieved documents, and Sim_1 and Sim_2 are similarity functions.

MMR for selecting tags dynamically

When λ = 1, MMR computes a standard relevance-ranked list, whilst when λ = 0 it computes a maximal-diversity ranking of the documents in R. In our setting, Sim_1 is our ranking score, Sim_2 is the cosine similarity between word embedding vectors, and λ is set to 0.5 to give relevance and diversity equal importance.
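The adaptation to tags can be sketched in a few lines. The candidate scores and the 2-d "embeddings" below are toy values chosen so that the two vitamin tags point the same way; in production, Sim_2 would use real word embedding vectors.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def mmr_select(candidates, k, lam=0.5):
    """Greedy MMR selection of k tags.

    candidates: dict tag -> (ranking_score, embedding).
    Each step picks the tag maximising
    lam * Sim1(tag) - (1 - lam) * max_{s in selected} Sim2(tag, s).
    """
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < k:
        def mmr(tag):
            score, emb = remaining[tag]
            redundancy = max(
                (cosine(emb, candidates[s][1]) for s in selected),
                default=0.0,
            )
            return lam * score - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        del remaining[best]
    return selected

# Toy candidates: two near-duplicate vitamin tags plus a distinct one.
candidates = {
    "good vitamin":    (0.9, [1.0, 0.0]),
    "vitamin quality": (0.8, [0.95, 0.05]),
    "quick delivery":  (0.5, [0.0, 1.0]),
}
print(mmr_select(candidates, k=2))  # ['good vitamin', 'quick delivery']
```

Although "vitamin quality" has the second-highest raw score, its redundancy with "good vitamin" pushes it below "quick delivery", which is exactly the diversity and coverage behaviour described above.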

Section 4.6 Additional Features

We provide aspect-level sentiment analysis together with a tag hierarchy for internal business analysts. Due to space limitations, we give only a brief introduction to these two features.

Aspect-level sentiment analysis predicts the polarity of each aspect within a review, while classical sentiment analysis predicts a single polarity for the whole review. For example, given "I like the food here but the environment is terrible", aspect-level sentiment analysis gives separate sentiment for the food and the environment, yielding more fine-grained results.

The tag hierarchy is established by combining neural topic modeling and graph-based keyword extraction. For example, "good quality" and "bad quality" describe the same property of a product, so they are grouped together as "quality related". "Quick delivery" and "terrible packaging" concern the same topic of "customer service", so they are grouped under that topic. By combining different extraction methods at multiple granularities, a three-layer tag hierarchy is established.

Section 5 Conclusion

In this blog, we demonstrate step by step how to build an industrially scalable product review tag extraction model. We state the motivation for the project, give a quick literature review of the research domain, then illustrate the whole framework of the algorithm and explain the major steps in detail. The results look very promising, and we plan to make these features available online soon.

We’re recruiting

Find out about the exciting opportunities at THG here:


  1. Mining and Summarizing Customer Reviews
  2. Extracting Product Features and Opinions from Reviews
  3. Movie Review Mining and Summarization
  4. TextRank: Bringing Order into Texts
  5. TopicRank: Graph-Based Topic Ranking for Keyphrase Extraction
  6. Simple Unsupervised Keyphrase Extraction using Sentence Embeddings
  7. Corpus-independent Generic Keyphrase Extraction Using Word Embedding Vectors
  8. An unsupervised aspect-sentiment model for online reviews
  9. Aspect extraction through semi-supervised modeling
  10. An Unsupervised Neural Attention Model for Aspect Extraction
  11. Extracting Opinion Targets and Opinion Words from Online Reviews with Graph Co-ranking

THG Tech Blog

THG is one of the world’s fastest growing and largest online retailers. With a world-class business, a proprietary technology platform, and disruptive business model, our ambition is to be the global digital leader.


    Written by

    Data Scientist @ THG, We’re recruiting — thg.com/careers
