Improving Customer Buying Decisions Using Extractive Text Summarization (TextRank Algorithm)

Anuj Shah · Analytics Vidhya · Nov 15, 2019 · 8 min read

With the internet's growing role in fetching information and the availability of huge amounts of data, it has become important to filter content and condense abundant information into a summary that captures the key components, which can then be used to analyze and make decisions. In the age of technology, people learn about any piece of news by reading a short, condensed version of it.

Manually analyzing each segment of a text and generating summaries is both time-consuming and backbreaking. This is where automatic text summarization comes in: a technique that surfaces the most relevant and most highly voted responses can help the audience absorb the information and saves them from wading through redundant data.

Online buying has become an everyday part of human life. Summarizing the reviews that different consumers write about a particular product can help a potential buyer judge the product's usefulness and its ability to satisfy consumer needs. Summarization using intelligent computer-based algorithms can capture the essential syntactic and semantic representations of the text. Such a system can help consumers make effective decisions about buying a product.

RELATED WORK

Hyperlink-Induced Topic Search (HITS)

The Hyperlink-Induced Topic Search algorithm, developed by Jon Kleinberg, is used to rank web pages for a search query. The same algorithm can be applied to text summarization: instead of web pages, we rank the sentences in the text. Words are represented by their Term Frequency-Inverse Document Frequency (TF-IDF) values, and sentence similarity is calculated using cosine similarity. A sentence is considered to point to another sentence if the similarity between the pair exceeds a threshold value. The two scores used in the algorithm are the authority value and the hub value.

Authority value: A sentence's authority score is the sum of the hub values of the sentences that point to it. A sentence is said to possess authority if it is pointed to by other sentences[3].

Hub value: A sentence's hub score is the sum of the authority values of the sentences it points to[3].

Together, the two values imply that a sentence receives a higher score when it is pointed to by sentences with high hub values. Initially, the hub value of every sentence is 1, and the authority and hub update rules are applied until the scores converge. A minimal sketch of these updates is shown below.
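The following NumPy sketch of the update rules assumes the binary adjacency matrix adj has already been built by thresholding the pairwise cosine similarities of the TF-IDF vectors; the matrix construction and the fixed iteration count are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def hits_scores(adj, iterations=50):
    """Iterate the HITS update rules on a sentence adjacency matrix.

    adj[i, j] = 1 if sentence i "points to" sentence j, i.e. their
    similarity exceeds the chosen threshold.
    """
    n = adj.shape[0]
    hubs = np.ones(n)  # initial hub value of every sentence is 1
    for _ in range(iterations):
        # Authority update: sum the hub values of the sentences pointing in.
        authorities = adj.T @ hubs
        # Hub update: sum the authority values of the sentences pointed to.
        hubs = adj @ authorities
        # Normalise so the scores converge instead of growing without bound.
        authorities /= np.linalg.norm(authorities) or 1.0
        hubs /= np.linalg.norm(hubs) or 1.0
    return authorities, hubs
```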

PageRank Algorithm

The PageRank algorithm is used by Google to determine a webpage's relevance for a query. Webpages are considered connected when one contains a link to another. PageRank models the probability of jumping to a particular webpage[4]. A similar scheme can be imitated for text, turning PageRank into TextRank: the tokenized sentences become the nodes of our graph, and edge weights represent sentence similarity scores. An iterative formula is applied until the change in scores falls below a threshold value.

PR(A) = (1-d) + d (PR(T1)/C(T1) + … + PR(Tn)/C(Tn))

where d is the damping factor, T1…Tn are the pages that link to A, and C(Ti) is the number of outbound links on page Ti.

Shortest Path Algorithm

With plain ranking, the selected sentences may come from different parts of the text, so there is a possibility of sudden topic shifts in the generated summary. In the shortest-path approach, the sentences form the nodes, and edge weights are assigned based on the number of common words. The farther apart two sentences are in the original text, the higher the cost.

Sim(Si, Sj) = |{wk | wk ∈ Si and wk ∈ Sj}| / (log(|Si|) + log(|Sj|))

where wk ranges over the words common to both sentences, and |Si| denotes the number of words in sentence Si.

The goal is to start from the first sentence of the original text and reach the last sentence by choosing the cheapest path, discarding paths that contain loops. A sketch of this idea follows.
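The sketch below uses networkx and Dijkstra's algorithm. It assumes every sentence is non-empty and that a forward path from the first to the last sentence exists, and it simply uses the inverse of the similarity above as the edge cost; that cost function is an assumption, not prescribed by the approach.

```python
import math
import networkx as nx

def shortest_path_summary(sentences):
    """Summarize by walking the cheapest path from first to last sentence."""
    words = [set(s.lower().split()) for s in sentences]
    g = nx.DiGraph()
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):  # forward edges only, so no loops
            common = len(words[i] & words[j])
            denom = math.log(len(words[i])) + math.log(len(words[j]))
            if common and denom > 0:
                sim = common / denom
                g.add_edge(i, j, weight=1.0 / sim)  # higher similarity, cheaper edge
    path = nx.dijkstra_path(g, 0, len(sentences) - 1)
    return " ".join(sentences[i] for i in path)
```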

PROPOSED WORK

This work focuses on an extractive, graph-based summarization technique. The objective is to examine the effect of summarization on product reviews and feedback. The procedure and results are discussed below. Extractive graph-based summarization is a five-step process:

  1. Text processing.
  2. Sentence scoring.
  3. Creating an intermediate graph representation of the sentences.
  4. Finding sentence rankings.
  5. Generating the summary[5].

DATASET SELECTION

In our proposed system, we work with the Amazon product reviews dataset, which contains reviews from a total of 5000 customers for 23 different products. The number of customer reviews for 5 of the products is shown in the figure below.

Fig. 1: Reviews per product (only 5 products)

TEXT PROCESSING AND REPRESENTATION

Grouping Reviews

The first step is to concatenate the reviews of all customers, grouping them by the product they bought. This gives us one large text of reviews per product, which can be used to generate the summary. We can use pandas' groupby together with string concatenation to do so; a sketch follows the figure.

Fig. 2: Snapshot of concatenating reviews of the same product.
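A minimal pandas sketch of the grouping step; the file name and the column names (product_name, review_text) are assumptions, so adjust them to the actual dataset schema.

```python
import pandas as pd

# Load the reviews; column names below are assumptions about the schema.
reviews = pd.read_csv("amazon_reviews.csv")

# One large review text per product, built by joining every customer review.
grouped = (
    reviews.groupby("product_name")["review_text"]
           .apply(" ".join)
           .reset_index()
)
```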

Sentence Splitting

From the generated text, we split each product's text into sentences. Sentence splitting is done to store each review sentence and index it so that, after further processing and TextRank, we can extract the original representation, which is syntactically meaningful. The sentences are stored in CSV format; a sketch follows the figure.

Fig. 3: Sentence Splitting
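Continuing from the grouping sketch above, sentence splitting with NLTK's sent_tokenize and indexed storage to CSV might look like this; the column names are again assumptions.

```python
import nltk
import pandas as pd
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)

rows = []
for _, row in grouped.iterrows():  # "grouped" comes from the previous snippet
    for idx, sentence in enumerate(sent_tokenize(row["review_text"])):
        rows.append({"product": row["product_name"],
                     "sentence_index": idx,   # lets us recover the original later
                     "sentence": sentence})

pd.DataFrame(rows).to_csv("sentences.csv", index=False)
```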

Decontraction

Reviews are usually written using short forms of words, or with two words combined. For example, some consumers may write "The product hasn't been up to my expectation" while others may write "The product has not been up to my expectation". To avoid the same sentences being identified as different, all contracted words are converted into their expanded forms.

Fig. 4: List of decontracted words
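A small sketch of dictionary-based decontraction; the map below is only an excerpt of the full list in Fig. 4.

```python
# Excerpt of the expansion map; the full list in Fig. 4 covers many more forms.
CONTRACTIONS = {
    "hasn't": "has not",
    "won't": "will not",
    "can't": "can not",
    "n't": " not",   # generic fallback, applied after the specific forms
    "'re": " are",
    "'ve": " have",
    "'ll": " will",
}

def decontract(text):
    for short, full in CONTRACTIONS.items():  # dicts keep insertion order
        text = text.replace(short, full)
    return text

print(decontract("The product hasn't been up to my expectation"))
# The product has not been up to my expectation
```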

Unwanted Words Removal

Punctuation marks, unwanted line breaks, brackets, and multiple spaces are common across a large set of reviews. These extra special characters may be due to human error or may serve to emphasize a point. Either way, they contribute nothing to summary formation and are therefore removed.
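A regex-based sketch of this cleanup step; the exact character set kept is an assumption.

```python
import re

def remove_noise(text):
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)  # punctuation, brackets, symbols
    text = re.sub(r"\s+", " ", text)             # line breaks and repeated spaces
    return text.strip()

print(remove_noise("Great   product!!! (really)\n5 stars"))
# Great product really 5 stars
```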

Spelling Corrector

Humans often make spelling mistakes while writing, whether it is a review or a large piece of text. This work makes use of Peter Norvig's spelling-corrector logic.

The first step is to edit the word by removing a letter, replacing a letter, swapping adjacent letters, or inserting a letter, and to collect all of these edits in a list. The probability of each word in the edited list is calculated as its frequency relative to the total number of words in a large reference document. From these probability values, we produce the candidates that are likely to replace the word, in order of priority (see the sketch after this list):

  1. The original word, if it is known.
  2. The list of known words at edit distance one.
  3. The list of known words at edit distance two.
  4. The original word, even though it is not known[6].
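The sketch below follows Norvig's published corrector closely; big.txt is the large frequency corpus he uses, so the file name is an assumption about the local setup.

```python
import re
from collections import Counter

# Word frequencies counted over a large reference text (Norvig uses big.txt).
WORDS = Counter(re.findall(r"\w+", open("big.txt").read().lower()))
TOTAL = sum(WORDS.values())

def P(word):
    return WORDS[word] / TOTAL  # relative frequency of the word

def known(words):
    return {w for w in words if w in WORDS}

def edits1(word):
    """All edits one operation away: deletes, transposes, replaces, inserts."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

def correction(word):
    # Priority: known word > known at distance 1 > distance 2 > the word itself.
    candidates = known([word]) or known(edits1(word)) or known(edits2(word)) or [word]
    return max(candidates, key=P)
```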

TEXT CLEANING

The processes listed above are applied to the sentences stored in CSV format, as seen in Fig. 3. The output is a cleaned and accurate representation of the original sentences, as shown below.

Fig. 5: Sentences after processing
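Tying the steps together, the cleaning pass over the stored sentences might look like this; it reuses decontract, remove_noise, and correction from the sketches above, along with the assumed column names.

```python
import pandas as pd

sentences = pd.read_csv("sentences.csv")
sentences["cleaned"] = (
    sentences["sentence"]
    .str.lower()
    .apply(decontract)
    .apply(remove_noise)
    .apply(lambda s: " ".join(correction(w) for w in s.split()))
)
sentences.to_csv("sentences_cleaned.csv", index=False)
```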

SENTENCE SIMILARITY

Synsets

Before calculating the similarity of all sentences in the generated text, it becomes necessary to find the synonyms of each word. In different sentences, the same meaning can be expressed using different words; for example, "plant" can also appear as "flora" or "plant_life". To get better results when generating the summary, we also have to consider the synonyms of words while calculating sentence similarity. In WordNet, these synonym sets are called synsets.

Fig. 6: Synset examples
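With NLTK's WordNet interface, the synsets of "plant" from the example can be listed like this:

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

# Each synset groups lemmas that share a meaning.
for synset in wn.synsets("plant"):
    print(synset.name(), "->", synset.lemma_names())
# plant.n.01 -> ['plant', 'works', 'industrial_plant']
# plant.n.02 -> ['plant', 'flora', 'plant_life']
# ...
```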

Similarity Calculation

To calculate the similarity between two sentences s1 and s2, first generate a list of all the words in each sentence. Then find the synsets of all the words and add them to two separate lists, synset1 and synset2. The algorithm for calculating the similarity between every pair of sentences is shown below.

Fig. 7: Calculating Similarity

If we have n sentences, the similarity matrix has dimensions n x n. The similarity matrix is stored in a .csv file; a sketch of the construction follows.
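The exact scoring used in Fig. 7 is not reproduced here; the sketch below shows one plausible synset-overlap measure and the construction of the n x n matrix.

```python
import numpy as np
from nltk.corpus import wordnet as wn

def expand_with_synsets(sentence):
    """Words of a sentence plus every lemma of every synset of those words."""
    tokens = set(sentence.lower().split())
    expanded = set(tokens)
    for token in tokens:
        for synset in wn.synsets(token):
            expanded.update(synset.lemma_names())
    return expanded

def sentence_similarity(s1, s2):
    """Overlap of the synset-expanded word sets, normalised by their sizes."""
    set1, set2 = expand_with_synsets(s1), expand_with_synsets(s2)
    if not set1 or not set2:
        return 0.0
    return len(set1 & set2) / (len(set1) + len(set2))

def build_similarity_matrix(sentences):
    n = len(sentences)
    matrix = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i, j] = matrix[j, i] = sentence_similarity(sentences[i], sentences[j])
    return matrix
```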

PageRank as TextRank

In TextRank, the vertices are the tokenized sentences, and the edges connecting the nodes are weighted using the similarity values. Starting from arbitrary initial values, the iterative formula is run for each node until the scores converge below a specified threshold value[9]. A sentence's TextRank score is computed from the weight of the edge between two vertices together with the weights of all edges between the sentence and its connected nodes.

Fig. 8: PageRank representation.

The textrank method takes three input parameters: the similarity matrix, the value of epsilon, and the damping factor (d). Each sentence is treated as a dangling web page, so the probability of jumping from one sentence to any other sentence is equal.

Fig. 9: Applying PageRank as TextRank

Damping factor: a web page contains some reference links, and the probability that the person surfing the page will click on one of those links is d. The probability that the user will instead stop at that particular webpage is (1-d). The same logic is applied to sentences. A sketch of the method follows.
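The sketch below takes exactly the three inputs mentioned (similarity matrix, epsilon, damping factor) and treats zero-similarity rows as dangling pages that jump to every sentence with equal probability; the handling of dangling rows is an implementation assumption.

```python
import numpy as np

def textrank(similarity, epsilon=1e-4, d=0.85):
    """Run PageRank-style power iteration over the sentence similarity graph."""
    n = len(similarity)
    # Normalise each row into a probability distribution; rows that sum to
    # zero (dangling sentences) jump to every sentence with equal probability.
    row_sums = similarity.sum(axis=1, keepdims=True)
    safe = np.where(row_sums > 0, row_sums, 1.0)
    transition = np.where(row_sums > 0, similarity / safe, 1.0 / n)
    scores = np.full(n, 1.0 / n)
    while True:
        new_scores = (1 - d) / n + d * transition.T.dot(scores)
        if np.abs(new_scores - scores).sum() < epsilon:  # converged below epsilon
            return new_scores
        scores = new_scores
```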

Summary Generation

The output of the TextRank algorithm is a ranking of the sentences. Depending on our requirements, we select the top n sentences, recover their original indices, and use the original sentences in our summary. Concatenating these sentences generates the summary.

Fig. 10: Summary generated for reviews as shown in Fig. 2
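A sketch of the selection step, reusing the textrank function above: the scores are computed on the cleaned sentences, while the summary itself is assembled from the original sentences in document order.

```python
import numpy as np

def summarize(originals, similarity, top_n=5):
    scores = textrank(similarity)  # from the previous sketch
    # Take the highest-scoring sentences, then restore document order
    # so the summary reads naturally.
    best = sorted(np.argsort(scores)[-top_n:])
    return " ".join(originals[i] for i in best)

# "sentences" is the DataFrame from the cleaning step above.
cleaned = sentences["cleaned"].tolist()
original = sentences["sentence"].tolist()
print(summarize(original, build_similarity_matrix(cleaned)))
```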

The GitHub implementation for the project is available at https://github.com/anuj200199/Autotext-Summarization-for-improving-customer-buying-decision

REFERENCES

  1. Changjian Fang, Dejun Mu, Zhenghong Deng, Zhiang Wu, "Word-Sentence Co-ranking for Automatic Extractive Text Summarization".
  2. Khushboo S. Thakkar, R. V. Dharaskar, M. B. Chandak, "Graph-Based Algorithms for Text Summarization", pp. 1-3.
  3. Fadhlil Hadyan, Shaufiah, Moch. Arif Bijaksana, "Comparison of Document Index Graph Using TextRank and HITS Weighting Method in Automatic Text Summarization".
  4. Albi Dode, Silvester Hasani, "PageRank Algorithm".
  5. Jai Prakash Verma, Atul Patel, "Evaluation of Unsupervised Learning Based Extractive Text Summarization Technique for Large Scale Review and Feedback Data".
  6. Harshit Pande, "Effective Search Space Reduction for Spell Correction Using Character Neural Embeddings".
  7. Rafael Ferreira, Gabriel Silva, Fred Freitas, Rinaldo Lima, Steven J. Simske, "Assessing Sentence Scoring Techniques for Extractive Text Summarization", Elsevier.
  8. Jai Prakash Verma, "Spark Frameworks and Architecture".
  9. Federico Barrios, Federico López, Luis Argerich, Rosita Wachenchauzer, "Variations of the Similarity Function of TextRank for Automated Summarization".

