Analyzing Customer reviews using text mining to predict their behaviour

Sowmya Vivek
Aug 22, 2018 · 18 min read

Analyzing customer reviews to predict if a customer will recommend the product


Text mining is the process of examining large collections of text and converting the unstructured text data into structured data for further analysis, such as visualization and model building. In this article, we will use the power of text mining to do an in-depth analysis of customer reviews on an e-commerce clothing site.
Customer reviews are a great source of the “voice of the customer” and can offer tremendous insights into what customers like and dislike about a product or service. For an e-commerce business, customer reviews are critical, since existing reviews heavily influence the buying decisions of new customers, who cannot see and feel the actual product before purchase.

About the data set

The dataset that we will be using for this article is from Kaggle and comes from a Women’s Clothing E-Commerce site, revolving around reviews written by customers.
This dataset includes 23486 rows and 10 feature variables. Each row corresponds to a customer review, and includes the variables:

  • Clothing ID: Integer Categorical variable that refers to the specific piece being reviewed.
  • Age: Positive Integer variable of the reviewer’s age.
  • Title: String variable for the title of the review.
  • Review Text: String variable for the review body.
  • Rating: Positive Ordinal Integer variable for the product score granted by the customer from 1 Worst, to 5 Best.
  • Recommended IND: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.
  • Positive Feedback Count: Positive Integer documenting the number of other customers who found this review positive.
  • Division Name: Categorical name of the product high level division.
  • Department Name: Categorical name of the product department name.
  • Class Name: Categorical name of the product class name.

Based on the variables, there are several supervised and unsupervised techniques that could be performed on the above dataset to yield insights on customer preferences. However, we will limit the scope of this blog to using text mining extensively to analyze the customer reviews.

We will be using the following techniques to understand various aspects of text mining:

  • Exploratory analysis of text data (Review Text) individually and based on how it impacts the customer decision to recommend the product (Recommended IND)
  • Classification models that are built based on the review text as the independent variable to predict whether a customer recommends a product

Since the objective of the blog is more to understand text mining, the focus will be to understand differences between customers who recommend a product and those who don’t rather than predicting the customer action based on the review. In other words, we will be focusing more on variable importance and coefficient scores of the models than model performance measures.

In terms of text mining approaches, there are 2 broad categories:

  • Semantic parsing, where the word sequence, word usage as noun or verb, hierarchical word structure, etc. matter
  • Bag of words, where every word is analysed as a single token and order does not matter

Our exercise will only be limited to the “bag of words” approach and will not look into semantic parsing.

High-level approach of the text mining process

STEP1 — Text extraction & creating a corpus

The packages required for text mining are loaded in the R environment:
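Based on the functions used later in the article, the load step looks roughly like this (the exact package list is an assumption):

```r
# Packages used throughout the text mining walkthrough
library(tm)           # corpus creation, DTM/TDM, pre-processing
library(SnowballC)    # document stemming
library(wordcloud)    # word clouds, commonality and comparison clouds
library(RColorBrewer) # color palettes for the clouds
library(plotrix)      # pyramid.plot for the polarized tag plot
```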

Once the required packages are installed, the working directory is set and the csv files are read into R:

The argument ‘stringsAsFactors’ is an argument to the ‘data.frame()’ function in R (and to ‘read.csv()’, which builds a data frame). It is a logical argument that indicates whether strings in a data frame should be treated as factor variables or as plain strings. For text mining, we typically set it to FALSE so that the characters are treated as strings, enabling us to use all the text mining techniques appropriately. It is set to TRUE if we plan to use the variable as a categorical variable.
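A minimal sketch of this step; the file name follows the Kaggle dataset and the working directory path is a placeholder:

```r
setwd("~/text-mining")  # placeholder path; set to your own directory
reviews <- read.csv("Womens Clothing E-Commerce Reviews.csv",
                    stringsAsFactors = FALSE)
```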

The column Review.Text contains the customer reviews received for various products. This is the focus for our analysis. We will now try to understand how to represent text as a data frame.

  1. First, the Review.Text column is converted into a collection of text documents, or “corpus”.
  2. To convert the text into a corpus, we use the “tm” package in R.
  3. In order to create a corpus using tm, we need to pass a “Source” object as a parameter to the VCorpus method.
  4. A source object is an abstraction of an input location. The source we use here is a “VectorSource”, which inputs only character vectors.
  5. The Review.Text column is thus converted to a corpus that we call “corpus_review”.
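The five steps above reduce to a couple of lines of tm code (`reviews` is the data frame read from the csv):

```r
library(tm)
# VectorSource treats each element of the character vector as one document
corpus_review <- VCorpus(VectorSource(reviews$Review.Text))
```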

STEP2 — Text Pre-processing

The ultimate objective of any text mining process using the “bag-of-words” approach is to convert the text to be analysed into a data frame which consists of the words used in the text and their frequencies. These are defined by the document term matrix (DTM) and the term document matrix (TDM), which we will look into in the subsequent sections.
To ensure that the DTM and TDM are cleaned up and represent the core set of relevant words, a set of pre-processing activities need to be performed on the corpus. This is similar to the data clean-up done for structured data before data mining. The following are some of the common pre-processing steps:

  1. Convert to lower case — this way, if there are 2 words “Dress” and “dress”, it will be converted to a single entry “dress”

2. Remove punctuation:

3. Remove stopwords: “stopwords” is a very important concept to understand while doing text mining. Written text generally contains a large number of prepositions, pronouns, conjunctions, etc. These words need to be removed before we analyse the text; otherwise, stopwords will dominate the frequently used words list and obscure the core words used in the text. The list of common English stopwords can be viewed with this command: stopwords(“en”)

We might also want to remove custom stopwords based on the context of the text mining. These are words specific to the dataset that may not add value to the text.

4. Stemming: In linguistics, stemming is the process of reducing inflected (or derived) words to their word stem, base or root form, generally a written word form.

The SnowballC package is used for document stemming. For example, “complicated”, “complication” and “complicate” will all be reduced to “complicat” after stemming. This ensures that the same word is not repeated as multiple variants, and only the root of the word is represented in the DTM and TDM.
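The four pre-processing steps can be chained with tm_map; the custom stopword list below is purely illustrative:

```r
corpus_review <- tm_map(corpus_review, content_transformer(tolower))
corpus_review <- tm_map(corpus_review, removePunctuation)
corpus_review <- tm_map(corpus_review, removeWords, stopwords("en"))
# Custom stopwords specific to this dataset (illustrative choices)
corpus_review <- tm_map(corpus_review, removeWords, c("also", "just", "dress"))
corpus_review <- tm_map(corpus_review, stemDocument)  # needs SnowballC
```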

Corpus content

The corpus object in R is a nested list. We can use the standard R syntax for lists to view the contents of the corpus.
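For example, double-bracket indexing pulls out a single document:

```r
corpus_review[[5]]           # the 5th document object and its metadata
content(corpus_review[[5]])  # the cleaned text of the 5th review
```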

We now have a text corpus which is cleaned and only contains the core words required for text mining. The next step is exploratory analysis. The first step in exploratory data analysis is to identify the most frequently used words in the overall review text.

Frequently used words in the corpus

The words “Love”, “fit”, “size”, etc are the most frequently used words.

STEP3 — Create the DTM & TDM from the corpus

The pre-processed and cleaned up corpus is converted into a matrix called the document term matrix.

A document-term matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.

The term-document matrix is a transpose of the document-term matrix. It is generally used for language analysis. An easy way to start analyzing the information is to change the DTM/TDM into a simple matrix using as.matrix().

The TDM can also be used to identify frequent terms and for subsequent visualizations of the review text.

Top 10 words from TDM
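A sketch of this step and of the frequency counts behind the plot; note that on a corpus this size, as.matrix() produces a large dense matrix and can be memory-hungry:

```r
review_dtm <- DocumentTermMatrix(corpus_review)  # documents x terms
review_tdm <- TermDocumentMatrix(corpus_review)  # terms x documents

# Row sums of the TDM give the corpus-wide frequency of each term
term_freq <- sort(rowSums(as.matrix(review_tdm)), decreasing = TRUE)
barplot(term_freq[1:10], las = 2, col = "steelblue",
        main = "Top 10 words from TDM")
```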

STEP4 — Exploratory text analysis

Word cloud is a common way of visualizing a text corpus to understand the frequently used words. The word cloud varies the size of the words based on the frequency.

Word cloud based on word frequencies

The word cloud can also receive a set of colors or a color palette as input to distinguish between the more and the lesser frequent words in the cloud.
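A sketch using the wordcloud package with a brewer palette; the frequency vector is recomputed here from the cleaned corpus:

```r
library(wordcloud)
library(RColorBrewer)
term_freq <- sort(rowSums(as.matrix(TermDocumentMatrix(corpus_review))),
                  decreasing = TRUE)
wordcloud(names(term_freq), term_freq, max.words = 50,
          colors = brewer.pal(8, "Dark2"))
```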

One of the main objectives of this study is to analyse the difference in keywords between those who recommend and those who don’t recommend the product. For this purpose, we will create 2 corpora — one for Recommend-yes and another for Recommend-no. All the pre-processing steps done previously are repeated for both the corpora. Then the frequently used words are plotted as separate bar plots and word clouds for each of the corpora to understand the difference in the words used by customers who recommend a product vs those who don’t.

Frequently used words by customers who recommend (Green) vs those who don’t recommend (Red) the product

Another way of comparing the word sets is to combine the corpora for yes and no and create clouds which display both sets of words together. For this, we will use 2 more versions of the word cloud: the commonality cloud and the comparison cloud. The commonality cloud combines the two corpora into a single one, finds words shared by both and plots a word cloud of the shared words.

Commonality cloud

The comparison cloud on the other hand will identify dissimilar words used between the 2 corpora.

Comparison cloud
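A sketch of both clouds: each class’s reviews are collapsed into one big document so the TDM has exactly two columns (the pre-processing steps are omitted here for brevity):

```r
library(tm)
library(wordcloud)
yes_text <- paste(reviews$Review.Text[reviews$Recommended.IND == 1],
                  collapse = " ")
no_text  <- paste(reviews$Review.Text[reviews$Recommended.IND == 0],
                  collapse = " ")
all_tdm <- TermDocumentMatrix(VCorpus(VectorSource(c(yes_text, no_text))))
all_m <- as.matrix(all_tdm)
colnames(all_m) <- c("Yes", "No")

commonality.cloud(all_m, max.words = 50, colors = "steelblue")
comparison.cloud(all_m, max.words = 50, colors = c("green", "red"))
```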

The comparison cloud gives a clear contrast of words used by people who are happy with the product compared to those who are not. The people who have not recommended the product have used negative words like disappoint, return, cheap, look etc.

Another interesting aspect to notice is that the “No” list of words contains “fabric” while the “Yes” list contains “fit”. This could imply that most people who are unhappy with the product are unhappy with the fabric, while customers who recommend a product are happy with the fit. Overall, fabric quality could be a bigger problem than fit, and the business can analyse along these lines.

A polarized tag plot is an improved version of the commonality cloud. It determines the frequency of a term used in both the corpora under comparison. The difference in frequencies of common words might be insightful in many cases.
For this plot, the plotrix package is loaded. First, a matrix of common words is created, subset so that it contains only words occurring in both classes. A column for the absolute difference in frequency between the two corpora is then added for each word, and the plot is made.
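The steps described above translate roughly to the sketch below; the two-column term matrix is rebuilt from one collapsed document per class, and parameters such as gap are illustrative:

```r
library(plotrix)
# Two-column term matrix: one collapsed document per class
yes_text <- paste(reviews$Review.Text[reviews$Recommended.IND == 1],
                  collapse = " ")
no_text  <- paste(reviews$Review.Text[reviews$Recommended.IND == 0],
                  collapse = " ")
all_m <- as.matrix(TermDocumentMatrix(VCorpus(VectorSource(c(yes_text,
                                                             no_text)))))

# Keep only words occurring in both classes, ranked by absolute difference
common_words <- subset(all_m, all_m[, 1] > 0 & all_m[, 2] > 0)
difference   <- abs(common_words[, 1] - common_words[, 2])
common_words <- common_words[order(difference, decreasing = TRUE), ]

top25 <- data.frame(yes    = common_words[1:25, 1],
                    no     = common_words[1:25, 2],
                    labels = rownames(common_words)[1:25])
pyramid.plot(top25$yes, top25$no, labels = top25$labels,
             top.labels = c("Yes", "Words", "No"),
             main = "Words in common", gap = 500)  # tune gap to your counts
```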

Polarized tag plot

In our dataset, the proportion of Recommend-Yes to Recommend-No is not balanced: about 82% of customers have recommended the product, so the “No” side will have relatively fewer words. In such an imbalanced dataset, it could also be useful to use the absolute % difference between the 2 groups rather than just the absolute difference.

Word clustering is used to identify word groups used together, based on frequency distance. This is a dimension reduction technique. It helps in grouping words into related clusters. Word clusters are visualized with dendrograms.
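A sketch of this step: sparsity is reduced first so that the distance matrix stays manageable (the 0.95 threshold is illustrative):

```r
tdm_small <- removeSparseTerms(TermDocumentMatrix(corpus_review),
                               sparse = 0.95)
hc <- hclust(dist(as.matrix(tdm_small)))  # distances between term frequency vectors
plot(hc, main = "Dendrogram of word clusters")
```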

Dendrogram of word clusters

The cluster dendrogram shows how certain words are grouped together. For example, “soft”, “material” & “comfort” have been used together. Since the clustering is based on the frequency distances, the cluster indicates which set of words are used together most frequently.

Word association is a way of calculating the correlation between 2 words in a DTM or TDM. It is yet another way of identifying words used together frequently. For our corpus, the word association plot indicates correlation between various words and the word “fit”.
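tm’s findAssocs() computes these correlations directly; the 0.05 cutoff is illustrative:

```r
associations <- findAssocs(TermDocumentMatrix(corpus_review),
                           terms = "fit", corlimit = 0.05)
associations$fit  # named vector of terms correlated with "fit"
```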

Word associations with the word “fit”

The word “fit” has the greatest association with “perfect” and “size”, which is the positive aspect of the product. The third highest word associated with “fit” is “loos” (the stem of “loose”), which indicates a negative aspect of the product.

All the analysis that we have done so far has been based on single words, called unigrams. However, it can be very insightful to look at sequences of multiple words. These are called N-grams in text mining, where N stands for the number of words; for example, a bi-gram contains 2 words.
We will now look at how to create bi-grams and tri-grams and perform some exploratory analysis on them.

Top 10 bi-grams
Top 10 tri-grams
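One common way to build n-gram matrices is an RWeka tokenizer passed to the TDM constructor; this is a sketch rather than the article’s exact code, and RWeka requires Java:

```r
library(RWeka)
bigram_tokenizer  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

bigram_tdm  <- TermDocumentMatrix(corpus_review,
                                  control = list(tokenize = bigram_tokenizer))
bigram_freq <- sort(rowSums(as.matrix(bigram_tdm)), decreasing = TRUE)
barplot(bigram_freq[1:10], las = 2, main = "Top 10 bi-grams")
```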

All the analysis done on the unigrams in the previous sections can also be done on the bi- and tri-grams, to get more insights on the text. Let us now explore how the bi-grams vary between products recommended and not recommended.

Barplot of frequently used tri-grams: Yes vs No

STEP5 — Feature extraction by removing sparsity

Sparsity is related to the document frequency of a term. In a DTM, since the terms form the columns, there is one column per term (a unigram, bi-gram, tri-gram, etc.). The number of columns in the matrix equals the count of unique terms in the corpus, which is relatively high.
For infrequent terms, the column will be zero in most documents. This is called sparsity. Before any classification exercise involving the DTM, it is recommended to treat sparsity.
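With tm, removeSparseTerms() drops terms above a sparsity threshold; the threshold below is illustrative:

```r
review_dtm <- DocumentTermMatrix(corpus_review)
dim(review_dtm)  # thousands of term columns
# Keep terms appearing in at least ~1% of documents
review_dtm_small <- removeSparseTerms(review_dtm, sparse = 0.99)
dim(review_dtm_small)  # far fewer columns
```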

The exploratory text analysis has given several insights based on the customer reviews. We will now use the same review text as a predictor variable to predict whether the product will be recommended by the customer. In terms of classification algorithms, there is not much difference between structured data and text input. We will try 3 of the most popular classification algorithms: CART, Random Forest and lasso logistic regression.
In terms of converting the text into a form suitable for modeling, we could use the same packages mentioned above. However, to get readers familiar with more text mining packages, we will be using a slightly different approach.

Tokenisation is the process of decomposing text into distinct pieces or tokens.

This is what we called bag-of-words in the previous section. Once tokenisation and all the pre-processing are done, it is possible to construct a data frame where each row represents a document, each column represents a distinct token, and each cell gives the count of that token in a document. This is the DTM that we learnt about in the previous section.

The pre-processing steps are carried out on the tokens, similar to what was done on the text corpus.

The tokens are now converted to a document-feature matrix (dfm) and treated for sparsity.
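The tokenisation route described here matches the workflow of the quanteda package; a sketch, assuming the `reviews` data frame from earlier:

```r
library(quanteda)
toks <- tokens(reviews$Review.Text, remove_punct = TRUE)
toks <- tokens_tolower(toks)
toks <- tokens_remove(toks, stopwords("en"))
toks <- tokens_wordstem(toks)

review_dfm <- dfm(toks)
# Treat sparsity: keep features occurring in at least 1% of documents
review_dfm <- dfm_trim(review_dfm, min_docfreq = 0.01, docfreq_type = "prop")
```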

STEP6 — Building the Classification Models

We now have the dfm which is pre-processed, treated and ready to be used for classification. To use this in a classification model, the following steps are carried out:

We will first use the CART algorithm for classification. First, the complete tree is built and the optimum cp value, for which the cross-validated error is minimum, is identified. This cp value is then used to obtain the pruned tree, which is plotted to understand the classification.
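A sketch of the CART step with rpart; the modeling frame is rebuilt from the sparsity-treated DTM, and the names are illustrative:

```r
library(rpart)
library(rpart.plot)

dtm <- removeSparseTerms(DocumentTermMatrix(corpus_review), sparse = 0.99)
df  <- as.data.frame(as.matrix(dtm))
df$Recommended <- as.factor(reviews$Recommended.IND)

# Grow a full tree, pick the cp with minimum cross-validated error, prune
cart_model  <- rpart(Recommended ~ ., data = df, method = "class", cp = 0.001)
best_cp     <- cart_model$cptable[which.min(cart_model$cptable[, "xerror"]), "CP"]
cart_pruned <- prune(cart_model, cp = best_cp)
prp(cart_pruned)  # plot the pruned tree
```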

CART of the review text

As can be seen from the tree plot, words like “return”, “disappoint”, “back”, “huge”, etc are used by unhappy customers — i.e., customers who do not recommend the product. The tree can be interpreted further to understand the word patterns used by customers who recommend the product vs those who don’t.

The next classification algorithm we will use is the Random forest. We will examine the varimp plot of the randomforest model to understand which words affect the classification the most.
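A sketch with the randomForest package (ntree kept small for speed; names illustrative):

```r
library(randomForest)

dtm <- removeSparseTerms(DocumentTermMatrix(corpus_review), sparse = 0.99)
df  <- as.data.frame(as.matrix(dtm))
df$Recommended <- as.factor(reviews$Recommended.IND)

rf_model <- randomForest(Recommended ~ ., data = df, ntree = 100)
varImpPlot(rf_model, n.var = 20, main = "Variable importance")
```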

VarImp plot of Random Forest Model

Consistent with the CART model, the varimp plot of the Random Forest model also indicates that “return” and “disappoint” are the most important variables. Other important words like “unfortunate” and “sad” are likewise negative words typically used by unhappy customers.

The next classification algorithm we will be using is logistic regression. As we discussed under sparsity, the main challenge with text mining data frames is the very high number of columns or features. This adversely impacts models like logistic regression. Hence we will use lasso regularisation for feature reduction, and subsequently logistic regression for model building and classification. The odds ratios of the logistic regression model will yield several useful insights on the classification.

Based on the lasso regression, we arrive at the lambda.min value to use for the logistic regression. An examination of the coefficient matrix throws light on the feature reduction performed by the lasso: features which need not be included in the model have a coefficient of zero.
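A simplified sketch with glmnet: cv.glmnet picks lambda.min by cross-validation, and the coefficients at lambda.min are written out to csv files (file and variable names are illustrative):

```r
library(glmnet)

dtm <- removeSparseTerms(DocumentTermMatrix(corpus_review), sparse = 0.99)
x <- as.matrix(dtm)
y <- as.factor(reviews$Recommended.IND)

# Cross-validated lasso (alpha = 1) logistic regression
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)
lasso_coef <- as.matrix(coef(cv_fit, s = "lambda.min"))
write.csv(lasso_coef, "logreg_coef.csv")

# Odds ratios follow directly from the coefficients
write.csv(exp(lasso_coef), "logreg_odds_ratio.csv")
```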

For ease of explanation, the coefficients have been written to a csv file. The csv file will be saved to the working directory that was set in the beginning. Let us examine the csv file for more clarity.

If the csv file is filtered on the coefficient column (column B) for zero, we find that 84 of the 285 features (x variables) have a coefficient of zero, i.e., they have been removed by the lasso model.

The reduced set of features is used to build the logistic regression model. From the coefficients of the logistic regression model, we can calculate the odds ratio using the formula:
Odds ratio = exp(coef(LR model))

The odds ratio is a unique advantage of probability-based models like logistic regression. We will use the odds ratio to understand the influence of the various x variables on the classification.

As with the lasso coefficients, the odds ratios are also written to a csv file for better understanding. Let us examine the csv file for further insights. If the csv file for odds ratios is sorted on the odds ratio (column B) in descending order, we can examine the variables with the highest odds ratios:

The odds ratio is interpreted as follows: a product whose review contains “compliment” has 5.61 times the odds of being recommended compared to a product whose review does not contain the word. The odds ratios of the other words can be interpreted similarly.

Let us now examine the logreg_coef.csv file to understand some variables that have a negative coefficient:

The term “disappoint” has a negative coefficient indicating that if this term is present in the review, the probability that the product will be recommended is very low. The same interpretation can be made for the other terms with a negative coefficient.

All 3 classification models yield broadly similar insights into the terms used by happy customers who will recommend a product vs those used by unhappy customers. These could provide vital leads for improving product/service quality, thereby increasing the promoters of the product.

Despite being a relatively long read, this article covers only the tip of the iceberg of text mining. Its main objective is to introduce readers to the concepts of text mining. The article also attempts to consolidate the various libraries and code blocks required for fundamental text mining and text classification, so that readers can follow the steps to perform similar analyses on text content of interest to them.
