Machine Learning, NLP: Text-Classification with Amazon review data using Python3, step-by-step tutorial.

Amber
Sep 22, 2018

Text classification is an active research topic in Natural Language Processing (NLP). This article presents a supervised way to solve the problem, that is, the model learns from labeled data.

The complete source code used in this article is available here.

First Step: Gathering the Dataset

The Amazon Review Dataset is a useful resource for practice. Here, we choose one of its smaller subsets, Clothing, Shoes and Jewelry, for demonstration.

The format is one review per line, in JSON.

{
  "reviewerID": "A2SUAM1J3GNN3B",
  "asin": "0000013714",
  "reviewerName": "J. McDonald",
  "helpful": [2, 3],
  "reviewText": "I bought this for my husband who plays the piano. He is having a wonderful time playing these old hymns. The music is at times hard to read because we think the book was published for singing from more than playing from. Great purchase though!",
  "overall": 5.0,
  "summary": "Heavenly Highway Hymns",
  "unixReviewTime": 1252800000,
  "reviewTime": "09 13, 2009"
}

where

  • reviewText — text of the review
  • overall — rating of the product

If you are not familiar with JSON files, you can convert the data to CSV by following the code below.
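
A minimal sketch using pandas; the file name below is an assumption, so point it at wherever you saved the dataset:

import pandas as pd

# The dataset is one JSON object per line, so lines=True is required.
df = pd.read_json("reviews_Clothing_Shoes_and_Jewelry_5.json", lines=True)
df.to_csv("reviews_Clothing_Shoes_and_Jewelry_5.csv", index=False)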

Since the value of overall ranges from 1 to 5, for convenience's sake we reduce the task to binary classification: each reviewText is treated as either negative or positive, labeled -1 or +1 respectively.

Here are the rules:
1. If the rating is 1 or 2, the reviewText is treated as negative and labeled -1.
2. If the rating is 4 or 5, the reviewText is treated as positive and labeled +1.
3. All reviewText entries with a rating of 3 are ignored, since they express a neutral sentiment.
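
A minimal sketch of these rules, assuming the pandas DataFrame df from the conversion step above:

# Rule 3: drop the neutral reviews first.
df = df[df["overall"] != 3.0]
# Rules 1 and 2: ratings 4-5 become +1, ratings 1-2 become -1.
df["label"] = df["overall"].apply(lambda r: 1 if r >= 4 else -1)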

Splitting the Dataset into Training and Testing Sets

Training set — a subset to train the model.
Testing set — a subset to test the trained model.

Using the scikit-learn package, we can split the data easily.
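
A minimal sketch, again assuming the labeled DataFrame df from above; test_size=0.2 and random_state=42 are arbitrary choices:

from sklearn.model_selection import train_test_split

# Hold out 20% of the reviews for testing.
train_X, test_X, train_y, test_y = train_test_split(
    df["reviewText"], df["label"], test_size=0.2, random_state=42
)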

Second Step: Text Data Processing

The second, and most important, step is cleaning the dataset. A good model depends not only on the algorithm but, above all, on a clean dataset. There are many tactics in text data processing, such as:

  • Remove non-alphanumeric characters, except for white space.
  • Convert all characters to lowercase, so that words such as “Hello”, “hello”, “HELLO”, and “HEllO” are all treated the same.
  • Consider tokenization, stemming, and lemmatization.

and so on.
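
For illustration, here is a minimal sketch covering the first two tactics (as we will see later, CountVectorizer can handle this for us):

import re

def clean(text):
    # Lowercase, then replace everything except letters, digits,
    # and white space with a space.
    text = text.lower()
    return re.sub(r"[^a-z0-9\s]", " ", text)

print(clean("Great purchase, though!"))  # great purchase  though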

In addition to cleaning the dataset, we must represent each text as a vector of numbers, since machine learning models take numeric values as input.

Bag-of-words model

Here, we use the most common encoding method, the Bag-of-words model, which encodes a text as the bag of its words.

There are two stages — Tokenizing and Vectorizing.

Tokenization is the task of chopping a sentence up into words, called tokens.

For example, if we tokenize three sentences S1, S2, and S3, we get a token_set; each element in it is called a token.

S1 = "John likes to watch movies"
S2 = "Mary likes movies too"
S3 = "Joe likes horror movies and action movies"
token_set = {"John", "Mary", "Joe", "likes", "to", "watch", "movies", "too", "horror", "and", "action"}
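
In Python, a toy sketch of this stage with a regular expression, using the S1, S2, and S3 defined above (the real pipeline later delegates tokenization to CountVectorizer):

import re

token_set = set()
for s in (S1, S2, S3):
    token_set.update(re.findall(r"\b\w+\b", s))
print(token_set)  # the 11 tokens above; set order is arbitrary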

The next stage is vectorization, which creates a vector to represent a text.
For example, take S3, “Joe likes horror movies and action movies”, and count the frequency of each word from the token_set.

"John": 0
"Mary”: 0
"Joe": 1
"likes": 1
"to": 0
"watch": 0
"movies": 2
"too": 0
"horror": 1
"and": 1
"action": 1

Thus, the vector of S3 can be represented as:

vector_S3 = [0, 0, 1, 1, 0, 0, 2, 0, 1, 1, 1]
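
We can reproduce this by hand with Python's collections.Counter, keeping the token order from the token_set above:

from collections import Counter

tokens = ["John", "Mary", "Joe", "likes", "to", "watch",
          "movies", "too", "horror", "and", "action"]
counts = Counter(S3.split())  # Counter returns 0 for missing tokens
vector_S3 = [counts[t] for t in tokens]
print(vector_S3)  # [0, 0, 1, 1, 0, 0, 2, 0, 1, 1, 1]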

CountVectorizer

In scikit-learn, CountVectorizer is a good tool to help us construct the Bag-of-words model, encoding the text data into vector form.

Note that the parameter token_pattern is a regular expression denoting what constitutes a “token”; setting token_pattern=r'\b\w+\b' means we take every whole word as a token.
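
A small sketch applying CountVectorizer to the toy sentences above; note that it lowercases by default and sorts the vocabulary alphabetically, so the column order differs from our hand-made token_set:

from sklearn.feature_extraction.text import CountVectorizer

sentences = [
    "John likes to watch movies",
    "Mary likes movies too",
    "Joe likes horror movies and action movies",
]
vectorizer = CountVectorizer(token_pattern=r"\b\w+\b")
vectors = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())
# ['action' 'and' 'horror' 'joe' 'john' 'likes' 'mary' 'movies' 'to' 'too' 'watch']
print(vectors.toarray()[2])  # S3: [1 1 1 1 0 1 0 2 0 0 0]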

Applying the fit_transform method to the training set (i.e. train_X), we get the vocabulary dictionary (i.e. the token_set) and the vectors of all reviewText entries (i.e. train_vector).

To help understand the result, picture train_vector as a table whose columns are indexed by the tokens of the vocabulary and whose rows are the encoded reviews.

Applying the transform method to the test set (i.e. test_X), we get the vectors of all its reviewText entries encoded with the same token_set (i.e. test_vector).

Back to the Amazon dataset: CountVectorizer performs basic pre-processing of the text (tokenization and lowercasing, as mentioned at the beginning of this section) before creating the vectors, so we don't need to clean the data ourselves.

A minimal version of the code, assuming the train_X and test_X splits from earlier, is:
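
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(token_pattern=r"\b\w+\b")
train_vector = vectorizer.fit_transform(train_X)  # learn the vocabulary and encode
test_vector = vectorizer.transform(test_X)        # reuse the same vocabulary

print(train_vector.shape)  # (number of training reviews, vocabulary size)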

Final Step: Model Construction

For the classification problem, we use the popular logistic regression model for demonstration. A minimal sketch of how we utilize it, assuming the vectors and labels from the previous steps, is shown below.
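
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=1000)  # raise max_iter so the solver converges
model.fit(train_vector, train_y)
print("test accuracy:", model.score(test_vector, test_y))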

If you want to know more about logistic regression, please check here.

I hope this article is helpful to you, and if you like it, please give me a 👏. Any feedback, thoughts, comments, suggestions, or questions are welcome!
