Machine Learning, NLP: Text-Classification with Amazon review data using Python3, step-by-step tutorial.

Text classification is one of the active research topics in Natural Language Processing (NLP). This article presents a supervised approach to the problem, that is, a model that learns from labeled data.

The complete source code used in this article is available here.

First Step: Gathering Dataset

The Amazon Review Dataset is a useful resource for practice. Here, we choose a smaller subset — Clothing, Shoes and Jewelry — for demonstration.

The format is one review per line, in JSON. The fields we need are:


  • reviewText — text of the review
  • overall — rating of the product

If you are not familiar with JSON files, you can convert the file to CSV by following the code below.
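As a minimal sketch with pandas (the file names here are placeholders, and the two sample reviews are made up for demonstration):

```python
import pandas as pd

# Tiny stand-in for the one-review-per-line JSON file; in practice this
# would be the dataset file you downloaded.
sample = (
    '{"reviewText": "Great shoes, very comfortable", "overall": 5}\n'
    '{"reviewText": "Fell apart after a week", "overall": 1}\n'
)
with open("reviews.json", "w") as f:
    f.write(sample)

# lines=True tells pandas that each line is a standalone JSON object.
df = pd.read_json("reviews.json", lines=True)

# Keep only the two fields we need and write them out as CSV.
df[["reviewText", "overall"]].to_csv("reviews.csv", index=False)
print(df.shape)  # (2, 2)
```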

Since the value of overall ranges from 1 to 5, for convenience's sake we treat this as a binary classification of reviewText — negative and positive, labeled -1 and +1 respectively.

Here are the rules:
1. If the rating is 1 or 2, the reviewText is treated as negative, labeled -1.
2. If the rating is 4 or 5, the reviewText is treated as positive, labeled +1.
3. Ignore all reviewText with the rating 3, since it expresses a neutral sentiment.
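The three rules above can be sketched as a small labeling function (the name label_rating is mine):

```python
def label_rating(overall):
    """Map a 1-5 star rating to -1 (negative), +1 (positive), or None (neutral)."""
    if overall <= 2:
        return -1       # rule 1: 1 or 2 stars -> negative
    if overall >= 4:
        return 1        # rule 2: 4 or 5 stars -> positive
    return None         # rule 3: 3 stars is neutral and gets dropped

ratings = [5, 1, 3, 4, 2]
labels = [label_rating(r) for r in ratings]
print(labels)  # [1, -1, None, 1, -1]
```

Reviews whose label is None would then be filtered out before training.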

Splitting Dataset into Training and Testing sets

Training set — a subset to train the model.
Testing set — a subset to test the trained model.

Using scikit-learn, we can split the data easily.
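A minimal sketch with train_test_split; the variable names follow the ones used later in this article, the tiny review list is made up, and the 80/20 split ratio is an assumption:

```python
from sklearn.model_selection import train_test_split

texts  = ["loved it", "terrible fit", "great quality", "broke in a day", "very comfy"]
labels = [1, -1, 1, -1, 1]

# Hold out 20% of the reviews for testing; random_state makes the split reproducible.
train_X, test_X, train_y, test_y = train_test_split(
    texts, labels, test_size=0.2, random_state=42
)
print(len(train_X), len(test_X))  # 4 1
```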

Second Step: Text Data Processing

The second and most important step is cleaning the dataset. A good model depends not only on the algorithm, but above all on a clean dataset. There are many tactics in text data processing, such as:

  • Remove non-alphanumeric characters, except for white space.
  • Convert all characters to lowercase, in order to treat the words such as “Hello”, “hello”, “HELLO”, “HEllO” all the same.
  • Consider tokenization, stemming, and lemmatization.

and so on.
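The first two tactics can be sketched in a few lines (the regex and the function name are my own; stemming and lemmatization would need a library such as NLTK):

```python
import re

def clean_text(text):
    # Drop everything except letters, digits, and whitespace...
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    # ...then lowercase so "Hello", "HELLO", and "HEllO" all collapse to "hello".
    return text.lower().strip()

print(clean_text("HEllO!!! These shoes are GREAT :-)"))  # hello these shoes are great
```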

In addition to cleaning the dataset, we must represent the text as a vector of numbers, since machine learning models only take numeric values as input.

Bag-of-words model

Here, we use the most common encoding method, the Bag-of-words model, which represents a text as the bag (multiset) of its words.

There are two stages — Tokenizing and Vectorizing.

Tokenization is the task of chopping a sentence up into words, called tokens.

For example, if we tokenize three sentences S1, S2, and S3, we get a token_set; each element in it is called a token.

The next stage is vectorization, which creates a vector to represent a text.
For example, take S3: “Joe likes horror movies and action movies” and count the frequency of each word from the token_set.

"John": 0
"Mary": 0
"Joe": 1
"likes": 1
"to": 0
"watch": 0
"movies": 2
"too": 0
"horror": 1
"and": 1
"action": 1
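Counting S3's tokens against the token set reproduces the frequencies above (the token order follows the listing):

```python
from collections import Counter

# The token_set built from S1-S3, in the order of the listing above.
token_set = ["John", "Mary", "Joe", "likes", "to", "watch",
             "movies", "too", "horror", "and", "action"]

s3 = "Joe likes horror movies and action movies"
counts = Counter(s3.split())  # tokenize by whitespace

# One entry per token: how often it occurs in S3.
vector = [counts[token] for token in token_set]
print(vector)  # [0, 0, 1, 1, 0, 0, 2, 0, 1, 1, 1]
```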

Thus, the vector of S3 can be represented as:

[0, 0, 1, 1, 0, 0, 2, 0, 1, 1, 1]
In scikit-learn, CountVectorizer is a good tool for constructing the Bag-of-words model, encoding text data into vector form.

Note that the parameter token_pattern is a regular expression defining what constitutes a “token”; setting token_pattern=r'\b\w+\b' means every whole word (including single-character words, which the default pattern drops) is taken as a token.

Applying the fit_transform method to the training set (i.e. train_X), we get the vocabulary dictionary (i.e. token_set) and the vectors of all reviewText (i.e. train_vector).

To make the result easier to understand, we visualize the form of train_vector and mark the indices, as shown in the following table.

Applying the transform method to the test set (i.e. test_X), we get the vectors of all reviewText with the same token_set (i.e. test_vector).
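A compact sketch of both calls, using S3 from earlier as the "training" text (the second sentence is made up just to show transform):

```python
from sklearn.feature_extraction.text import CountVectorizer

# \b\w+\b also keeps single-character words, which the default pattern drops.
vectorizer = CountVectorizer(token_pattern=r"\b\w+\b")

# fit_transform learns the vocabulary and vectorizes in one call...
train_vector = vectorizer.fit_transform(["Joe likes horror movies and action movies"])

# ...while transform reuses that same vocabulary on unseen text.
test_vector = vectorizer.transform(["Joe likes action"])

print(sorted(vectorizer.vocabulary_))  # ['action', 'and', 'horror', 'joe', 'likes', 'movies']
print(train_vector.toarray())          # [[1 1 1 1 1 2]]
print(test_vector.toarray())           # [[1 0 0 1 1 0]]
```

Words in the test text that never appeared during fitting are simply ignored, since they have no column in the learned vocabulary.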

Back to the Amazon dataset: CountVectorizer performs the pre-processing on text data that we mentioned at the beginning of this section before creating the vectors, so we don’t need to clean the data ourselves.

The code is:
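A minimal sketch of this step, with tiny made-up reviews standing in for the real train_X and test_X from the split:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Stand-ins for the real train/test reviewText lists.
train_X = ["Great shoes very comfortable", "Fell apart after a week",
           "Love the color and the fit", "Cheap material do not buy"]
test_X = ["comfortable shoes great fit"]

vectorizer = CountVectorizer(token_pattern=r"\b\w+\b")
train_vector = vectorizer.fit_transform(train_X)  # learn vocabulary + vectorize train set
test_vector = vectorizer.transform(test_X)        # vectorize test set with the same vocabulary

print(train_vector.shape, test_vector.shape)
```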

Final Step — Model Constructing

For this classification problem, we use the popular Logistic Regression model for demonstration. Below is how we utilize the model.
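A sketch of the final step, assuming the vectors and labels from the previous sections (tiny made-up data here so the example is self-contained):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train_X = ["great shoes love them", "terrible quality fell apart",
           "very comfortable great fit", "cheap material do not buy"]
train_y = [1, -1, 1, -1]
test_X = ["great comfortable shoes"]

vectorizer = CountVectorizer(token_pattern=r"\b\w+\b")
train_vector = vectorizer.fit_transform(train_X)
test_vector = vectorizer.transform(test_X)

model = LogisticRegression()
model.fit(train_vector, train_y)           # train on the labeled vectors
print(model.predict(test_vector))          # predicted label(s) for the test review
print(model.score(train_vector, train_y))  # accuracy on the training set
```

On the real dataset you would score the model on test_vector and the held-out test labels instead of the training set.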

If you want to know more about logistic regression, please check here.

I hope this article is helpful to you, and if you like it, please give me a 👏. Any feedback, thoughts, comments, suggestions, or questions are welcome!

Graduate student in Computer Science at NCKU