Topic Classification: Review on Indonesia E-commerce Dataset (TF-IDF and Logistic Regression vs FastText Skip-gram and LSTM)

Katarina Nimas Kusumawati
Feb 19, 2022


In the previous story, I shared how to extract topics from a review dataset using Latent Dirichlet Allocation. Now, we will predict the labels of those reviews by training them into classification models.

Before we start, let me explain why I chose two methods: as a baseline model I use TF-IDF with Logistic Regression, and as an improvement I use FastText Skip-gram with LSTM.

Data Source

We use the Google Play Store and App Store APIs as data sources, and the review data is then collected in Google BigQuery. We have reviews from November 2020 to November 2021, a total of 1,708,492 records.

If you want to know how we got the data, kindly check the link below:

Use this code to get the data from Google BigQuery:
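The original query is not included here, but as a rough sketch (assuming a hypothetical project and table name, not the author's actual ones), fetching the reviews with the official BigQuery client could look like this:

from google.cloud import bigquery

# Hypothetical project and table names for illustration only
client = bigquery.Client(project="my-project")
query = """
SELECT review_id, content, score, created_date
FROM `my-project.reviews.playstore_appstore_reviews`
WHERE created_date BETWEEN '2020-11-01' AND '2021-11-30'
"""
df = client.query(query).to_dataframe()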

Overview of the Data:

Distribution of the Data:

Preprocessing

  • Remove duplicate data

Before we go into modeling, we remove duplicate data, keeping only the most recent review by created_date for each review_id (review_id is unique).
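A minimal sketch of this step, assuming a pandas dataframe df with review_id and created_date columns:

# Keep only the most recent row for each review_id
df = (df.sort_values("created_date")
        .drop_duplicates(subset="review_id", keep="last")
        .reset_index(drop=True))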

  • Case folding

Lower casing is used so that the machine does not treat identical words as different just because of capitalization. For example, "Shop" and "shop" are the same word, but the machine may perceive them differently because one is capitalized and the other is not.

  • Remove punctuations

The punctuation here has no significant meaning, so it needs to be removed.
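A short sketch covering both case folding and punctuation removal, assuming the review text is in a column called content (a placeholder name):

import re

df["content"] = df["content"].str.lower()                                  # case folding
df["content"] = df["content"].apply(lambda t: re.sub(r"[^\w\s]", " ", t))  # remove punctuation
df["content"] = df["content"].str.replace(r"\s+", " ", regex=True).str.strip()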

  • Stemming

Stemming removes affixes from a word so that it is reduced to its root form. Stemming is not always a perfect way to recover the root word, but it is quite efficient. Very few libraries can do stemming in Indonesian; one of the best known is Sastrawi.
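A sketch of Indonesian stemming with the Sastrawi library (an assumption on my part; the exact library used may differ):

from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

stemmer = StemmerFactory().create_stemmer()
df["content"] = df["content"].apply(stemmer.stem)  # e.g. "pengiriman" -> "kirim"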

  • Remove stopwords

Stopwords are frequently repeated words with no special meaning, such as conjunctions; keeping them only adds noise to the model.
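A sketch of stopword removal, again assuming Sastrawi's built-in Indonesian stopword list:

from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

stop_words = set(StopWordRemoverFactory().get_stop_words())
df["content"] = df["content"].apply(
    lambda t: " ".join(w for w in t.split() if w not in stop_words)
)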

  • Delete documents that consist of only one word, because they do not contain a meaningful topic
  • Formalization

Formalization converts informal or slang words into a formal, easy-to-understand form. Here I use formalization to map brand-name variants onto a single word that can represent the brand.
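A sketch of the last two steps; the brand mapping below is a made-up example, not the actual dictionary used:

# Hypothetical formalization map: informal brand-name variants -> one canonical token
formal_map = {"tokped": "tokopedia", "shopee.id": "shopee"}
df["content"] = df["content"].apply(
    lambda t: " ".join(formal_map.get(w, w) for w in t.split())
)

# Drop documents that contain only a single word
df = df[df["content"].str.split().str.len() > 1].reset_index(drop=True)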

Process the Dataframe

We rename the columns so they are easier to understand.

TF-IDF and Logistic Regression

TF-IDF

Term Frequency-Inverse Document Frequency, or TF-IDF, is a method for calculating the weight of each word in a document. It computes a Term Frequency (TF) and an Inverse Document Frequency (IDF) value for each token (word) in each document in the corpus. In simple terms, TF-IDF measures how important a word is to a document relative to the rest of the corpus: words that are frequent in one document but rare across documents receive high weights.

Logistic Regression

Logistic Regression is a statistical analysis method for describing the relationship between a response variable (dependent variable) that has two or more categories and one or more explanatory variables (independent variables) on a categorical or interval scale.

Split Data

Split the data into train and test sets.
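A minimal sketch, assuming the processed text and a label column named topic (placeholder names):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df["content"], df["topic"], test_size=0.2, random_state=42, stratify=df["topic"]
)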

Create TF-IDF

Convert the text into vectors using scikit-learn's TfidfVectorizer. The TF-IDF vectorizer considers the overall weight of a word across the documents: it weights word counts by how common the word is in the corpus, which penalizes words that appear in too many documents.
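A minimal sketch of this step (the feature cap is illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_features=10000)
X_train_tfidf = tfidf.fit_transform(X_train)   # learn vocabulary and idf on the training set only
X_test_tfidf = tfidf.transform(X_test)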

Modeling with Logistic Regression

Model the data using logistic regression and check the results.
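A sketch of the baseline model with scikit-learn:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

clf = LogisticRegression(max_iter=1000)   # raise max_iter so the solver converges on sparse TF-IDF features
clf.fit(X_train_tfidf, y_train)
print(classification_report(y_test, clf.predict(X_test_tfidf)))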

TF-IDF & Logistic Regression result

FastText and LSTM

FastText

FastText is a form of word embedding that improves on word2vec. FastText represents each word as a bag of character n-grams. For example, with n = 3 the word "apple" is represented by the n-grams <ap, app, ppl, ple, le>, where the angle brackets mark the beginning and end of the word. This makes FastText better at handling words with suffixes and prefixes, and it works well with rare words: even if a word was not seen during training, it can be split into n-grams to obtain its embedding.

The skip-gram model learns to predict a target word from a nearby word, while the CBOW model predicts the target word from its whole context. Take the sentence "A child is surprised to see a big snake" and suppose we want to predict the word "see". Skip-gram predicts the target from a single nearby word such as "surprised" or "big", whereas CBOW takes all the words in the window ("surprised", "to", "a", "big") and uses the sum of their vectors to predict the target, treating the context as a bag of words in a fixed-size window around the target word.

LSTM

Long Short-Term Memory (LSTM) is a variant of the Recurrent Neural Network (RNN). LSTM was introduced because it can remember long-term information (long-term dependencies). LSTM replaces the hidden-layer nodes of an RNN with LSTM cells, which store previous information. In an LSTM, three gates control how previous text information is used and updated: the input gate, the forget gate, and the output gate. The memory cell and the three gates are designed to read, store, and update past information.

Word Embedding using FastText SkipGram

Break the document into tokens or words and put them in the “value_tokenize” column.

from nltk.tokenize import RegexpTokenizer

tokenizer_data = RegexpTokenizer(r'\w+')
df_processed['value_tokenize'] = df_processed['review_formal_processed'].map(tokenizer_data.tokenize)
value = df_processed["value_tokenize"]

Fit FastText on the tokenized data. The main parameters are described below, followed by a training sketch:

  • min_count: the minimum word frequency to be included in the FastText vocabulary; words that appear only once are excluded.
  • size: the dimensionality of the word vectors; in the latest gensim versions this parameter is called vector_size.
  • window: the maximum distance between the current and the predicted word within a sentence.
  • workers: the number of worker threads, relevant when training on multicore machines.
  • sg: determines whether we use skip-gram or CBOW. By default FastText uses CBOW; sg=1 selects skip-gram.
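A sketch of the training call, assuming gensim 4.x; the hyperparameter values are illustrative, apart from the vector size of 100 implied by the model file name used later:

from gensim.models import FastText

sg_ft_model = FastText(
    sentences=value,   # tokenized reviews from the "value_tokenize" column
    vector_size=100,   # "size" in older gensim versions
    window=5,
    min_count=2,       # ignore words that appear only once
    workers=4,
    sg=1,              # 1 = skip-gram, 0 = CBOW (the default)
)
sg_ft_model.save("fasttext_sg_100.model")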

Check the Word Embedding

Use .most_similar() to check which words are most similar to a given word.

sg_ft_model.wv.most_similar('kirim')

Note: If you have a pre-trained word embedding, you can use this code.

sg_ft_model = FastText.load('fasttext_sg_100.model')

Get the Size of Word Embedding

Machines cannot process words directly; the words need to be converted into vector form first. Take the weights and vectors using the following code.
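The code itself was not included in the extract; a sketch of pulling the vocabulary and weights out of the trained gensim model (gensim 4.x attribute names) could look like this:

import numpy as np

embedding_dim = sg_ft_model.wv.vector_size    # e.g. 100
vocab = sg_ft_model.wv.index_to_key           # words known to the embedding
embedding_weights = np.array([sg_ft_model.wv[word] for word in vocab])
print(len(vocab), embedding_weights.shape)    # vocabulary size and (vocab_size, embedding_dim)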

Modeling with LSTM

  1. Tokenizer vectorizes a text corpus by turning each text into a sequence of integers (each integer is the index of a token in a dictionary). Keep only the 10,000 most frequent words. A sketch of this step follows the bullets below.
  • fit_on_texts updates the internal vocabulary based on a list of texts. It creates the vocabulary index based on word frequency, starting from 1 (0 is reserved for padding).
  • texts_to_sequences converts each text in the corpus to a sequence of integers.
  • pad_sequences transforms a list (of length num_samples) of sequences (lists of integers) into a 2D NumPy array of shape (num_samples, num_timesteps), so that all sequences have the same length.
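A sketch of step 1, assuming tf.keras and an illustrative maximum sequence length of 100:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

MAX_WORDS, MAX_LEN = 10000, 100   # vocabulary cap from the text above; MAX_LEN is illustrative

tokenizer = Tokenizer(num_words=MAX_WORDS)
tokenizer.fit_on_texts(df_processed["review_formal_processed"])
sequences = tokenizer.texts_to_sequences(df_processed["review_formal_processed"])
X = pad_sequences(sequences, maxlen=MAX_LEN)   # shape: (num_samples, MAX_LEN)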

2. Change the target to one-hot encoded form, since this is a multiclass classification problem.
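For step 2, Keras' to_categorical does the one-hot conversion (assuming the labels have already been encoded as integers in y):

from tensorflow.keras.utils import to_categorical

y_onehot = to_categorical(y)   # shape: (num_samples, num_classes)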

3. Split the data

4. Build the model. The model used has the following (a sketch follows the list):

  • input_dim: This is the size of the vocabulary in the text data.
  • output_dim: This is the size of the vector space in which words will be embedded.
  • input_length: This is the length of input sequences.
  • weights: list of NumPy arrays to set as initial weights.
  • ReLU is fast to compute in the hidden layers, and softmax is used in the output layer for multiclass classification.
  • We use dropout to avoid overfitting.
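A sketch of step 4, assuming the variables from the earlier sketches; the layer sizes are illustrative, not the exact configuration used:

import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dropout, Dense

# Row i of the embedding matrix holds the FastText vector of the word with tokenizer index i
embedding_matrix = np.zeros((MAX_WORDS, embedding_dim))
for word, i in tokenizer.word_index.items():
    if i < MAX_WORDS:
        embedding_matrix[i] = sg_ft_model.wv[word]   # FastText can embed unseen words via n-grams

model = Sequential([
    Embedding(input_dim=MAX_WORDS, output_dim=embedding_dim,
              input_length=MAX_LEN, weights=[embedding_matrix]),
    LSTM(64),
    Dropout(0.5),
    Dense(64, activation="relu"),
    Dense(y_onehot.shape[1], activation="softmax"),   # one output per class
])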

5. Train the model

Train with callbacks so that training stops early once the model stops improving, rather than always running for the maximum number of epochs.

Use the Adam optimizer. Adam is an optimization algorithm that iteratively updates the network weights based on the training data.
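A sketch of step 5, with EarlyStopping as one common callback choice; X_train, y_train, X_val, and y_val are assumed to come from step 3, and the epoch count, batch size, and patience are illustrative:

from tensorflow.keras.callbacks import EarlyStopping

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

early_stop = EarlyStopping(monitor="val_loss", patience=3, restore_best_weights=True)
history = model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),
    epochs=20, batch_size=128,
    callbacks=[early_stop],
)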

Check the Model

FastText and LSTM result

Conclusion

You can see that FastText word embeddings with an LSTM achieve a better F1-score than TF-IDF with Logistic Regression. We use the F1-score since it is a good metric when the data are imbalanced.

What improvements can be made?

If you have time to do more experiments, you can consider the following:
1. Use a bidirectional LSTM
2. Set the embedding layer's trainable parameter to False; freezing the pre-trained embeddings reduces the possibility of overfitting

Save Model

Save the model and the tokenizer; to load the model later, you need the same tokenizer so that the input size and word indices match.
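A sketch of saving both artifacts (the file names are placeholders):

import pickle

model.save("lstm_fasttext_model.h5")        # hypothetical file name
with open("tokenizer.pickle", "wb") as f:   # the fitted Keras tokenizer
    pickle.dump(tokenizer, f)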

In the next article, I will explain how to make predictions and then save the results in CSV form.

Reference:

  1. https://dltsierra.medium.com/algoritma-tf-idf-633e17d10a80
  2. Hendayana, R. (2013) ‘Application Method of Logistic Regression Analyze the Agricultural Technology Adoption’, Informatika Pertanian, 22(1), pp. 1–9. Available at: http://ejurnal.litbang.pertanian.go.id/index.php/IP/article/view/2271/1970.
  3. https://blogs.sap.com/2019/07/03/glove-and-fasttext-two-popular-word-vector-models-in-nlp/#:~:text=fastText%20is%20another%20word%20embedding,an%20n%2Dgram%20of%20characters.&text=This%20helps%20capture%20the%20meaning,to%20understand%20suffixes%20and%20prefixes.
  4. https://fasttext.cc/docs/en/unsupervised-tutorial.html
  5. S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

Here is my medium post which is similar to this post:

Thank you for reading!
