Topic Modeling: Latent Dirichlet Allocation on an Indonesian E-commerce Review Dataset

Why do we need to do topic modeling?

Katarina Nimas Kusumawati
9 min read · Feb 17, 2022

The natural language processing task most commonly applied to reviews is sentiment analysis, which captures the “emotions” of the writer. But finding sentiment alone is not enough. We need to investigate further: what aspects make people satisfied with the application? What aspects make people dislike it? Is it the application itself? Expensive goods, or something else? Because of the large amount of data, it is difficult to find hidden patterns in each document by hand, so a form of unsupervised learning is needed to find these groups.

Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a form of unsupervised learning. LDA looks for hidden groupings in the data; in LDA these groups are called topics. Each document gets a probability distribution over topics, and each topic is a distribution over words, estimated from the frequency of words across documents to determine similarity.

Data Source

For data sources, we use the Google Play Store and App Store APIs, and the collected review data is stored in Google BigQuery. We have reviews from November 2020 to November 2021, a total of 1,708,492 records.

If you want to know how we got the data, kindly check the link below:

Use code along the following lines to get the data from Google BigQuery:
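This is a minimal sketch using the google-cloud-bigquery client; the project, dataset, and table names are placeholders, and the column names follow the fields used later in this post:

from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project id

query = """
    SELECT review_id, review, created_date
    FROM `your-project-id.your_dataset.reviews`
    WHERE created_date BETWEEN '2020-11-01' AND '2021-11-30'
"""
df = client.query(query).to_dataframe()  # requires pandas installed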

The result:

Data from Google BigQuery

Preprocessing

  • Remove duplicate data

Before we go through the modeling, we remove duplicate data. We keep only the most recent review (by created_date) for each review_id (review_id is unique).

  • Case folding

Lowercasing is used so that the machine does not treat identical words as different. For example, the word “shop”: “Shop” and “shop” are the same word, but the machine may perceive them differently because one is capitalized and the other is not.

  • Remove punctuations

The punctuation here has no significant meaning, so it needs to be removed.

  • Stemming

Stemming removes the affixes of a word so that it is reduced to its root. Stemming is not always a perfect way to obtain the root word, but it is quite efficient. Very few libraries can do stemming in Indonesian; one of the best known is Sastrawi.

  • Remove stopwords

Stopwords are words that occur frequently and carry no special meaning, such as conjunctions. Leaving stopwords in will produce meaningless topics.

  • Tokenization

The words in each document are then split into individual tokens.

  • Formalization

Formalization is a step that converts a word into a formal, easily understood form. Here I use formalization to convert brand-name variants into a single word that represents the brand.

  • Delete documents that consist of only one word, because they do not contain a meaningful topic (a sketch of the full pipeline follows this list)
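Here is a minimal sketch of these preprocessing steps, assuming the reviews sit in a pandas DataFrame df with review, review_id, and created_date columns; the brand map in the formalization step is a hypothetical example:

import re
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

stemmer = StemmerFactory().create_stemmer()
stopwords = set(StopWordRemoverFactory().get_stop_words())
formal_map = {"shoppe": "shopee"}  # hypothetical brand-name normalization

def preprocess(text):
    text = text.lower()                               # case folding
    text = re.sub(r"[^\w\s]", " ", text)              # remove punctuation
    text = stemmer.stem(text)                         # stemming with Sastrawi
    tokens = text.split()                             # tokenization
    tokens = [formal_map.get(t, t) for t in tokens]   # formalization
    return [t for t in tokens if t not in stopwords]  # remove stopwords

# keep the most recent review per review_id, then preprocess
df = df.sort_values("created_date").drop_duplicates("review_id", keep="last")
texts = [preprocess(t) for t in df["review"]]
texts = [t for t in texts if len(t) > 1]              # drop one-word documents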

Modeling

Before doing the modeling, first create a dictionary from the words in the documents. The words in the documents will be indexed.

# turn our tokenized documents into an id <-> term dictionary
from gensim import corpora
dictionary = corpora.Dictionary(texts)

Remove terms that:

  1. Appear in fewer than 15 documents
  2. Appear in more than 0.5 × the total number of documents
  3. After (1) and (2), keep only the 100,000 most frequent remaining terms

dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

Turn the dictionary into a bag-of-words corpus. In LDA, each document is treated as a bag of words. The resulting corpus maps each document to (word_id, word_frequency) pairs. For example, (0, 1) means that the word with id 0 appears once in the document.

# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]

The following is sample code for the baseline, with an explanation. The scores of the baseline and of the hyperparameter-tuning experiments are shown together at the end.

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=4)

Before tuning the model, check which number of topics gives the best word distribution. How can we know? Look at the top 3 keywords per topic and check that they do not overlap with the keywords of another topic. Make sure that each topic contains unique keywords.
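One way to do this check (a sketch, not the exact code used here) is to train a model for several topic counts and print the top 3 keywords per topic to eyeball the overlap:

import gensim

for k in range(3, 8):
    model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=k)
    print(f"--- {k} topics ---")
    for topic_id, words in model.show_topics(num_topics=k, num_words=3, formatted=False):
        print(topic_id, [w for w, _ in words])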

Check model

To display the topics from LDA you can use the following code. It returns a list of topics, each represented either as a formatted string (when formatted=True, the default) or as word-probability pairs.

lda_model.show_topics()
Return from show_topics()

To see the weight of a word, you first need to find the position of the term in the dictionary, and then look at its weight, for example as follows.
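For example (a sketch; “belanja” is just one token from the vocabulary), look the word up in dictionary.token2id and then ask the model for its per-topic weights:

word_id = dictionary.token2id['belanja']  # position of the term in the dictionary
print(lda_model.get_term_topics(word_id, minimum_probability=0))  # (topic, weight) pairs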

It will show something like this:

Coherence

Statements are said to be coherent if they support each other; a coherent set of statements can be interpreted in a context that covers all or most of them. Topic coherence calculates the score of a topic by looking at the semantic similarity between the high-scoring words in that topic.

# compute the coherence score
from gensim.models import CoherenceModel

coherence_model_lda = CoherenceModel(model=lda_model, texts=texts, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score:', coherence_lda)

The result

Coherence Score

Perplexity

Perplexity is the inverse probability of the test set, normalized by the number of words; minimizing perplexity is therefore the same as maximizing probability. Perplexity captures how a model handles new data it has not seen before: if a model has low perplexity, i.e., it is not surprised by the incoming data, then it has a good understanding of how the language works.
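In symbols (the standard formulation, not shown in the original post), for a test set of M documents where document d has N_d words:

\text{perplexity}(D_{\text{test}}) = \exp\left(-\frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d}\right)

Note that gensim's log_perplexity returns the per-word likelihood bound rather than the perplexity itself; the perplexity is 2^(-bound), so a higher (less negative) bound corresponds to lower perplexity.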

print('\nPerplexity Score:', lda_model.log_perplexity(corpus))
Perplexity Score

Don’t forget to save the model.

Save The Model

lda_model.save('lda_model')

Labeling Test Data

The test data will then be scored. First, the test data is loaded and then labeled with the LDA model.

Preprocess

The test data is preprocessed using the same steps as the training data. We took a random sample of 1,000 reviews.

Load the Saved Model

The model is loaded and stored in a variable, in this case ldamodel. For the dictionary, use the id2word attribute obtained from the loaded model. After that, the data is labeled: for each document we find the dominant topic, its percentage contribution, and the keywords of that topic.
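A minimal sketch of this step, assuming the preprocessed test documents are in test_texts (a list of token lists):

from gensim.models.ldamodel import LdaModel

ldamodel = LdaModel.load('lda_model')
id2word = ldamodel.id2word

test_corpus = [id2word.doc2bow(tokens) for tokens in test_texts]
for bow in test_corpus:
    topics = ldamodel.get_document_topics(bow)
    dominant_topic, contribution = max(topics, key=lambda t: t[1])
    keywords = ', '.join(w for w, _ in ldamodel.show_topic(dominant_topic, topn=10))
    # collect dominant_topic, contribution, and keywords for each document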

Labeled Test Data

Check the total number of documents for each topic

This will show the visualization and the total number of documents for each topic.
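Assuming the labeled results were collected into a DataFrame df_labeled with a dominant_topic column (hypothetical names), the counts and a quick bar chart can be produced like this:

topic_counts = df_labeled['dominant_topic'].value_counts().sort_index()
print(topic_counts)
topic_counts.plot(kind='bar', title='Documents per topic')  # uses matplotlib via pandas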

Word distribution.

As you can see, the 6-topic and 7-topic models have overlapping keywords, so we chose the 4-topic and 5-topic models to check the score.

Check score

How can we know that we have selected the correct number of topics? Since LDA is unsupervised learning, it is quite tricky to know whether what we are doing is right or not. One thing that can be done is to manually label the test data that has already been labeled by LDA: the output is exported to CSV and then labeled by hand in a Label column, checking whether each document is relevant to the topic’s keywords. For example, suppose there is topic 1 with the keywords “shop”, “send”, “shipment”, and topic 2 with the keywords “application”, “bug”, “slow”. Then there is the following document:

“The shipment is worse”

LDA labeled it as topic 2. Hm, weird, right? With manual labeling, you would give the document topic 1.

Example of Manual Labelling

Use the F1-score, because it is a good metric for imbalanced data.

Why is it also necessary to look at the accuracy for documents whose topic contribution percentage is >= 0.9? Because the model has given a very high score to the document, the document should contain keywords from that topic. Is it possible for the topic contribution to be very large while the document does not correlate at all with the keywords of the topic? It is possible, and we can treat that as an error of the model.
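A sketch of the scoring, assuming the manually labeled CSV has manual_label, lda_topic, and contribution columns (hypothetical names):

import pandas as pd
from sklearn.metrics import f1_score, accuracy_score

df = pd.read_csv('labeled_test.csv')  # hypothetical file name
print('F1 (weighted):', f1_score(df['manual_label'], df['lda_topic'], average='weighted'))

confident = df[df['contribution'] >= 0.9]  # documents the model is very sure about
print('Accuracy @ contribution >= 0.9:', accuracy_score(confident['manual_label'], confident['lda_topic']))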

After obtaining the 4-topic and 5-topic models, take 1,000 randomly sampled reviews and label them using LDA. Sort them by topic percentage contribution (largest to smallest), then manually label the 200 documents with the highest contribution. With 4 topics, that means labeling 50 documents per topic.

The best number of topics turns out to be 4. Next, we tune the model.

Tune the Model

Run the LDA model using LdaMulticore, which uses all CPU cores in parallel and speeds up training. The sample code for the tuned model is shown after the parameter explanations below.

Passes and iterations

We use passes=20 with 100 iterations. Passes is the number of training passes through the corpus. For example, if the training corpus has 500 documents and passes is 5, training runs over:

#1 documents 0–500

#2 documents 0–500

#3 documents 0–500

#4 documents 0–500

#5 documents 0–500

Meanwhile, iterations controls how many times the inference loop runs over each individual document.

Alpha and eta

Alpha: document-topic density

Eta: topic-word density

The smaller the alpha, the less likely documents are to mix many topics; the larger the alpha, the more likely documents are to contain a mixture of many topics.

Eta controls the word distribution per topic. The smaller the eta, the fewer words a topic will concentrate on; the larger the eta, the more words a topic is likely to contain.

lda_model = gensim.models.ldamulticore.LdaMulticore(corpus=corpus, num_topics=4, id2word=dictionary, passes=20, iterations=100, alpha=[0.01]*4, eta=[0.01]*len(dictionary.keys()))

This is the difference before and after tuning.

Finally, we used the model with hyperparameter tuning. The labeled data became the training data for topic classification (check my post below).

This is the keyword that we got:

  • Topic 0: barang, kirim, cepat, murah, belanja, sesuai, harga, pesan, layan, beli
  • Topic 1: ongkir, gratis, belanja, pakai, bintang, promo, voucher, cod, mahal, diskon
  • Topic 2: aplikasi, bayar, tolong, akun, iklan, pakai, tipu, beli, update, unduh
  • Topic 3: belanja, aplikasi, mudah, bantu, online, sukses, rumah, butuh, beli, cari

Based on the paper by Tavakoli et al. (2018), the topics can be categorized as follows:

Topic 0: General comments

  • Helpfulness, rating, conformity with buyer’s expectations

Topic 1: Price

  • Money (worth the money), price-related, additional Cost

Topic 2: Application

  • Updates (comparing to the previous version), update issues, versioning

Topic 3: Other

  • Additional programs needed, praise, company

Recap Score:

Here are the results of the baseline and hyperparameter tuning.

Conclusion:

  • The best model is the one with 4 topics, 100 iterations, passes=20, alpha=[0.01]*4, and eta=[0.01]*len(dictionary.keys())

The next step

Do topic classification

References:

  1. https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0
  2. https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d
  3. https://radimrehurek.com/gensim/models/ldamulticore.html
  4. Tavakoli, M. et al. (2018) ‘Extracting useful software development information from mobile application reviews: A survey of intelligent mining techniques and tools’, Expert Systems with Applications, 113, pp. 186–199. doi: 10.1016/j.eswa.2018.05.037.

Here is my Medium post, which is similar to this post:

Thank you for reading!
