Topic Modeling: Latent Dirichlet Allocation on an Indonesian E-commerce Review Dataset

Why do we need to do topic modeling?

Katarina Nimas Kusumawati
9 min read · Feb 17, 2022

The natural language processing task most commonly applied to reviews is sentiment analysis, which captures the “emotions” of the writer. But finding sentiment alone is not enough. We need to investigate further: what aspects make people satisfied with the application? What aspects make people dislike it? Is it the application itself? Expensive goods, or something else? Because of the large amount of data, it is difficult to find hidden patterns in each document by hand, so a form of unsupervised learning is needed to find these groups.

Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a form of unsupervised learning. LDA looks for hidden groupings in the data; in LDA these groups are called topics. Each document gets a probability distribution over topics, and each topic is a distribution over words, estimated from the frequency of words across documents to determine similarity.

Data Source

For data sources, we use the Google Play Store and App Store APIs, and the collected review data is stored in Google BigQuery. We have reviews from November 2020 to November 2021, a total of 1,708,492 records.

If you want to know how we got the data, kindly check the link below:

Use code along the following lines to get the data from Google BigQuery:
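This is a minimal sketch using the google-cloud-bigquery client; the project, dataset, and table names are placeholders, and the column names follow the fields used later in this post:

from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")  # placeholder project id

query = """
    SELECT review_id, review, created_date
    FROM `your-project-id.your_dataset.reviews`
    WHERE created_date BETWEEN '2020-11-01' AND '2021-11-30'
"""
df = client.query(query).to_dataframe()  # requires pandas installed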

The result:

Data from Google BigQuery

Preprocessing

  • Remove duplicate data

Before we go through the modeling, we remove duplicate data. We keep only the most recent review (by created_date) for each review_id (review_id is unique).

  • Case folding

Lowercasing is used so that the machine does not treat identical words as different. For example, the word “shop”: “Shop” and “shop” are the same word, but the machine may perceive them differently because one is capitalized and the other is not.

  • Remove punctuations

The punctuation here has no significant meaning, so it needs to be removed.

  • Stemming

Stemming removes the affixes of a word so that it is reduced to its root. Stemming is not always a perfect way to obtain the root word, but it is quite efficient. Very few libraries can do stemming in Indonesian; one of the best known is Sastrawi.

  • Remove stopwords

Stopwords are words that occur frequently and carry no special meaning, such as conjunctions. Leaving stopwords in will produce meaningless topics.

  • Tokenization

The words in each document are then split into individual tokens.

  • Formalization

Formalization is a step that converts a word into a formal, easily understood form. Here I use formalization to convert brand-name variants into a single word that represents the brand.

  • Delete documents that consist of only one word, because they do not contain a meaningful topic (a sketch of the full pipeline follows this list)
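Here is a minimal sketch of these preprocessing steps, assuming the reviews sit in a pandas DataFrame df with review, review_id, and created_date columns; the brand map in the formalization step is a hypothetical example:

import re
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory

stemmer = StemmerFactory().create_stemmer()
stopwords = set(StopWordRemoverFactory().get_stop_words())
formal_map = {"shoppe": "shopee"}  # hypothetical brand-name normalization

def preprocess(text):
    text = text.lower()                               # case folding
    text = re.sub(r"[^\w\s]", " ", text)              # remove punctuation
    text = stemmer.stem(text)                         # stemming with Sastrawi
    tokens = text.split()                             # tokenization
    tokens = [formal_map.get(t, t) for t in tokens]   # formalization
    return [t for t in tokens if t not in stopwords]  # remove stopwords

# keep the most recent review per review_id, then preprocess
df = df.sort_values("created_date").drop_duplicates("review_id", keep="last")
texts = [preprocess(t) for t in df["review"]]
texts = [t for t in texts if len(t) > 1]              # drop one-word documents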

Modeling

Before doing the modeling, first create a dictionary from the words in the documents. The words in the documents will be indexed.

# turn our tokenized documents into an id <-> term dictionary
from gensim import corpora
dictionary = corpora.Dictionary(texts)

Remove terms that:

  1. Appear in fewer than 15 documents
  2. Appear in more than 0.5 × the total number of documents
  3. After (1) and (2), keep only the 100,000 most frequent remaining terms

dictionary.filter_extremes(no_below=15, no_above=0.5, keep_n=100000)

Turn the dictionary into a bag-of-words corpus. In LDA, each document is treated as a bag of words. The resulting corpus maps each document to (word_id, word_frequency) pairs. For example, (0, 1) means that the word with id 0 appears once in the document.

# convert tokenized documents into a document-term matrix
corpus = [dictionary.doc2bow(text) for text in texts]

The following is sample code for the baseline, with an explanation. The scores of the baseline and of the hyperparameter-tuning experiments are shown together at the end.

lda_model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=4)

Before tuning the model, check which number of topics gives the best word distribution. How can we know? Look at the top 3 keywords per topic and check that they do not overlap with the keywords of another topic. Make sure that each topic contains unique keywords.
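One way to do this check (a sketch, not the exact code used here) is to train a model for several topic counts and print the top 3 keywords per topic to eyeball the overlap:

import gensim

for k in range(3, 8):
    model = gensim.models.ldamodel.LdaModel(corpus=corpus, id2word=dictionary, num_topics=k)
    print(f"--- {k} topics ---")
    for topic_id, words in model.show_topics(num_topics=k, num_words=3, formatted=False):
        print(topic_id, [w for w, _ in words])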

Check model

To display the topics from LDA you can use the following code. It returns a list of topics, each represented either as a formatted string (when formatted=True, the default) or as word-probability pairs.

lda_model.show_topics()
Return from show_topics()

To see the weight of a word, you first need to find the position of the term in the dictionary, and then look at its weight, for example as follows.
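For example (a sketch; “belanja” is just one token from the vocabulary), look the word up in dictionary.token2id and then ask the model for its per-topic weights:

word_id = dictionary.token2id['belanja']  # position of the term in the dictionary
print(lda_model.get_term_topics(word_id, minimum_probability=0))  # (topic, weight) pairs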

It will show something like this:

Coherence

Statements are said to be coherent if they support each other; a coherent set of statements can be interpreted in a context that covers all or most of them. Topic coherence calculates the score of a topic by looking at the semantic similarity between the high-scoring words in that topic.

# compute the coherence score
from gensim.models import CoherenceModel

coherence_model_lda = CoherenceModel(model=lda_model, texts=texts, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('\nCoherence Score:', coherence_lda)

The result

Coherence Score

Perplexity

Perplexity is the inverse probability of the test set, normalized by the number of words; minimizing perplexity is therefore the same as maximizing probability. Perplexity captures how a model handles new data it has not seen before: if a model has low perplexity, i.e., it is not surprised by the incoming data, then it has a good understanding of how the language works.
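In symbols (the standard formulation, not shown in the original post), for a test set of M documents where document d has N_d words:

\text{perplexity}(D_{\text{test}}) = \exp\left(-\frac{\sum_{d=1}^{M} \log p(\mathbf{w}_d)}{\sum_{d=1}^{M} N_d}\right)

Note that gensim's log_perplexity returns the per-word likelihood bound rather than the perplexity itself; the perplexity is 2^(-bound), so a higher (less negative) bound corresponds to lower perplexity.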

print('\nPerplexity Score:', lda_model.log_perplexity(corpus))
Perplexity Score

Don’t forget to save the model.

Save The Model

lda_model.save('lda_model')

Labeling Test Data

The test data will then be scored. First, the test data is loaded and then labeled with the LDA model.

Preprocess

The test data is preprocessed using the same steps as the training data. We took a random sample of 1,000 reviews.

Load the Saved Model

The model is loaded and stored in a variable, in this case ldamodel. For the dictionary, use the id2word attribute obtained from the loaded model. After that, the data is labeled: for each document we find the dominant topic, its percentage contribution, and the keywords of that topic.
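A minimal sketch of this step, assuming the preprocessed test documents are in test_texts (a list of token lists):

from gensim.models.ldamodel import LdaModel

ldamodel = LdaModel.load('lda_model')
id2word = ldamodel.id2word

test_corpus = [id2word.doc2bow(tokens) for tokens in test_texts]
for bow in test_corpus:
    topics = ldamodel.get_document_topics(bow)
    dominant_topic, contribution = max(topics, key=lambda t: t[1])
    keywords = ', '.join(w for w, _ in ldamodel.show_topic(dominant_topic, topn=10))
    # collect dominant_topic, contribution, and keywords for each document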

Labeled Test Data

Check the total number of documents for each topic

This will show the visualization and the total number of documents for each topic.
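Assuming the labeled results were collected into a DataFrame df_labeled with a dominant_topic column (hypothetical names), the counts and a quick bar chart can be produced like this:

topic_counts = df_labeled['dominant_topic'].value_counts().sort_index()
print(topic_counts)
topic_counts.plot(kind='bar', title='Documents per topic')  # uses matplotlib via pandas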

Word distribution.

As you can see, the 6-topic and 7-topic models have overlapping keywords, so we chose the 4-topic and 5-topic models to check the score.

Check score

How can we know that we have selected the correct number of topics? Since LDA is unsupervised learning, it is quite tricky to know whether what we are doing is right or not. One thing that can be done is to manually label the test data that has already been labeled by LDA: the output is exported to CSV and then labeled by hand in a Label column, checking whether each document is relevant to the topic’s keywords. For example, suppose there is topic 1 with the keywords “shop”, “send”, “shipment”, and topic 2 with the keywords “application”, “bug”, “slow”. Then there is the following document:

“The shipment is worse”

LDA labeled it as topic 2. Hm, weird, right? With manual labeling, you would give the document topic 1.

Example of Manual Labelling

Use the F1-score, because it is a good metric for imbalanced data.

Why is it also necessary to look at the accuracy for documents whose topic contribution percentage is >= 0.9? Because the model has given a very high score to the document, the document should contain keywords from that topic. Is it possible for the topic contribution to be very large while the document does not correlate at all with the keywords of the topic? It is possible, and we can treat that as an error of the model.
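A sketch of the scoring, assuming the manually labeled CSV has manual_label, lda_topic, and contribution columns (hypothetical names):

import pandas as pd
from sklearn.metrics import f1_score, accuracy_score

df = pd.read_csv('labeled_test.csv')  # hypothetical file name
print('F1 (weighted):', f1_score(df['manual_label'], df['lda_topic'], average='weighted'))

confident = df[df['contribution'] >= 0.9]  # documents the model is very sure about
print('Accuracy @ contribution >= 0.9:', accuracy_score(confident['manual_label'], confident['lda_topic']))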

After obtaining the 4-topic and 5-topic models, take 1,000 randomly sampled reviews and label them using LDA. Sort them by topic percentage contribution (largest to smallest), then manually label the 200 documents with the highest contribution. With 4 topics, that means labeling 50 documents per topic.

The best number of topics turns out to be 4. Next, we tune the model.

Tune the Model

Run the LDA model using LdaMulticore, which uses all CPU cores in parallel and speeds up training. The sample code for the tuned model is shown after the parameter explanations below.

Passes and iterations

We use passes=20 with 100 iterations. Passes is the number of training passes through the corpus. For example, if the training corpus has 500 documents and passes is 5, training runs over:

#1 documents 0–500

#2 documents 0–500

#3 documents 0–500

#4 documents 0–500

#5 documents 0–500

Meanwhile, iterations controls how many times the inference loop runs over each individual document.

Alpha and eta

Alpha: document-topic density

Eta: topic-word density

The smaller the alpha, the less likely documents are to mix many topics; the larger the alpha, the more likely documents are to contain a mixture of many topics.

Eta controls the word distribution per topic. The smaller the eta, the fewer words a topic will concentrate on; the larger the eta, the more words a topic is likely to contain.

lda_model = gensim.models.ldamulticore.LdaMulticore(corpus=corpus, num_topics=4, id2word=dictionary, passes=20, iterations=100, alpha=[0.01]*4, eta=[0.01]*len(dictionary.keys()))

This is the difference before and after tuning.

Finally, we used the model with hyperparameter tuning. The labeled data became the training data for topic classification (check my post below).

This is the keyword that we got:

  • Topic 0: barang, kirim, cepat, murah, belanja, sesuai, harga, pesan, layan, beli
  • Topic 1: ongkir, gratis, belanja, pakai, bintang, promo, voucher, cod, mahal, diskon
  • Topic 2: aplikasi, bayar, tolong, akun, iklan, pakai, tipu, beli, update, unduh
  • Topic 3: belanja, aplikasi, mudah, bantu, online, sukses, rumah, butuh, beli, cari

Based on the paper by Tavakoli et al. (2018), the topics can be categorized as follows:

Topic 0: General comments

  • Helpfulness, rating, conformity with buyer’s expectations

Topic 1: Price

  • Money (worth the money), price-related, additional Cost

Topic 2: Application

  • Updates (comparing to the previous version), update issues, versioning

Topic 3: Other

  • Additional programs needed, praise, company

Recap Score:

Here are the results of the baseline and hyperparameter tuning.

Conclusion:

  • The best model is the one with 4 topics, 100 iterations, passes=20, alpha=[0.01]*4, and eta=[0.01]*len(dictionary.keys())

The next step

Do topic classification

References:

  1. https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0
  2. https://medium.com/@lettier/how-does-lda-work-ill-explain-using-emoji-108abf40fa7d
  3. https://radimrehurek.com/gensim/models/ldamulticore.html
  4. Tavakoli, M. et al. (2018) ‘Extracting useful software development information from mobile application reviews: A survey of intelligent mining techniques and tools’, Expert Systems with Applications, 113, pp. 186–199. doi: 10.1016/j.eswa.2018.05.037.

Here is my Medium post, which is similar to this post:

Thank you for reading!
