Diving into Indonesian Skincare Reviews — Part 2: Topic Modelling using BERTopic and Llama 2

Syifa Addini
7 min read · May 27, 2024


So, in the first post analyzing the 4,387 reviews of the 5X Ceramide Barrier Repair Moisture Gel Moisturizer by Skintific, we used a WordCloud to surface the most frequently mentioned words.

Positive reviews WordCloud

For positive words, we got:

  • moisturizer/moist/melembabkan/pelembab (moisturizing/moisturizer)
  • tekstur (texture)
  • barrier
  • wangi (fragrant, about the smell)
  • calming
  • hidrasi (hydration)
Negative reviews WordCloud

While for negative reviews, we got:

  • jerawat (acne)
  • bruntusan (tiny bumps)
  • kering (dry)
  • minyak/oily (oily)

Lately, I have also been learning about BERTopic, a topic modelling method. What is topic modelling?

Topic modeling is a technique that involves assigning topics to a given corpus of text based on the words that are present within it. It is a powerful technique that helps in uncovering the hidden themes or topics present in a large corpus of unstructured text data. (source: https://medium.com/@ananyajoshi20)

So, let's say that in the negative reviews people report bad effects such as acne, dry skin, and oily skin. But is that all? Is there more to unveil?

The data I use is a CSV file consisting of three columns: ID, Sentiment, and Review.

import pandas as pd
import requests
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the NLTK resources used below (only needed once)
nltk.download('punkt')
nltk.download('stopwords')


# Load the dataset
df = pd.read_csv('reviews_skintific_5xceramide - Copy.csv', encoding="ISO-8859-1")

# .copy() so the topic columns added later don't trigger SettingWithCopyWarning
df_neg = df[df['Sentiment'] == 'negative'].copy()
df_pos = df[df['Sentiment'] == 'positive'].copy()

df_neg.head()

Then, I constructed custom Indonesian stopwords. Part of it is inspired by this story.

# CONSTRUCT STOPWORDS
rama_stopword = "https://raw.githubusercontent.com/ramaprakoso/analisis-sentimen/master/kamus/stopword.txt"
yutomo_stopword = "https://raw.githubusercontent.com/yasirutomo/python-sentianalysis-id/master/data/feature_list/stopwordsID.txt"
fpmipa_stopword = "https://raw.githubusercontent.com/onlyphantom/elangdev/master/elang/word2vec/utils/stopwords-list/fpmipa-stopwords.txt"
sastrawi_stopword = "https://raw.githubusercontent.com/onlyphantom/elangdev/master/elang/word2vec/utils/stopwords-list/sastrawi-stopwords.txt"
aliakbar_stopword = "https://raw.githubusercontent.com/onlyphantom/elangdev/master/elang/word2vec/utils/stopwords-list/aliakbars-bilp.txt"
pebahasa_stopword = "https://raw.githubusercontent.com/onlyphantom/elangdev/master/elang/word2vec/utils/stopwords-list/pebbie-pebahasa.txt"
elang_stopword = "https://raw.githubusercontent.com/onlyphantom/elangdev/master/elang/word2vec/utils/stopwords-id.txt"
nltk_stopword = stopwords.words('indonesian')

# collect the URL for each stopword list
path_stopwords = [rama_stopword, yutomo_stopword, fpmipa_stopword, sastrawi_stopword,
                  aliakbar_stopword, pebahasa_stopword, elang_stopword]

# combine stopwords
stopwords_l = nltk_stopword
for path in path_stopwords:
    response = requests.get(path)
    stopwords_l += response.text.split('\n')

# Remove specific words from the stopword list so they are KEPT in the reviews:
# negations like "tidak"/"no" carry sentiment and should not be filtered out
specific_stopwords = {"buat", "membuat", "tidak", "no"}
stopwords_l = [word for word in stopwords_l if word not in specific_stopwords]

custom_st = '''
yg yang dgn ane smpai bgt gua gwa si tu ama utk udh btw gt
ntar lol ttg emg aj aja tll sy sih kalo nya trsa mnrt nih
ma dr ajaa tp akan bs bikin kta pas pdahl bnyak guys tnx
bang ...........
'''

# deduplicate by converting to sets
st_words = set(stopwords_l)
custom_stopword = set(custom_st.split())

# final stopword set: union of the downloaded lists and the custom slang words
custom_stop_words = st_words | custom_stopword
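As a sanity check, the combine-then-filter logic above can be exercised on a toy example. The lists below are stand-ins for the downloaded files, not their real contents:

```python
# Toy stand-ins for two downloaded stopword lists
list_a = ["yang", "dan", "tidak"]
list_b = ["dan", "bgt", "aja"]

combined = list_a + list_b

# Keep negations in the review text by dropping them from the stopword list,
# since they carry sentiment ("tidak bagus" = "not good")
keep_in_text = {"tidak"}
combined = [w for w in combined if w not in keep_in_text]

# Deduplicate via sets, then union with extra slang stopwords
final_stopwords = set(combined) | {"gt", "ntar"}
print(sorted(final_stopwords))
# ['aja', 'bgt', 'dan', 'gt', 'ntar', 'yang']
```

Note that "tidak" no longer appears in the final set, so negations survive preprocessing.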

After that, I pre-processed the reviews until they were tokenized.

# Function to preprocess Indonesian text
def preprocess_text(text):
    # Tokenize the text using NLTK
    tokens = word_tokenize(text)
    # Remove punctuation
    tokens = [word for word in tokens if word.isalnum()]
    # Remove stopwords
    tokens = [word for word in tokens if word.lower() not in custom_stop_words]
    # Remove the 'nya' suffix at the end of words (e.g. 'teksturnya' -> 'tekstur')
    tokens = [word[:-3] if word.endswith('nya') else word for word in tokens]
    return " ".join(tokens)


# Preprocess negative reviews
preprocessed_reviews_neg = [preprocess_text(review) for review in df_neg.review.values]

# Preprocess positive reviews
preprocessed_reviews_pos = [preprocess_text(review) for review in df_pos.review.values]

print('preprocess done')
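To see what preprocess_text does without downloading the NLTK models or the stopword files, here is a self-contained miniature of the same steps, using a whitespace split instead of word_tokenize and a toy stopword set:

```python
# Toy stopword set standing in for custom_stop_words
toy_stopwords = {"yang", "dan", "banget"}

def preprocess_toy(text):
    tokens = text.split()                                 # word_tokenize in the real pipeline
    tokens = [w for w in tokens if w.isalnum()]           # drop punctuation tokens
    tokens = [w for w in tokens if w.lower() not in toy_stopwords]
    tokens = [w[:-3] if w.endswith("nya") else w for w in tokens]  # strip 'nya' suffix
    return " ".join(tokens)

print(preprocess_toy("teksturnya ringan dan wanginya enak banget"))
# tekstur ringan wangi enak
```

The input means "the texture is light and the scent is really nice"; only the content words survive, with their 'nya' suffixes stripped.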

As a reminder, this is what a tokenized sentence looks like.

Tokenized sentence

Finally, we come to training on the preprocessed reviews to get the topics. The code below is inspired by this story and this story.

from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic
from bertopic.representation import KeyBERTInspired, MaximalMarginalRelevance

main_representation_model = KeyBERTInspired()
aspect_representation_model = [KeyBERTInspired(top_n_words=30),
                               MaximalMarginalRelevance(diversity=.5)]

representation_model = {
    "Main": main_representation_model,
    "Aspect": aspect_representation_model
}

# Ignore words that appear in fewer than 5 reviews
vectorizer_model = CountVectorizer(min_df=5)

# Initialize BERTopic models for negative and positive reviews
topic_model_neg = BERTopic(nr_topics=10,
                           vectorizer_model=vectorizer_model,
                           representation_model=representation_model)
topic_model_pos = BERTopic(nr_topics=10,
                           vectorizer_model=vectorizer_model,
                           representation_model=representation_model)

print('Training topic model for negative reviews...')
topics_neg, ini_probs_neg = topic_model_neg.fit_transform(preprocessed_reviews_neg)

print('Training topic model for positive reviews...')
topics_pos, ini_probs_pos = topic_model_pos.fit_transform(preprocessed_reviews_pos)

df_neg['topic'] = topics_neg
df_neg['topic_prob'] = ini_probs_neg
df_pos['topic'] = topics_pos
df_pos['topic_prob'] = ini_probs_pos

topics_info_neg = topic_model_neg.get_topic_info()
topics_info_pos = topic_model_pos.get_topic_info()


# Obtain the number of topics for negative and positive reviews
num_topics_neg = len(set(topics_neg))
num_topics_pos = len(set(topics_pos))

print(f'Number of topics for negative reviews: {num_topics_neg}')
print(f'Number of topics for positive reviews: {num_topics_pos}')
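One thing to keep in mind with this count: len(set(topics)) includes the -1 outlier bucket, so a model with 5 real topics reports 6 here. A toy illustration with made-up topic assignments:

```python
# Toy topic assignments, shaped like what fit_transform returns
topics_demo = [-1, 0, 0, 1, 2, -1, 3, 4]

print(len(set(topics_demo)))           # 6 -- counts the -1 outlier bucket
print(len(set(topics_demo) - {-1}))    # 5 real topics
```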

I limited the number of topics to 10 for both negative and positive reviews. When I ran it with nr_topics = 'auto', I got around 30 to 40 topics, and the differences between the resulting topics were not that large.

So, this is the result I got:

Negative Reviews

First, let’s take a look at the result of Negative Reviews topics.

Topics for Negative Reviews

The model produced 5 topics, labelled Topic 0 through Topic 4. Topic -1 is the outlier bucket for reviews that BERTopic could not fit into any of the other topic groups.

At a glance, we can see that all the topics mention "sayang jerawatan" (roughly, "a shame it causes breakouts"), which reinforces what the WordCloud told us. However, from these topics we can draw more context. For example:

Topic 0 is about consumers trying the product because of its virality: it does moisturize, but then unfortunately causes breakouts/acne.

The topics look similar but differ subtly in the point of view from which the reviews are written. Below is the Intertopic Distance Map of the topics (BERTopic's visualize_topics()). The map confirms that the 5 topics are distinct.

Intertopic Distance Map

Now, let's give each topic a short label.

  • Topic 0: Viral but
  • Topic 1: Pricey but
  • Topic 2: Tried travel size but then

However, I found that Topic 3 and Topic 4 are very similar. Let's take a look at their representative docs (reviews).

Representative Negative Reviews of Topic 3
Representative Negative Reviews of Topic 4

In my opinion, those two topics share the same broad, general angle: the product was good at first but then caused breakouts, with no additional point of view such as buying it because it went viral. So, I merged these two topics (BERTopic's merge_topics method does this).

At first I was puzzled that the merged topics became Topic 0, the original Topic 0 became Topic 1, the original Topic 1 became Topic 2, and the original Topic 2 became Topic 3. This is most likely because BERTopic re-sorts topic IDs by size after merging, and the merged topic is now the largest.

So, the labels became:

  • Topic 0 (the merged topics): It was giving moist but
  • Topic 1: Viral but
  • Topic 2: Pricey but
  • Topic 3: Tried travel size but

In conclusion, with BERTopic we can gather more points of view on the disappointment than the WordCloud alone gave us.

Positive Reviews

Now, let’s take a look at the positive reviews topics.

Topics for Positive Reviews

For these positive topics, I tried using Llama 2 to help with the short labelling, drawing a clearer conclusion from the words in each topic. I was also curious whether it can handle Indonesian.

Short Labelling with Llama 2
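I won't reproduce the full Llama 2 setup here, but the core of the approach is feeding each topic's top words into a labelling prompt (BERTopic also ships a TextGeneration representation that automates this). A minimal prompt builder, with hypothetical topic keywords:

```python
def build_label_prompt(topic_words):
    """Turn a topic's top keywords into a short-label request for the LLM."""
    keywords = ", ".join(topic_words)
    return (f"I have a topic described by these keywords: {keywords}. "
            "Give this topic a short label of at most five words.")

# Hypothetical top words for one positive-review topic
print(build_label_prompt(["melembabkan", "barrier", "glowing", "cocok"]))
```

The returned string is what gets sent to the model for each topic in turn.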

In my opinion, looking at the result above, it surprisingly makes sense, even though it is not perfect. I try to smooth out those labels below:

  • Topics 0 and 3, merged: Good product that moisturizes and brightens
  • Topics 1 and 2, merged: Moisturizer for sensitive, oily, and acne-prone skin
  • Topic 4: Viral but works
  • Topic 5: Loving the travel size
  • Topic 6: Product that repairs the skin barrier
  • Topic 7: Would love a pump packaging or a jumbo jar
  • Topic 8: Love the spatula added to the packaging

So, for these positive reviews, we can see not only the general praise of the moisturizing; the topics also give additional points of view, such as many customers liking the travel size and wanting another type of packaging.

Thank you for following this review analysis project. See you in the next one!

--


Syifa Addini

Currently a caregiver/caretaker. Marketer with CRM specialty, learning mainly text analytics. Open for commission/remote/freelance. Contact me through LinkedIn~