Topic Modeling in R With tidytext and textmineR Package (Latent Dirichlet Allocation)

Joe Christian · Published in The Startup · Aug 2, 2020

In this article, we will learn how to do topic modeling with the tidytext and textmineR packages using the Latent Dirichlet Allocation (LDA) algorithm.

Natural Language Processing covers a wide range of knowledge and applications, and one of them is the topic model. A topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. Topic modelling is a frequently used text-mining tool for the discovery of hidden semantic structures in a text body. For example, “dog”, “bone”, and “obedient” will appear more often in documents about dogs, while “cute”, “evil”, and “homeowner” will appear more often in documents about cats. The “topics” produced by topic modelling techniques are clusters of similar words. A topic model captures this intuition in a mathematical framework, which allows us to examine a set of documents and discover, based on the statistics of the words in each, what the topics might be and what each document’s balance of topics is.

Goal of topic modeling

Background

What is topic modeling? Topic modeling is how the machine collects groups of words within a document to build “topics”, where each topic contains a group of words with similar dependencies. With topic modeling (also called topic models; the terms are interchangeable) we can organize, understand, and summarize large collections of textual information. It helps in:

  • Discovering hidden topical patterns that are present across the collection
  • Annotating documents according to these topics
  • Using these annotations to organize, search and summarize texts

From a business perspective, topic modeling’s power to discover hidden topics can help an organization better understand its customer feedback, so it can concentrate on the issues customers are facing. It can also summarize text from company meetings; a high-quality meeting document enables users to recall the meeting content efficiently. Topic tracking and detection can also be used to build a recommender system.

There are many techniques used to obtain topic models, namely: Latent Dirichlet Allocation (LDA), Latent Semantic Analysis (LSA), Correlated Topic Models (CTM), and TextRank. In this study, we will focus on implementing the LDA algorithm to build topic models with the tidytext and textmineR packages. Beyond building the model, we will also evaluate its goodness of fit using metrics such as R-squared and log-likelihood, and measure the quality of individual topics with metrics such as coherence and prevalence.

Load these libraries on your working machine:

# data wrangling
library(dplyr)
library(tidyr)
library(lubridate)
# visualization
library(ggplot2)
# dealing with text
library(textclean)
library(tm)
library(SnowballC)
library(stringr)
# topic model
library(tidytext)
library(topicmodels)
library(textmineR)

Topic Model

From the introduction above we know that there are several ways to build a topic model. In this study, we will use the LDA algorithm. LDA is a mathematical model that finds the mixture of words associated with each topic and also determines the mixture of topics that describes each document. LDA rests on the following principles of topic modeling:

  • Every document is a mixture of topics. We imagine that each document may contain words from several topics in particular proportions. For example, in a two-topic model we could say “Document 1 is 90% topic A and 10% topic B, while Document 2 is 30% topic A and 70% topic B.” This is symbolized as θ (theta)
  • Every topic is a mixture of words. For example, we could imagine a two-topic model of American news, with one topic for “politics” and one for “entertainment.” The most common words in the politics topic might be “President”, “Congress”, and “government”, while the entertainment topic may be made up of words such as “movies”, “television”, and “actor”. Importantly, words can be shared between topics; a word like “budget” might appear in both equally. This is symbolized as φ (phi)

We will use two approaches: tidytext together with the topicmodels package, and textmineR. The tidytext workflow builds a topic model easily and provides a method for extracting the per-topic-per-word probabilities, called β (“beta”), from the model. However, it does not provide metrics for the goodness of fit of the model the way textmineR does.

Latent Dirichlet Allocation (LDA)

LDA is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s presence is attributable to one of the document’s topics. Plate Notation (picture below) is a concise way of visually representing the dependencies among the model parameters.

LDA Plate Notation
  • M denotes the number of documents
  • N is the number of words in a given document
  • α is the parameter of the Dirichlet prior on the per-document topic distributions. A high α indicates that each document is likely to contain a mixture of most of the topics (not just one or two); a low α indicates each document will likely contain just a few topics
  • β is the parameter of the Dirichlet prior on the per-topic word distribution. A high β indicates that each topic will contain a mixture of most of the words; a low β indicates each topic contains only a small mixture of words
  • θm is the topic distribution for document m
  • zmn is the topic for the n-th word in document m
  • wmn is the specific word

LDA is a generative process. LDA assumes that new documents are created in the following way (a toy simulation is sketched after this list):
1. Determine the number of words in the document
2. Choose a topic mixture for the document over a fixed set of topics (example: 20% topic A, 50% topic B, 30% topic C)
3. Generate the words in the document by:
- picking a topic based on the document’s multinomial distribution (zm,n ∼ Multinomial(θm))
- picking a word based on that topic’s multinomial distribution (wm,n ∼ Multinomial(φzm,n)), where φzm,n is the word distribution for topic zm,n
4. Repeat the process until the distribution of words over topics in the document matches the topic mixture chosen in step 2
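
To make the generative story concrete, here is a minimal toy simulation in R. This is not part of the original analysis; the vocabulary, number of topics, and prior values below are invented purely for illustration.

set.seed(1502)
vocab <- c("guitar", "string", "sound", "price", "cheap", "great") # toy vocabulary (made up)
k     <- 2                                   # number of topics
alpha <- rep(0.1, k)                         # prior for topics over documents
beta  <- rep(0.05, length(vocab))            # prior for words over topics

# helper: draw one sample from a Dirichlet distribution
rdirichlet <- function(a) { g <- rgamma(length(a), shape = a); g / sum(g) }

phi   <- t(replicate(k, rdirichlet(beta)))   # k x V matrix: word distribution per topic
theta <- rdirichlet(alpha)                   # topic mixture for one document (step 2)

n_words <- 8                                 # step 1: length of the document
doc <- sapply(seq_len(n_words), function(i) {
  z <- sample(k, 1, prob = theta)            # step 3a: pick a topic for this word
  sample(vocab, 1, prob = phi[z, ])          # step 3b: pick a word from that topic
})
doc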

Data Import & Objectives

The data is from this Kaggle dataset. It contains customers' feedback on Amazon musical instruments, with each row representing one review from one user. There are several columns, but we only need reviewText, which contains the text of the review; overall, the product rating from 1-5 given by the user; and reviewTime, which contains the time the review was given.

# data import and preparation
data <- read.csv("Musical_instruments_reviews.csv")
data <- data %>%
  mutate(overall = as.factor(overall),
         reviewTime = str_replace_all(reviewTime, pattern = " ", replacement = "-"),
         reviewTime = str_replace(reviewTime, pattern = ",", replacement = ""),
         reviewTime = mdy(reviewTime)) %>%
  select(reviewText, overall, reviewTime)
head(data)

The objective of this project is to discover what users are talking about at each rating level. This will help the organization better understand its customer feedback so it can concentrate on the issues customers are facing.

Tidytext

Text cleaning process

Before we feed the text to the LDA model, we need to clean it. We are going to build a textcleaner function using several functions from the tm, textclean, and stringr packages. We also need to convert the text to Document Term Matrix (DTM) format, because the LDA() function from the topicmodels package needs a dtm as input.

# build textcleaner function
textcleaner <- function(x){
  x <- as.character(x)
  
  x <- x %>%
    str_to_lower() %>% # convert all strings to lower case
    replace_contraction() %>% # replace contractions with their multi-word forms
    replace_internet_slang() %>% # replace internet slang with normal words
    replace_emoji() %>% # replace emoji with words
    replace_emoticon() %>% # replace emoticons with words
    replace_hash(replacement = "") %>% # remove hashtags
    replace_word_elongation() %>% # replace informal writing with known semantic replacements
    replace_number(remove = T) %>% # remove numbers
    replace_date(replacement = "") %>% # remove dates
    replace_time(replacement = "") %>% # remove times
    str_remove_all(pattern = "[[:punct:]]") %>% # remove punctuation
    str_remove_all(pattern = "[^\\s]*[0-9][^\\s]*") %>% # remove strings mixed with numbers
    str_squish() %>% # reduce repeated whitespace inside a string
    str_trim() # remove whitespace from start and end of string
  
  # build a corpus and remove English stopwords
  xdtm <- VCorpus(VectorSource(x)) %>%
    tm_map(removeWords, stopwords("en"))
  
  # convert corpus to document term matrix
  return(DocumentTermMatrix(xdtm))
}

Because we want to know the topics for each rating, we split the data by rating.

data_1 <- data %>% filter(overall == 1)
data_2 <- data %>% filter(overall == 2)
data_3 <- data %>% filter(overall == 3)
data_4 <- data %>% filter(overall == 4)
data_5 <- data %>% filter(overall == 5)
table(data$overall)
>
##
##   1   2   3   4   5
##  14  21  77 245 735

From the table above we know that most of the feedback has the highest rating. Because the distributions differ, each rating will get a different treatment, especially when choosing the term frequency threshold. I will make sure we use at least 700–1000 terms for each rating.
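
As an optional helper (not in the original post), the frequency threshold passed to findFreqTerms() in the blocks below could also be picked programmatically so that roughly a target number of terms survives. This sketch assumes the slam package, which tm depends on, is available:

# pick the frequency threshold that keeps roughly `target_terms` terms in the dtm
pick_freq_threshold <- function(dtm, target_terms = 1000) {
  freqs <- sort(slam::col_sums(dtm), decreasing = TRUE) # total frequency of each term
  if (length(freqs) <= target_terms) return(min(freqs))
  unname(freqs[target_terms]) # frequency of the target_terms-th most frequent term
}

# e.g. findFreqTerms(dtm_5, pick_freq_threshold(dtm_5, 1000))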

Topic Modeling rating 5

# apply textcleaner function to the review text
dtm_5 <- textcleaner(data_5$reviewText)
# find the most frequent terms. I choose words that appear at least 50 times
freqterm_5 <- findFreqTerms(dtm_5, 50)
# we have 981 words. subset the dtm to keep only those selected words
dtm_5 <- dtm_5[,freqterm_5]
# keep only documents (rows) that contain at least one of the selected terms
rownum_5 <- apply(dtm_5, 1, sum)
dtm_5 <- dtm_5[rownum_5 > 0,]
# apply the LDA function. set k = 6, meaning we want to build 6 topics
lda_5 <- LDA(dtm_5, k = 6, control = list(seed = 1502))
# tidy the model and use beta as the per-topic-per-word probabilities
topic_5 <- tidy(lda_5, matrix = "beta")
# choose the 15 words with the highest beta from each topic
top_terms_5 <- topic_5 %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
# plot the topics and words for easy interpretation
plot_topic_5 <- top_terms_5 %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_x_reordered()
plot_topic_5
Rating 5 topic modeling using tidytext
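
Besides the per-topic-per-word probabilities (beta), tidy() can also extract the per-document-per-topic probabilities (the θ mixture described earlier), which it calls "gamma". A small optional sketch, reusing the lda_5 model above (slice_max() assumes dplyr 1.0.0 or newer):

# per-document-per-topic probabilities (theta), exposed by tidy() as "gamma"
doc_topics_5 <- tidy(lda_5, matrix = "gamma")
# for each review, keep its single most dominant topic
dominant_5 <- doc_topics_5 %>%
  group_by(document) %>%
  slice_max(gamma, n = 1, with_ties = FALSE) %>%
  ungroup()
head(dominant_5)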

Topic Modeling rating 4

dtm_4 <- textcleaner(data_4$reviewText)
freqterm_4 <- findFreqTerms(dtm_4, 20)
dtm_4 <- dtm_4[,freqterm_4]
rownum_4 <- apply(dtm_4, 1, sum)
dtm_4 <- dtm_4[rownum_4 > 0,]
lda_4 <- LDA(dtm_4, k = 6, control = list(seed = 1502))
topic_4 <- tidy(lda_4, matrix = "beta")
top_terms_4 <- topic_4 %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
plot_topic_4 <- top_terms_4 %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_x_reordered()
plot_topic_4
Rating 4 topic modeling using tidytext

Topic Modeling rating 3

dtm_3 <- textcleaner(data_3$reviewText)
freqterm_3 <- findFreqTerms(dtm_3, 10)
dtm_3 <- dtm_3[,freqterm_3]
rownum_3 <- apply(dtm_3, 1, sum)
dtm_3 <- dtm_3[rownum_3 > 0,]
lda_3 <- LDA(dtm_3, k = 6, control = list(seed = 1502))
topic_3 <- tidy(lda_3, matrix = "beta")
top_terms_3 <- topic_3 %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
plot_topic_3 <- top_terms_3 %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_x_reordered()
plot_topic_3
Rating 3 topic modeling using tidytext

Topic Modeling rating 2

dtm_2 <- textcleaner(data_2$reviewText)
freqterm_2 <- findFreqTerms(dtm_2, 5)
dtm_2 <- dtm_2[,freqterm_2]
rownum_2 <- apply(dtm_2, 1, sum)
dtm_2 <- dtm_2[rownum_2 > 0,]
lda_2 <- LDA(dtm_2, k = 6, control = list(seed = 1502))
topic_2 <- tidy(lda_2, matrix = "beta")
top_terms_2 <- topic_2 %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
plot_topic_2 <- top_terms_2 %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_x_reordered()
plot_topic_2
Rating 2 topic modeling using tidytext

Topic Modeling rating 1

dtm_1 <- textcleaner(data_1$reviewText)
freqterm_1 <- findFreqTerms(dtm_1, 5)
dtm_1 <- dtm_1[,freqterm_1]
rownum_1 <- apply(dtm_1, 1, sum)
dtm_1 <- dtm_1[rownum_1 > 0,]
lda_1 <- LDA(dtm_1, k = 6, control = list(seed = 1502))
topic_1 <- tidy(lda_1, matrix = "beta")
top_terms_1 <- topic_1 %>%
  group_by(topic) %>%
  top_n(15, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
plot_topic_1 <- top_terms_1 %>%
  mutate(term = reorder_within(term, beta, topic)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  coord_flip() +
  scale_x_reordered()
plot_topic_1
Rating 1 topic modeling using tidytext

textmineR

Text cleaning process

Just like the previous text cleaning step, we will build a text cleaner function to automate the cleaning process. The difference is that we don't need to convert the text to dtm format ourselves: the textmineR package has its own dtm converter, CreateDtm(). Fitting an LDA model with textmineR requires a dtm made by the CreateDtm() function, which can also set the n-gram size, remove punctuation and stopwords, and handle other simple text cleaning steps.

textcleaner_2 <- function(x){
  x <- as.character(x)
  
  x <- x %>%
    str_to_lower() %>% # convert all strings to lower case
    replace_contraction() %>% # replace contractions with their multi-word forms
    replace_internet_slang() %>% # replace internet slang with normal words
    replace_emoji() %>% # replace emoji with words
    replace_emoticon() %>% # replace emoticons with words
    replace_hash(replacement = "") %>% # remove hashtags
    replace_word_elongation() %>% # replace informal writing with known semantic replacements
    replace_number(remove = T) %>% # remove numbers
    replace_date(replacement = "") %>% # remove dates
    replace_time(replacement = "") %>% # remove times
    str_remove_all(pattern = "[[:punct:]]") %>% # remove punctuation
    str_remove_all(pattern = "[^\\s]*[0-9][^\\s]*") %>% # remove strings mixed with numbers
    str_squish() %>% # reduce repeated whitespace inside a string
    str_trim() # remove whitespace from start and end of string
  
  return(as.data.frame(x))
}

Topic Modeling rating 5

# apply the textcleaner function. note: we only clean the text without converting it to a dtm
clean_5 <- textcleaner_2(data_5$reviewText)
clean_5 <- clean_5 %>% mutate(id = rownames(clean_5))
# create dtm
set.seed(1502)
dtm_r_5 <- CreateDtm(doc_vec = clean_5$x,
                     doc_names = clean_5$id,
                     ngram_window = c(1,2), # use unigrams and bigrams
                     stopword_vec = stopwords("en"),
                     verbose = F)
# keep only terms that appear more than twice in the corpus
dtm_r_5 <- dtm_r_5[,colSums(dtm_r_5)>2]

Now we create the LDA model using `textmineR`. Here we are going to make 20 topics. The reason we build so many topics is that `textmineR` has metrics to calculate the quality of each topic, so we will later choose the topics with the best quality.

set.seed(1502)
mod_lda_5 <- FitLdaModel(dtm = dtm_r_5,
                         k = 20, # number of topics
                         iterations = 500, # Gibbs sampling iterations
                         burnin = 180, # iterations discarded before averaging
                         alpha = 0.1, # Dirichlet prior for topics over documents
                         beta = 0.05, # Dirichlet prior for words over topics
                         optimize_alpha = T, # update alpha during sampling
                         calc_likelihood = T, # log-likelihood at each iteration
                         calc_coherence = T, # probabilistic coherence per topic
                         calc_r2 = T) # R-squared for goodness of fit

Once we have created a model, we need to evaluate it. For overall goodness of fit, textmineR has R-squared and log-likelihood. R-squared is interpretable as the proportion of variability in the data explained by the model, as with linear regression.

mod_lda_5$r2
>
## [1] 0.2183867

The primary goodness-of-fit measure in topic modeling is likelihood. Likelihoods, generally log-likelihoods, are naturally obtained from probabilistic topic models. Here the log_likelihood is P(tokens|topics) at each iteration.

plot(mod_lda_5$log_likelihood,type = "l")
log likelihood for every iteration in rating 5

Next, get the 15 top terms with the highest phi. Phi represents the distribution of words over topics; words with high phi are the most frequent words within a topic.

mod_lda_5$top_terms <- GetTopTerms(phi = mod_lda_5$phi,M = 15)
data.frame(mod_lda_5$top_terms)
top terms in topic rating 5

Let’s see the coherence value for each topic. Topic coherence scores a single topic by measuring the degree of semantic similarity between its high-scoring words. These measurements help distinguish topics that are semantically interpretable from topics that are artifacts of statistical inference. For each pair of words {a, b}, probabilistic coherence calculates P(b|a) − P(b), where {a} is more probable than {b} in the topic. In simple words, coherence tells us how strongly the words in a topic are associated with each other.

mod_lda_5$coherence
>
## t_1 t_2 t_3 t_4 t_5 t_6 t_7
## 0.12140404 0.08349523 0.05510456 0.11607445 0.16397834 0.05472121 0.09739406
## t_8 t_9 t_10 t_11 t_12 t_13 t_14
## 0.14221823 0.24856426 0.79310008 0.28175270 0.10231907 0.58667185 0.05449207
## t_15 t_16 t_17 t_18 t_19 t_20
## 0.09204392 0.10147505 0.07949897 0.04519463 0.13664781 0.21586105
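
The coherence above was calculated during fitting because we set calc_coherence = T. If you want to recompute it from the fitted model, for example with a different number of top words per topic, textmineR exposes CalcProbCoherence(). A small optional sketch:

# recompute probabilistic coherence from phi and the dtm, using the top 10 words per topic
coherence_m10 <- CalcProbCoherence(phi = mod_lda_5$phi, dtm = dtm_r_5, M = 10)
summary(coherence_m10)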

We also want to look at the prevalence value. Prevalence tells us the most frequent topics in the corpus; it is computed here as each topic's share of the total topic probability mass across all documents, expressed as a percentage.

mod_lda_5$prevalence <- colSums(mod_lda_5$theta)/sum(mod_lda_5$theta)*100
mod_lda_5$prevalence
>
## t_1 t_2 t_3 t_4 t_5 t_6 t_7 t_8
## 5.514614 5.296280 4.868778 7.484032 9.360072 2.748069 4.269445 4.195638
## t_9 t_10 t_11 t_12 t_13 t_14 t_15 t_16
## 5.380414 3.541380 5.807442 5.305865 3.243890 4.657203 5.488087 2.738993
## t_17 t_18 t_19 t_20
## 4.821128 4.035630 7.385820 3.857221

Now we have the top terms for each topic, the goodness of fit of the model from R-squared and log-likelihood, and the quality of the topics from coherence and prevalence. Let's compile them into a summary.

mod_lda_5$summary <- data.frame(topic = rownames(mod_lda_5$phi),
                                coherence = round(mod_lda_5$coherence,3),
                                prevalence = round(mod_lda_5$prevalence,3),
                                top_terms = apply(mod_lda_5$top_terms,2,function(x){paste(x,collapse = ", ")}))

modsum_5 <- mod_lda_5$summary %>%
  `rownames<-`(NULL)

We know that the quality of each topic can be described by its coherence and prevalence values. Let's build a plot to identify which topics have the best quality.

modsum_5 %>% pivot_longer(cols = c(coherence,prevalence)) %>%
  ggplot(aes(x = factor(topic,levels = unique(topic)), y = value, group = 1)) +
  geom_point() + geom_line() +
  facet_wrap(~name,scales = "free_y",nrow = 2) +
  theme_minimal() +
  labs(title = "Best topics by coherence and prevalence score",
       subtitle = "Text review with 5 rating",
       x = "Topics", y = "Value")
coherence and prevalence score in rating 5

From the graph above we know that topic 10 has the highest coherence, which means the words in that topic are strongly associated with each other. But in terms of the topic's probability across the whole corpus (prevalence), topic 10 has a low score: reviews rarely use the combination of words in topic 10, even though the words inside that topic support each other.

We can see whether topics can be grouped together using a dendrogram. The dendrogram uses Hellinger distance (the distance between two probability vectors) to decide whether topics are closely related. For instance, the dendrogram below suggests that topics 10 and 13 are more similar to each other than to the rest.

mod_lda_5$linguistic <- CalcHellingerDist(mod_lda_5$phi)
mod_lda_5$hclust <- hclust(as.dist(mod_lda_5$linguistic),"ward.D")
# (optional) append short topic labels to the dendrogram leaves if they have been
# computed beforehand, e.g. with textmineR's LabelTopics()
plot(mod_lda_5$hclust)
cluster dendrogram rating 5

Now that we have built the topic model for rating 5 and interpreted it, let's apply the same steps to every other rating and see the differences in what people are talking about.

I won't copy and paste the process for every rating because it is the same process and would only waste space (a wrapper sketch is given below instead). If you really want to look at it, please visit my publication on my RPubs.
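
For completeness, here is a hedged sketch (not in the original post) of how the textmineR steps above could be wrapped into one function, so that the summaries for the other ratings used in the conclusion (modsum_4, modsum_3, and so on) can be produced the same way modsum_5 was:

# wrap the textmineR pipeline: clean -> dtm -> fit LDA -> summary table
build_modsum <- function(review_text, k = 20, seed = 1502) {
  clean <- textcleaner_2(review_text)
  clean <- clean %>% mutate(id = rownames(clean))
  set.seed(seed)
  dtm <- CreateDtm(doc_vec = clean$x,
                   doc_names = clean$id,
                   ngram_window = c(1, 2),
                   stopword_vec = stopwords("en"),
                   verbose = FALSE)
  dtm <- dtm[, colSums(dtm) > 2]
  set.seed(seed)
  mod <- FitLdaModel(dtm = dtm, k = k, iterations = 500, burnin = 180,
                     alpha = 0.1, beta = 0.05,
                     optimize_alpha = TRUE, calc_likelihood = TRUE,
                     calc_coherence = TRUE, calc_r2 = TRUE)
  mod$top_terms  <- GetTopTerms(phi = mod$phi, M = 15)
  mod$prevalence <- colSums(mod$theta) / sum(mod$theta) * 100
  data.frame(topic      = rownames(mod$phi),
             coherence  = round(mod$coherence, 3),
             prevalence = round(mod$prevalence, 3),
             top_terms  = apply(mod$top_terms, 2,
                                function(x) paste(x, collapse = ", ")),
             row.names  = NULL)
}

# e.g. modsum_4 <- build_modsum(data_4$reviewText)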

Conclusion

We have now gone through the topic modeling process from cleaning the text to interpretation and analysis. Finally, let's see what people are talking about at each rating. We will choose the 5 topics with the highest quality (coherence) for each rating. Each topic will show the 15 words with the highest phi (the distribution of words over topics).

Rating 5

modsum_5 %>% 
arrange(desc(coherence)) %>%
slice(1:5)
top terms in topic ordered by highest coherence (rating 5)

With the highest coherence scores, topics 10 and 13 contain lots of ‘sticking’ and ‘tongue’ words. Maybe it is just a phrase for a specific instrument: the similar words push their coherence scores up, but the low prevalence means they are rarely used in other reviews, which is why I suspect they come from one specific instrument. In topic 11 and the others, people are talking about how good the product is; for example, there are words like ‘good’, ‘accurate’, ‘clean’, ‘easy’, ‘recommend’, and ‘great’ that indicate positive sentiment.

Rating 4

modsum_4 %>% 
arrange(desc(coherence)) %>%
slice(1:5)
top terms in topic ordered by highest coherence (rating 4)

Same as before, the topic with the highest coherence score is filled with sticking and tongue terms. At this rating, people are still praising the product, but not as much as at rating 5. Keep in mind that the dtm is built with bigrams, so two-word terms like solid_state or e_tongue are captured and counted just like single words. With that information, we know that all the words shown here have their own phi values and actually represent the reviews.

Rating 3

modsum_3 %>% 
arrange(desc(coherence)) %>%
slice(1:5)
top terms in topic ordered by highest coherence (rating 3)

Looks like stick and tongue words are everywhere. Topic 15 has high coherence and prevalence values at rating 3, meaning lots of reviews at this rating talk about them. On the other hand, positive words are barely seen at this rating; most of the topics are filled with guitar- or string-related words.

Rating 2

modsum_2 %>% 
arrange(desc(coherence)) %>%
slice(1:5)
top terms in topic ordered by highest coherence (rating 2)

Rating 1

modsum_1 %>% 
arrange(desc(coherence)) %>%
slice(1:5)
top terms in topic ordered by highest coherence (rating 1)

In the worst rating, people complain heavily. Words like ‘junk’, ‘cheap’, ‘just’, and ‘back’ are everywhere. There is a clear difference compared with rating 5.

Overall, keep in mind that this dataset is a combination of many products, so it is not surprising if some topics look like nonsense. But for every rating we were able to build topics separated by instrument, which shows that LDA can build topics from semantically related words. It would be even better to build a topic model for a specific product only, to discover the problems to fix or the strengths to keep. This can surely help an organization better understand its customer feedback so it can concentrate on the issues customers are facing, especially when there are lots of reviews to analyze.

Reference

I have posted the full documentation on my RPubs. You can also use the code from my GitHub; feel free to download it and kindly leave a star if you like it.
