Latent Dirichlet Allocation

Harsh Bansal · Published in Analytics Vidhya · Mar 3, 2020

Latent Dirichlet Allocation (LDA) is a topic modeling technique that categorizes the text in a document into specific topics. It uses Dirichlet distributions to model the topics of each document and the words of each topic.

Johann Peter Gustav Lejeune Dirichlet, a prominent German mathematician of the 19th century, made significant contributions to the advancement of modern mathematics. Among his enduring legacies is the Dirichlet Distribution, a probability distribution named in his honor, which serves as the foundation of Latent Dirichlet Allocation (LDA).

How does LDA work?

Latent Dirichlet Allocation (LDA) is a method for associating sentences with topics. LDA discerns a specific set of topics based on the number of topics it is asked to find. Before generating these topics, it relies on a few preparatory steps and a set of assumptions, which we establish first.

Assumptions of LDA for Topic Modelling:

  • Documents with similar topics use similar groups of words
  • Latent topics can then be found by searching for groups of words that frequently occur together in documents across the corpus
  • Documents can be viewed as probability distributions across latent topics, indicating that certain documents will contain a higher proportion of words related to specific topics.
  • Topics themselves are probability distributions over words

These are the assumptions users must understand before applying LDA.

Explanation through example:

Suppose we have the following statements:

  • Cristiano Ronaldo and Lionel Messi are both great football players
  • People also admire Neymar and Ramos for their football skills
  • The USA and China are both powerful countries
  • China is building the largest air purifier
  • India is also emerging as one of the fastest-developing countries by promoting football on a global scale

With the help of LDA, we can discover the topics these sentences are about. If we consider 2 topics, then:

  • Sentence 1 and Sentence 2 both belong to topic 1
  • Sentence 3 and Sentence 4 both belong to topic 2
  • Sentence 5 depicts 70% Topic 1 and 30% Topic 2

LDA posits that each document comprises a variety of contexts corresponding to different topics, so a document can be represented as a mixture of topics, each of which is in turn composed of many words with specific probabilities. In LDA, each document has its own characteristics, and the model assumes a few rules before a document is generated. For instance, there is typically a word limit, with the user specifying a certain number of words. Diversity of content is also expected, with a document ideally drawing on a range of contexts, such as 60% business, 20% politics, and 10% food. Moreover, every keyword in a document is associated with a particular topic, and this relationship can be modeled with a multinomial distribution: a word might, for instance, belong to the business domain with probability 3/5 and to politics with probability 1/5.

Assuming this generative model holds for a collection of documents, LDA tries to backtrack from the documents to identify a set of topics that are likely to have generated them.

Now let's walk through how it works in full.

Since we have a set of documents, sourced from a specific dataset or obtained randomly, we first fix a number K of topics to uncover. We then use LDA to learn the topic representation of each document and the words associated with each topic.

The LDA algorithm begins by iterating through each document and randomly assigning each word in the document to one of the K topics. This random assignment already provides topic representations for all documents and word distributions for all topics, but they are far from optimal. To improve them, LDA iterates over every word in every document and applies an update formula; this is where the core of the algorithm comes into play.

Plate Notation representing LDA model:

  • M denotes the number of documents
  • N is the number of words in a given document (document i has N_i words)
  • α is the parameter of the Dirichlet prior on the per-document topic distributions
  • β is the parameter of the Dirichlet prior on the per-topic word distribution
  • θ_i is the topic distribution for document i
  • φ_k is the word distribution for topic k
  • z_ij is the topic for the j-th word in document i
  • w_ij is the specific word
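
To make the notation concrete, here is a minimal sketch of the generative story that LDA assumes, written with NumPy. The vocabulary, K, α, and β values below are toy choices purely for illustration, not anything taken from the dataset used later.

import numpy as np

rng = np.random.default_rng(42)

vocab = ["football", "player", "goal", "country", "economy", "trade"]  # toy vocabulary
K = 2        # number of topics
alpha = 0.5  # Dirichlet prior on per-document topic distributions (theta)
beta = 0.5   # Dirichlet prior on per-topic word distributions (phi)

# phi_k ~ Dirichlet(beta): one word distribution per topic
phi = rng.dirichlet([beta] * len(vocab), size=K)

# theta_i ~ Dirichlet(alpha): topic mixture for one document
theta = rng.dirichlet([alpha] * K)

# For each word position: draw a topic z from theta, then a word w from phi[z]
words = []
for _ in range(8):
    z = rng.choice(K, p=theta)
    words.append(rng.choice(vocab, p=phi[z]))

print("topic mixture:", theta.round(2))
print("generated document:", " ".join(words))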

Explanation in simple terms:

For every word in every document and for each Topic T, we calculate:

P(Topic T | Document D) = the proportion of words in Document D that are currently assigned to Topic T

P(Word W | Topic T) = the proportion of assignments to Topic T, across all documents, that come from this word W

We then reassign the word W to a new topic, choosing Topic T with probability P(Topic T | Document D) * P(Word W | Topic T). This is essentially the probability that Topic T generated word W.

After repeating the previous step a large number of times, we eventually reach a roughly steady state where the assignments become acceptable. At this stage, each document is assigned to a specific topic. We can then search for the words that have the highest probability of being assigned to a given topic.
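
As an illustration only, here is a rough sketch of that resampling loop on toy data. It follows the two proportions above directly; real samplers also exclude the current word from the counts and add Dirichlet smoothing, which this sketch skips for clarity.

import random
from collections import defaultdict

K = 2
docs = [["football", "player", "goal"],
        ["country", "economy", "trade"],
        ["football", "country", "trade"]]

# Random initial assignment of every word to one of the K topics
assignments = [[random.randrange(K) for _ in doc] for doc in docs]

def current_counts():
    doc_topic = [defaultdict(int) for _ in docs]        # topic counts per document
    topic_word = [defaultdict(int) for _ in range(K)]   # word counts per topic
    topic_total = [0] * K                                # total words per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = assignments[d][i]
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
    return doc_topic, topic_word, topic_total

# Repeatedly reassign each word in proportion to
# P(Topic T | Document D) * P(Word W | Topic T)
for _ in range(100):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            doc_topic, topic_word, topic_total = current_counts()
            weights = []
            for t in range(K):
                p_t_given_d = doc_topic[d][t] / len(doc)
                p_w_given_t = topic_word[t][w] / max(topic_total[t], 1)
                weights.append(p_t_given_d * p_w_given_t + 1e-12)
            assignments[d][i] = random.choices(range(K), weights=weights)[0]

print(assignments)   # word-to-topic assignments after many sweeps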

We end up with output such as:

  • Document assigned to topic 4
  • Most common words (highest probability) for topic 4: ('cat', 'vet', 'birds', 'dog', …)
  • It is up to the user to interpret these topics.

Two important notes:

  • The user must decide on the number of topics present in the documents
  • The user must interpret what the topics are

In essence, if we have a collection of documents and want a set of topics to represent them, LDA achieves this by training on each document: it iterates through the documents and assigns words to topics. This isn't a single-step process. Initially, LDA assigns words to topics at random, kicking off the learning procedure, and then traverses each word in each document, applying the formula discussed earlier. Over many iterations, this eventually yields a coherent set of topics.

Implementation

We will try to understand LDA more concretely by applying it to a dataset.

The dataset we’re utilizing comprises information or news sourced from www.npr.org, encompassing the latest global news. Our objective is to implement LDA on these news columns to discern the most prevalent topics worldwide. Additionally, we aim to assign topics to future news articles based on our findings.

Data Pre-Processing:

import pandas as pd
df = pd.read_csv('npr.csv')
df.head()

Notice how we don’t have the topic of the articles! Let’s use LDA to attempt to figure out clusters of the articles.

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_df=0.95, min_df=2, stop_words='english')
dtm = cv.fit_transform(df['Article'])

Count Vectorizer: CountVectorizer is a fundamental component of natural language processing and is often compared with TF-IDF. TF-IDF weights a token by how often it occurs in a document relative to how often it occurs across the corpus, whereas CountVectorizer simply counts occurrences. Since LDA works on raw word counts, we use CountVectorizer here to convert the text data into a machine-readable document-term matrix.

  • max_df: float in the range [0.0, 1.0] or int, default=1.0
    Used to remove words that appear too frequently. max_df = 0.50 means "ignore terms that appear in more than 50% of the documents"; max_df = 25 means "ignore terms that appear in more than 25 documents". The default of 1.0 means "ignore terms that appear in more than 100% of the documents", so no terms are ignored.
  • min_df: float in the range [0.0, 1.0] or int, default=1
    Used to remove words that appear too rarely. min_df = 0.01 means "ignore terms that appear in less than 1% of the documents"; min_df = 5 means "ignore terms that appear in fewer than 5 documents". The default of 1 means "ignore terms that appear in fewer than 1 document", so no terms are ignored.
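
To see what these two parameters actually do, here is a tiny demonstration on made-up sentences (purely illustrative toy data, not the NPR articles used above):

from sklearn.feature_extraction.text import CountVectorizer

toy_docs = [
    "football is a popular sport",
    "football fans watch football matches",
    "the economy depends on trade",
    "trade between countries is growing",
]

# Default settings: all tokens of two or more characters become columns
cv_all = CountVectorizer()
cv_all.fit(toy_docs)
print(sorted(cv_all.vocabulary_))

# Keep only terms that appear in at most 50% of the documents (max_df=0.5)
# and in at least 2 documents (min_df=2), with English stop words removed
cv_pruned = CountVectorizer(max_df=0.5, min_df=2, stop_words='english')
dtm_toy = cv_pruned.fit_transform(toy_docs)
print(sorted(cv_pruned.vocabulary_))   # only 'football' and 'trade' survive
print(dtm_toy.toarray())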

LDA model:

from sklearn.decomposition import LatentDirichletAllocation
LDA = LatentDirichletAllocation(n_components=7,random_state=42)
LDA.fit(dtm)

Showing Stored Words:

import random

len(cv.get_feature_names())   # note: newer scikit-learn versions use cv.get_feature_names_out()
>>>54777

for i in range(10):
    random_word_id = random.randint(0, 54776)
    print(cv.get_feature_names()[random_word_id])
>>>cred
fairly
occupational
temer
tamil
closest
condone
breathes
tendrils
pivot

for i in range(10):
    random_word_id = random.randint(0, 54776)
    print(cv.get_feature_names()[random_word_id])
>>>foremothers
mocoa
ellroy
liron
ally
discouraged
utterance
provo
videgaray
archivist

Showing top words per topic

len(LDA.components_)
>>>7
len(LDA.components_[0])
>>>54777

single_topic = LDA.components_[0]

# Returns the indices that would sort this array
single_topic.argsort()

# Word least representative of this topic
single_topic[18302]

# Word most representative of this topic
single_topic[42993]

# Top 10 words for this topic:
single_topic.argsort()[-10:]
>>>array([33390, 36310, 21228, 10425, 31464,  8149, 36283, 22673, 42561,
       42993], dtype=int64)

top_word_indices = single_topic.argsort()[-10:]
for index in top_word_indices:
    print(cv.get_feature_names()[index])

These look like business articles, perhaps. We will perform .transform() on our vectorized articles to attach a topic label to each one. But first, let's view all the topics found.

for index, topic in enumerate(LDA.components_):
    print(f'THE TOP 15 WORDS FOR TOPIC #{index}')
    print([cv.get_feature_names()[i] for i in topic.argsort()[-15:]])
    print('\n')

Attaching Discovered Topic Labels to Original Articles

topic_results = LDA.transform(dtm)
df['Topic'] = topic_results.argmax(axis=1)
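
As a quick sanity check, and to meet the goal stated earlier of labeling future articles, here is one way to score a new, unseen piece of text with the fitted objects above (the headline below is made up purely for illustration):

# Peek at the articles together with their assigned topic numbers
df[['Article', 'Topic']].head()

# Score a brand-new article: reuse the fitted vocabulary, then get its topic mixture
new_article = ["The central bank raised interest rates to slow inflation"]
new_dtm = cv.transform(new_article)
new_topic_probs = LDA.transform(new_dtm)
print(new_topic_probs.round(3))                       # distribution over the 7 topics
print("assigned topic:", new_topic_probs.argmax(axis=1)[0])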

Limitations

  • There is a limit to the number of topics we can generate
  • LDA cannot model correlations between topics, so the discovered topics are assumed to be uncorrelated
  • Topics do not develop over time
  • LDA assumes words are exchangeable; sentence structure is not modeled
  • It is unsupervised (sometimes weak supervision is desirable, e.g. in sentiment analysis)

With this, you have the complete idea of Latent Dirichlet Allocation (LDA). Enjoy.
