Latent Dirichlet Allocation

Aditya Beri · Published in CodeChef-VIT · Jun 20, 2020

Topic Modeling using LDA

TOPIC MODELING

Topic modeling allows us to efficiently analyze large volumes of unlabeled text by clustering documents into topics. A “topic” consists of a cluster of words that frequently occur together. Using contextual clues, topic models can connect words with similar meanings and distinguish between different uses of words with multiple meanings.

Keep in mind that a large amount of text data is unlabeled, so we cannot apply the supervised learning approaches we used earlier to build machine learning models for it. In the real world, text data rarely comes with labels attached, so we instead attempt to discover labels for our “unlabeled” data. For text, this means trying to discover clusters of documents that are grouped by topic. But we must always remember that we don’t know what the “correct” label is! All we can say is that documents clustered together share a similar topic or idea.

LDA

Introduction

Suppose you have the following set of sentences:

  • I love to eat mangoes and strawberries.
  • I ate a mango and strawberry smoothie for breakfast.
  • Cubs and puppies are cute.
  • We adopted a puppy yesterday.
  • Look at this delicious cream on a piece of strawberry.

What is latent Dirichlet allocation? It’s a way of automatically discovering topics that these sentences contain. For example, given these sentences and asked for 2 topics, LDA might produce something like

  • Sentences 1 and 2: 100% Topic A
  • Sentences 3 and 4: 100% Topic B
  • Sentence 5: 60% Topic A, 40% Topic B
  • Topic A: 30% strawberry, 15% mangoes, 10% breakfast, 10% cream, … (you could interpret Topic A to be about food)
  • Topic B: 20% cubs, 20% puppies, 20% cute, … (you could interpret Topic B to be about cute animals)
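To make this concrete, here is a minimal sketch (in Python, using scikit-learn’s CountVectorizer and LatentDirichletAllocation) of running LDA on these five sentences and asking for 2 topics. On such a tiny corpus the exact proportions vary from run to run, so treat the percentages above as illustrative.

```python
# A minimal sketch: fit LDA with 2 topics on the five toy sentences.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

sentences = [
    "I love to eat mangoes and strawberries.",
    "I ate a mango and strawberry smoothie for breakfast.",
    "Cubs and puppies are cute.",
    "We adopted a puppy yesterday.",
    "Look at this delicious cream on a piece of strawberry.",
]

# LDA works on raw word counts (bag of words), not TF-IDF.
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(sentences)

lda = LatentDirichletAllocation(n_components=2, random_state=42)
doc_topic = lda.fit_transform(dtm)  # one row per sentence, one column per topic

print(doc_topic.round(2))  # each row is that sentence's mixture over the 2 topics
```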

Johann Peter Gustav Lejeune Dirichlet was a 19th-century German mathematician who contributed widely to the field of modern mathematics. A probability distribution, the “Dirichlet distribution”, is named after him, and Latent Dirichlet Allocation is built on this distribution.

LDA is a topic model that generates topics based on word frequency from a set of documents. LDA is particularly useful for finding reasonably accurate mixtures of topics within a given document.

Some basic assumptions behind LDA:

  • Documents with similar topics use similar groups of words.
  • Topics can therefore be found by searching for groups of words that frequently occur together in documents across the corpus.
  • Documents are probability distributions over latent topics.
  • Topics themselves are probability distributions over words.

We can imagine that any particular document will have a probability distribution over a given number of latent topics. Say we imagine there are five latent topics across various documents; then any particular document has a probability of belonging to each of those topics. For example, document 1 might have the highest probability of belonging to topic 2. So, for each document, we have a discrete probability distribution across topics.

Let us now consider another document, document 2. It too has some probability of belonging to every topic, but its highest probability is of belonging to topic 4.

Notice that we are not saying definitively that document 1 or document 2 belongs to any particular topic; instead, we are modelling each of them as having a probability distribution over a variety of topics.

If we now look at the topics themselves, they are simply modelled as probability distributions over words. We can define Topic 1 by the probability each word in the vocabulary has of belonging to that topic. Words such as “he” and “food” might have low probabilities of belonging to Topic 1, whereas words like “cat” and “dog” have higher probabilities. This is where we, as users, try to understand what the topic represents: given such a probability distribution across the whole vocabulary of the corpus, we would ask for the 10 highest-probability words for Topic 1 and then try to work out what the actual topic is. In the case above, we could make an educated guess that Topic 1 is about pets. LDA, being an unsupervised technique, cannot tell us that directly; we have to study the probability distributions and interpret them.
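Continuing the earlier scikit-learn sketch (this assumes the `vectorizer`, `lda`, and `doc_topic` objects from that snippet), this is roughly how we would pull out the highest-probability words per topic and check the document-topic distributions:

```python
import numpy as np

# Vocabulary learned by the vectorizer (use get_feature_names() on older scikit-learn).
words = vectorizer.get_feature_names_out()

# lda.components_ holds one row of word weights per topic.
for topic_idx, topic_weights in enumerate(lda.components_):
    top_idx = np.argsort(topic_weights)[::-1][:10]   # indices of the 10 largest weights
    print(f"Topic {topic_idx}:", [words[i] for i in top_idx])

# Each document's topic probabilities sum to 1.
print(doc_topic.sum(axis=1))   # ~[1. 1. 1. 1. 1.]
```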

LDA represents documents as mixtures of topics that spit out words with certain probabilities.

We follow these steps (a toy simulation of this generative process is sketched after the list):

  • Decide on the number of words N the document will have.
  • Choose a topic mixture for the document, according to a Dirichlet distribution over a fixed set of K topics, e.g. 60% business, 20% politics, 10% food.
  • Generate each word in the document by first picking a topic according to the multinomial distribution we sampled above (60% business, 20% politics, 10% food), and then using that topic to generate the word itself (using the topic’s own multinomial distribution over words). For example, if we selected the food topic, we might generate the word “pineapple” with 60% probability, “home” with 25% probability, and so on.
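Here is a toy simulation of this generative story in Python with NumPy; the vocabulary, topic-word probabilities, Dirichlet parameter, and document length are made-up values purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

K = 3                                                # fixed number of topics
vocab = ["pineapple", "home", "stock", "vote", "cream", "market"]

# Hypothetical topic-word distributions, one row per topic (each row sums to 1).
topic_word = np.array([
    [0.60, 0.25, 0.01, 0.01, 0.12, 0.01],            # a "food"-like topic
    [0.01, 0.04, 0.45, 0.05, 0.01, 0.44],            # a "business"-like topic
    [0.02, 0.08, 0.10, 0.70, 0.02, 0.08],            # a "politics"-like topic
])

N = 8                                                # number of words in the document
theta = rng.dirichlet(alpha=[0.5] * K)               # the document's topic mixture

doc = []
for _ in range(N):
    z = rng.choice(K, p=theta)                       # pick a topic for this word
    w = rng.choice(len(vocab), p=topic_word[z])      # pick a word from that topic
    doc.append(vocab[w])

print(theta.round(2), doc)
```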

Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection.

Now imagine we have a set of documents, we have chosen some fixed number K of topics to discover, and we want to use LDA to learn the topic representation of each document and the words associated with each topic. We start by going through each document and randomly assigning each word in it to one of the K topics. This random assignment already gives us both topic representations of all the documents and word distributions of all the topics (though not very good ones yet).
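A tiny sketch of that random initialization (the word-id documents here are made up for illustration; the same variables are reused in the resampling sketch further below):

```python
import numpy as np

rng = np.random.default_rng(0)

K = 2                                                # number of topics to discover
V = 5                                                # vocabulary size
docs = [[0, 3, 3, 1], [2, 4, 4], [1, 0, 3, 2, 2]]    # toy documents as word ids

# One random topic id per word, mirroring the structure of `docs`.
z = [[int(rng.integers(K)) for _ in doc] for doc in docs]
print(z)   # e.g. [[1, 0, 0, 0], [1, 1, 0], [0, 1, 1, 1, 0]]
```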

Now we are going to iterate over every word in every document to improve these topics.

For every word w in every document d, and for each topic t, we calculate:

p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t

p(word w | topic t) = the proportion of assignments to topic t, over all documents, that come from this word w

We then reassign w a new topic, choosing topic t with probability

p(topic t | document d) * p(word w | topic t)

This is essentially the probability that topic t generated the word w.

After repeating the previous step a large number of times, we eventually reach a roughly steady state where the assignments are acceptable. At the end, every word has a topic assignment, which gives us the topic mixture of each document; we can also look up the words that have the highest probability of being assigned to each topic.
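Below is a simplified sketch of this resampling loop, reusing the `docs`, `z`, `K`, `V`, and `rng` variables from the initialization sketch above. Real implementations use collapsed Gibbs sampling with Dirichlet smoothing priors and incremental count updates; this naive version just recomputes the counts each time:

```python
import numpy as np

def gibbs_pass(docs, z, K, V, rng, eps=1e-9):
    """One pass over the corpus: resample the topic of every word in every document."""
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            # Recount assignments, leaving out the word we are resampling.
            doc_topic = np.full(K, eps)         # words in document d assigned to each topic
            topic_word = np.full((K, V), eps)   # per-topic word counts over all documents
            topic_total = np.full(K, eps)       # total assignments per topic
            for d2, doc2 in enumerate(docs):
                for j, w2 in enumerate(doc2):
                    if d2 == d and j == i:
                        continue
                    t = z[d2][j]
                    if d2 == d:
                        doc_topic[t] += 1
                    topic_word[t, w2] += 1
                    topic_total[t] += 1

            # p(topic t | document d) * p(word w | topic t), normalized to sum to 1.
            p = (doc_topic / doc_topic.sum()) * (topic_word[:, w] / topic_total)
            z[d][i] = int(rng.choice(K, p=p / p.sum()))
    return z

# Repeat many passes until the assignments roughly stabilize.
for _ in range(200):
    z = gibbs_pass(docs, z, K, V, rng)
print(z)
```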

We end up with an output of this kind: for each topic, a ranked list of its highest-probability words.

Now it’s up to the user to interpret and name these topics!

Thanks to Jose Portilla, whose work helped throughout.

All the above discussion is of no use if we don’t put it to practical use, so I am providing the link to my GitHub repo for this topic, with a hands-on project using Latent Dirichlet Allocation. I advise you to work through it. The link to the dataset is also provided in the repo itself!

This was just a small sneak peek into what Topic Modeling is and how LDA works.
Feel free to respond to this blog below with any doubts and clarifications!
