A Lite Introduction to Latent Dirichlet Allocation

Vonn N Johnson
3 min read · May 17, 2019


Natural Language Processing, the subfield of AI behind speech recognition, natural language understanding, and natural language generation, has been slowly finding its way into your daily life. It is a critical engine of AI market growth over the next decade in software, services, and hardware, and there is no stopping its rise. But how does it work? I want to give a surface-level introduction to the different aspects of Natural Language Processing that drive this growth and what makes it so cool.

One of these intriguing aspects of Natural Language Processing is topic modeling, specifically Latent Dirichlet Allocation. Topic modeling is a type of statistical modeling used to find the topics that may exist in a corpus. It is not always apparent what the topic of a document is. Say, for instance, you had a document that mentioned iPads, Thanos, and the Drake Equation. What would the topic be? A bit disorienting if you ask me, and that is exactly why we use Latent Dirichlet Allocation.

Latent Dirichlet Allocation can be broken down like this. First, “latent” means hidden, and since we are talking about topic modeling, we can say that we are trying to discover the hidden topics in a corpus. “Dirichlet” isn’t as apparent as “latent,” so let’s break that down. Named after Peter Gustav Lejeune Dirichlet, the Dirichlet distribution is a continuous multivariate probability distribution. All that to say: to determine the topic, we will look at the distribution of words and the probability of a word being associated with a topic. The distribution doesn’t have to be over words, but for now, for our understanding, we will stick to that.
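To make that slightly more concrete without any heavy math, here is a minimal sketch of drawing from a Dirichlet distribution using NumPy (my choice of library for illustration; the topic count and concentration values are made up). Each draw is a valid probability distribution, which is exactly what LDA needs for “how much of each topic is in this document.”

```python
import numpy as np

rng = np.random.default_rng(42)

# Concentration parameters for a Dirichlet over 3 hypothetical topics.
# Values below 1 favor sparse mixtures (documents dominated by one topic).
alpha = [0.5, 0.5, 0.5]

for _ in range(3):
    theta = rng.dirichlet(alpha)  # a vector of 3 probabilities summing to 1
    print(theta.round(3), "sums to", theta.sum())
```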

Since this is a lite introduction, I won’t be introducing any math, just an analogy to build intuition for what is going on. Let’s say you have two clipboards with sheets of words on them: Fish, Taco, and Bread on the first; Turtle, Bird, and Cat on the second. What would you say the topic of the first clipboard is? Food? Correct. And the second? Animals.

Okay, now let’s say the first clipboard says Fish, Lamb, and Chicken, and the other also says Fish, Lamb, and Chicken. Well, either one could be about Food or Animals. To drive the point home, let’s now assume that these words are mixed with a plethora of other words, so that there is a distribution over all of them. If the probability of Chicken, Fish, and Lamb showing up on the first clipboard is far less than 1% while words like Bird, Turtle, and Cow show up much more often, we start to get a sense that maybe the topic is Animals and not Food.
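This is essentially what an LDA implementation works out for you. As a rough sketch, here is how it might look with scikit-learn (one of several libraries that ship LDA; the toy corpus, topic count, and variable names here are mine, purely for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A toy corpus standing in for our two clipboards plus an ambiguous sheet.
docs = [
    "fish taco bread taco bread",
    "turtle bird cat bird cat",
    "fish lamb chicken bird turtle",
]

# Turn the documents into word counts, then fit a 2-topic LDA model.
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Show the highest-probability words for each discovered topic.
words = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top = [words[i] for i in weights.argsort()[::-1][:3]]
    print(f"Topic {k}: {top}")
```

Notice that the model only hands back word distributions per topic; naming one of them “Food” or “Animals” is still on you.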

See, not that scary. But what makes Latent Dirichlet Allocation great also makes it disappointing. LDA is considered a fuzzy model: there are no hard-and-fast answers when the model finishes, and the output is wide open for interpretation. There is no definitive metric to tell you whether you made the correct discovery or not. The model’s parameters let you tune your results, but you can never say for sure that you got it right.
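Continuing the scikit-learn sketch above, these are the main knobs you would turn (the parameter names are scikit-learn’s; the values here are purely illustrative):

```python
# doc_topic_prior and topic_word_prior are the Dirichlet concentration
# parameters, often called alpha and beta (or eta) in the literature.
lda = LatentDirichletAllocation(
    n_components=2,        # how many hidden topics to look for
    doc_topic_prior=0.1,   # smaller: each document favors fewer topics
    topic_word_prior=0.1,  # smaller: each topic favors fewer words
    random_state=0,
)
doc_topics = lda.fit_transform(counts)
print(doc_topics)  # one row per document: its mixture over the 2 topics
```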

That is also the beauty of it. It is up to you to determine the topic. It is very subjective, but after several passes through the model, the results begin to stabilize, and the associations between words start to make some sense as to what the topic may be. So next time you are working with a corpus of text and want to have some fun, give Latent Dirichlet Allocation a shot and see what you can discover.

