Latent Dirichlet Allocation

J.P. Rinfret
Published in The Startup
Dec 18, 2019 · 6 min read

What the heck is LDA and how is it used for topic modeling?


Humans have at least two advantages over computers: we understand context and we pick up on the emotions attached to text (“K.”). Computers, on the other hand, do neither of these things particularly well without a little help. So how does a computer cluster words into topics, as it does in, say, sentiment analysis and topic modeling?

That’s where LDA comes in handy. By assuming documents are nothing more than a probability distribution over topics, and topics are nothing more than a probability distribution over words, LDA calculates the probability that a document is mostly this topic or mostly that topic (e.g., Document N is 77% topic 1, 10% topic 2, 8% topic 5, and 5% topic 7) based on the words it contains.

Basically what this boils down to is this: documents with similar topics have a higher probability of using similar groups of words.
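For a sense of what that looks like in practice, here is a minimal sketch using scikit-learn’s LatentDirichletAllocation. The four-document corpus and the choice of two topics are assumptions made purely for illustration, not anything from the post.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A made-up four-document corpus, purely for illustration
docs = [
    "the home team scored a late goal to win the game",
    "congress passed a bipartisan bill after a long debate",
    "the new wearable gadget syncs data to your computer",
    "the away side lost the game after a low final score",
]

# LDA works on bag-of-words counts: word order, syntax, and emotion are ignored
counts = CountVectorizer(stop_words="english").fit_transform(docs)

# Fit LDA with an assumed number of topics (two here)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # shape: (n_documents, n_topics)

# Each row is a per-document topic distribution, something like "85% topic 0, 15% topic 1"
print(doc_topics.round(2))
```

Documents that share similar word groups end up with similar rows in that output, which is the intuition above in numeric form.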

But before we can take a look at how this is done, we must let go of all preconceived notions about how we humans draft text documents in order to understand how a computer thinks we draft text documents.

Plate Notation

As defined by Wikipedia:

In Bayesian inference, plate notation is a method of representing variables that repeat in a graphical model. Instead of drawing each repeated variable individually, a plate or rectangle is used to group variables into a subgraph that repeat together, and a number is drawn on the plate to represent the number of repetitions of the subgraph in the plate. The assumptions are that the subgraph is duplicated that many times, the variables in the subgraph are indexed by the repetition number, and any links that cross a plate boundary are replicated once for each subgraph repetition.

Plate Notation for Latent Dirichlet Allocation

Let’s define each model parameter:

  • M is the total number of documents within the corpus
  • N is the total number of words in a document
  • α is the Dirichlet prior that controls the per-document topic distributions. Visually, α points into M but does not penetrate N, symbolizing that α is shared across all of M (the documents) and governs how topics are mixed within each document. A high value for α implies each document is likely to contain a mixture of most of the topics, rather than being dominated by just one or two
  • β is the Dirichlet prior that controls the per-topic word distributions. Visually, β penetrates both M and N, making it a parameter that describes words at the topic level. A high value for β implies each topic is likely to contain a mixture of most of the words in the vocabulary
  • 𝚹 is the topic distribution of each document (𝚹 sits inside the M plate). You can compare the topic mixtures of document 1 and document 2 by comparing 𝚹₁ and 𝚹₂
  • Z is used to denote the topic assignment of each individual word. Z₁₂ is the topic assigned to the 2nd word in document 1.
  • W is used to denote each specific word. W₁₂ would be the 2nd word in document 1.

The goal of LDA is to work backwards from N, W, Z, and 𝚹 to recover the latent distributions that α and β govern.
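To keep the notation straight, it can help to picture each quantity as an array. The sketch below is only an illustration: the sizes M, K, V, and N are assumptions, and it follows the post’s loose naming, using theta for the per-document topic distributions and beta for the per-topic word distributions, each drawn from a Dirichlet prior with an assumed concentration value.

```python
import numpy as np

rng = np.random.default_rng(0)

M, K, V, N = 5, 3, 50, 100   # documents, topics, vocab size, words per doc (all assumed)

# Per-document topic distributions (theta) and per-topic word distributions (beta),
# drawn from Dirichlet priors with assumed concentration values.
theta = rng.dirichlet(np.full(K, 0.5), size=M)   # shape (M, K): one topic mixture per document
beta  = rng.dirichlet(np.full(V, 0.1), size=K)   # shape (K, V): one word distribution per topic

# Topic assignment Z and word W for every position in every document
Z = np.array([rng.choice(K, size=N, p=theta[d]) for d in range(M)])          # (M, N)
W = np.array([[rng.choice(V, p=beta[z]) for z in Z[d]] for d in range(M)])   # (M, N)
```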

Simply put, according to LDA, if we wanted to create a new document X, we would first determine the number of words in document X (which is N), choose a topic mixture for X over a fixed number of topics (i.e., set 𝚹), and then come up with the words of X by:

  • picking each word’s topic (Z) based on X’s topic distribution (𝚹)
  • picking a word (W) based on that word’s assigned topic (Z)

Once we have each word (W) and the topic of each word (Z), as well as the actual distribution of topics in each document (𝚹), we can count our way back to the word distribution of each topic (β). Viewed at the corpus level, we then have both of our latent probability distributions: topics per document and words per topic.
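That recovery step is just counting and normalizing. Below is a minimal, hypothetical helper (not from the post) that, given the topic assignment Z and word id W for every position in every document, tallies the per-document topic shares and per-topic word shares; it could be run directly on the Z and W arrays drawn in the earlier sketch.

```python
import numpy as np

def estimate_distributions(Z, W, K, V):
    """Given topic assignments Z and word ids W (both shape (M, N)),
    recover per-document topic shares and per-topic word shares by counting.
    Assumes every topic appears at least once."""
    M, N = Z.shape
    theta_hat = np.zeros((M, K))   # topic distribution of each document
    beta_hat = np.zeros((K, V))    # word distribution of each topic
    for d in range(M):
        for n in range(N):
            theta_hat[d, Z[d, n]] += 1
            beta_hat[Z[d, n], W[d, n]] += 1
    theta_hat /= theta_hat.sum(axis=1, keepdims=True)  # each row sums to 1
    beta_hat /= beta_hat.sum(axis=1, keepdims=True)
    return theta_hat, beta_hat
```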

Of course, this is not how an actual human being drafts a text document, though it has proven to be a useful generative process for topic modeling.

Let’s put numbers to an example

Say we have a group of newspaper articles that we assume to be only about sports, politics, and tech. Each of these topics is described by the following words:

  • Sports: game, score, win, lose, home, away
  • Politics: Republicans, Democrats, Trump, Pelosi, bipartisan
  • Tech: data, computer, machine, gadget, wearables

LDA would assume that these articles were drafted by first choosing the length of the article (a 1,000-word column would have N = 1,000). Then we would determine the topic mix of our article, or 𝚹. Let’s say our one-thousand-word column is 50% politics, 20% sports, and 30% tech. That means we would choose roughly 500 words from politics, 200 words from sports, and 300 words from tech. Finally, as we write the article, we would choose each word (W) along with its topic (Z) so that we meet this 50/20/30 split defined by 𝚹. If we were to fill the December 17th edition of this newspaper with various articles following the same creation algorithm, we could easily compute the newspaper’s latent word probability distribution for each topic (β), as well as the topic distribution of each article (𝚹, and hence the corpus-level pattern α describes).
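Here is that creation algorithm as a toy sketch. It treats the short word lists above as complete topic vocabularies and assumes uniform word probabilities within each topic, both simplifications made only so the example runs end to end.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(17)

# Toy topic vocabularies (the word lists from above), with uniform word
# probabilities within each topic -- an assumption for illustration only.
topics = {
    "sports":   ["game", "score", "win", "lose", "home", "away"],
    "politics": ["Republicans", "Democrats", "Trump", "Pelosi", "bipartisan"],
    "tech":     ["data", "computer", "machine", "gadget", "wearables"],
}

N = 1_000                                               # article length
theta = {"politics": 0.5, "sports": 0.2, "tech": 0.3}   # the 50/20/30 topic mix

names = list(theta)
weights = [theta[t] for t in names]

z_draws, article = [], []
for _ in range(N):
    z = rng.choice(names, p=weights)    # pick the word's topic (Z) from theta
    w = rng.choice(topics[z])           # pick a word (W) from that topic's vocabulary
    z_draws.append(z)
    article.append(w)

print(Counter(z_draws))   # roughly 500 politics, 200 sports, 300 tech
print(article[:10])       # the "article": a bag of words with no syntax or meaning
```

Print the topic counts and you should see roughly the 500/200/300 split; print the article itself and you will see why no human could read this newspaper.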

Remember syntax, context, emotion, etc. are lost on a computer! While we would never be able to comprehend this newspaper, a computer would have no problem modeling out the per-topic word probability distribution and the per-document topic probability distribution.

Working Backwards

Humans actually draft text documents in the opposite direction from LDA. We write a collection of words (phrases) that have context and meaning, but also follow basic rules of syntax. So, in order to use LDA to solve for α and β of an article written by a human (with context, syntax, emotion, etc.), we will follow the creation algorithm in reverse order.

First, let’s randomly assign each word W in each document X to one of K pre-determined topics (i.e., set Z for every W in every document X). Again, LDA assumes that every corpus has a latent per-document probabilistic topic distribution and a latent per-topic probabilistic word distribution. In other words, we assume that each document in a corpus can be mostly one of say 7 topics.

For each document X, we will assume that all randomly generated topic assignments Z for each word W are correct except for the current word W we are analyzing.

In order to find the correct topic assignment Z for this word W, we will need to calculate two probabilities:

  • The proportion of words in document X that are currently assigned to topic Z; this is 𝚹, or P(Z|X) = P(Z∩X) / P(X). In other words, the topic distribution of document X
  • The proportion of all topic-Z assignments, across every document, that belong to word W; this is β, or P(W|Z) = P(W∩Z) / P(Z)

We then give word W a new topic assignment Z by sampling in proportion to the product of these two probabilities, 𝚹 × β, i.e., P(Z|X) × P(W|Z). In other words, we are assuming that the current topic assignment of a word is wrong and using the inherent features of the document and corpus to assign the “correct” topic to that word based on probability.

If we iterate through each word W and repeat this process enough times, we should reach a steady state where all assignments make sense based on the assumed latent probability distributions and model parameters (i.e., our estimates of α and β will approach their true latent values).
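The post does not give code, but the procedure it describes is essentially a collapsed Gibbs sampler. Here is a minimal sketch under a few assumptions: documents arrive as lists of integer word ids, and small smoothing constants alpha and eta (not mentioned above) keep the probabilities well behaved.

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, eta=0.01, iters=200, seed=0):
    """docs: list of lists of integer word ids in [0, V). Returns topic assignments
    plus estimated per-document topic and per-topic word distributions."""
    rng = np.random.default_rng(seed)

    # Randomly assign every word to one of K topics and build the count tables.
    Z = [rng.integers(K, size=len(doc)) for doc in docs]
    n_dk = np.zeros((len(docs), K))   # words in document d assigned to topic k
    n_kw = np.zeros((K, V))           # times word w is assigned to topic k
    n_k = np.zeros(K)                 # total words assigned to topic k
    for d, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = Z[d][n]
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for n, w in enumerate(doc):
                k = Z[d][n]
                # Temporarily remove this word's current assignment...
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # ...then re-sample its topic: (theta-like term) * (beta-like term)
                p = (n_dk[d] + alpha) * (n_kw[:, w] + eta) / (n_k + V * eta)
                k = rng.choice(K, p=p / p.sum())
                Z[d][n] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    theta_hat = (n_dk + alpha) / (n_dk + alpha).sum(axis=1, keepdims=True)
    beta_hat = (n_kw + eta) / (n_kw + eta).sum(axis=1, keepdims=True)
    return Z, theta_hat, beta_hat
```

Each sweep applies the two bullet-point probabilities above to every word; once the count tables stop changing much, the normalized counts serve as the estimates of 𝚹 and β.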

And that, my friends, is LDA!

Credit Where Credit is Due

I highly recommend everyone watch Scott Sullivan’s YouTube video (here) on LDA, especially his summary of core assumptions and conclusions. His video was absolutely the basis for this blog and the foundation of my understanding of LDA.

J.P. Rinfret

TEM @ Komodo Health — Blogs & Opinions are Mine Only | Data Science & Machine Learning at Flatiron School | Mathematics at Fairfield University