Roll the dice to write your email
An informal introduction to probabilistic topic modelling
Exploration is a crucial aspect of data analysis: it provides a first insight into what is happening in our data set and helps us decide the next steps.
If we are dealing with numerical variables, like age or weight, exploration is relatively easy: finding the minimum, maximum, mean and median of a variable is often enough to get an idea of how that variable is distributed across our sample. In the case of categorical variables, like nationality or gender, we can rank them by frequency — or group the numerical variables by category to find how those categories are “shaped”.
When we deal with unstructured (or non-tabular) data though, such as text, things get more complicated. Let’s say in our table we have a column called email, where each row is a message sent to us by a customer. How do we compute the “mean” of that column? In other words, how do we summarise the content of our email messages without having to read them all?
Topic Modelling tries to answer the question above by detecting the underlying themes — or topics — that appear in our set of documents, where a “document” can be any kind of text object like an email, a text message, a call centre note or a report.
What is a topic, though? If we want a computer to find topics, we need to express the concept of “topic” in a rigorous, formal way. We can’t just tell the computer “go and find related concepts”. That’s why we need a mathematical model of a topic.
Writing a random email
Latent Dirichlet Allocation (LDA) tells us that every email message can be seen as a sequence of dice rolls. How?
For simplicity, suppose we can use only 6 words to write a very short email. We can use a single die where every face corresponds to a word.
What happens if we roll the die 3 times? We may get a weird sequence like:
copy repair cover
But we may also get something more understandable like:
boiler need repair
Let’s put some weights on the boiler, need and repair faces then. This way, our weighted die will be more likely to yield meaningful phrases, all related to a “boiler repair” topic.
However, what if we want to write sentences about a different topic, for example if we need a copy of our energy bill? We can’t do that with our current die — but we can build another die with different weights! Starting from the same initial die, the new one will be weighted on the need, copy and bill faces. Here’s what our two dice would look like:
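If you prefer code to pictures, here is a minimal sketch of one of these weighted dice in action. The words and the weights are made up for illustration; only the idea of drawing words from a weighted die comes from the model.

```python
import numpy as np

# Our toy 6-word vocabulary, one word per face (words and weights are made up)
words = ["boiler", "need", "repair", "copy", "cover", "bill"]

# Two weighted dice: one biased towards "boiler repair",
# the other towards "bill copy"
boiler_repair_die = [0.35, 0.25, 0.30, 0.03, 0.04, 0.03]
bill_copy_die     = [0.04, 0.25, 0.03, 0.33, 0.02, 0.33]

rng = np.random.default_rng(42)

# Roll the "boiler repair" die three times to write a short message
message = rng.choice(words, size=3, p=boiler_repair_die)
print(" ".join(message))  # e.g. "need boiler repair"
```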
Learning the topics
Of course, we knew which faces to weight because we know that, in our world, repair boiler makes more sense than copy boiler. However, our computer doesn’t know what makes sense and what doesn’t, so it needs a way to figure it out. The idea is that it can learn this by scanning all the messages we have in our data set.
The two most popular algorithms for fitting LDA are variational inference and Gibbs sampling, and both are too complex to describe here. Let me sketch roughly how we can use them.
Suppose we want to find 3 topics in our set of messages: this means we start with 3 identical unbiased dice (like the one in Figure 1). The computer loops over our messages again and again, adjusting the weights of each die at every step based on what it sees. To put it simply, words that appear together in a certain group of messages will be assigned more weight on the same die.
At the end of the process, the algorithm will have tweaked the dice so that they are able to reproduce, as closely as possible, the messages it saw in our data set.
This means we may end up with two dice like those in Figure 2, plus the third one looking like this:
By inspecting the weights we can conclude that the 3 topics in our set of messages are “boiler repair”, “bill copy” and “boiler cover”. And we didn’t have to read anything!
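As a small preview of the implementation post, here is roughly what this learning step could look like with Scikit-Learn on a few made-up messages. The messages below are invented for illustration, and the topics you get back will depend on the data and on the random seed.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# A handful of made-up customer messages (in practice: thousands of emails)
messages = [
    "need repair for my boiler",
    "boiler need repair urgently",
    "need a copy of my bill",
    "please send a copy of my last bill",
    "does my cover include the boiler",
    "boiler cover renewal",
]

# Count how often each word appears in each message
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages)

# Ask for 3 topics (3 dice) and let the algorithm tweak the weights
lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(X)

# Inspect the faces with the highest weights on each die
vocab = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top_faces = weights.argsort()[::-1][:3]
    print(f"Topic {k}:", ", ".join(vocab[i] for i in top_faces))
```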
Multi-topic messages
So, in order to find the topics, our computer must learn how to generate the messages in our data set. Most of the time though, email messages (and documents in general) contain more than one topic — if we want to generate realistic messages, we must take this into account. Using one of the dice we saw previously would only help us write single-topic messages. We could use more than one die, but how do we choose which one to roll, and when?
We could flip a coin! For example, let’s say we want to write a message about two topics, so we use two dice. We flip a coin to decide which die to use, roll it and write down the word we get. We do the same for the next word, and so on until we decide to stop. The resulting message will be a bit chaotic and the word order not very coherent, but the presence of two topics should be easy to detect by looking at the words.
In this configuration, every word has the same chance of being drawn from Topic 1 or Topic 2. However, we could also add a weight to the coin, so that our biased coin chooses one topic die more often than the other: we could generate a message that is 80% about “boiler repair” and 20% about “boiler cover”. This weight can be computed by the same algorithms we mentioned in the dice-building phase: in fact, all the weights are computed at the same time.
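Here is a tiny sketch of this two-dice-plus-biased-coin recipe: the coin picks a topic for each word, then the corresponding die is rolled. Again, all the weights are made up.

```python
import numpy as np

words = ["boiler", "need", "repair", "copy", "cover", "bill"]

# Two topic dice with made-up weights
topics = {
    "boiler repair": [0.35, 0.25, 0.30, 0.03, 0.04, 0.03],
    "boiler cover":  [0.40, 0.20, 0.03, 0.02, 0.32, 0.03],
}
topic_names = list(topics)

# The biased coin: 80% "boiler repair", 20% "boiler cover"
coin = [0.8, 0.2]

rng = np.random.default_rng(0)
message = []
for _ in range(5):                             # write a 5-word message
    topic = rng.choice(topic_names, p=coin)    # flip the biased coin
    word = rng.choice(words, p=topics[topic])  # roll the chosen die
    message.append(word)
print(" ".join(message))
```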
Scaling up
In real life we won’t get very far by using only six words. To write a message in English, we would need a 180,000-face die. Not only that: we could have dozens of topics across all our documents, so we would need another, say, 100-face weighted die instead of a simple coin.
Luckily, this is exactly what running LDA on a computer does: it builds those special dice, that is, it estimates probability distributions. After all, a standard die is nothing but a uniform probability distribution over 6 different values. A topic in English is a non-uniform probability distribution over 180,000 different values (words). A document, in turn, can be a non-uniform probability distribution over those 100 topics. Or at least, this is what Latent Dirichlet Allocation assumes.
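Incidentally, the “Dirichlet” in the name refers to the probability distribution the model uses to generate these weighted dice in the first place. Here is a quick sketch with NumPy; the numbers of topics and words, and the alpha values, are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# A document's mixture over, say, 5 topics: one random weighted "coin/die"
doc_topic_weights = rng.dirichlet(alpha=[0.1] * 5)
print(doc_topic_weights.round(3), doc_topic_weights.sum())  # weights sum to 1

# A topic's weights over a 10-word toy vocabulary: another random die
topic_word_weights = rng.dirichlet(alpha=[0.01] * 10)
print(topic_word_weights.round(3))  # a small alpha concentrates weight on a few words
```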
Conclusions — what’s next?
I hope this post helped shed some light on the assumptions behind Latent Dirichlet Allocation, which is still one of the most popular approaches to Topic Modelling.
In the next post I would like to go through a step-by-step implementation of LDA in Python using Scikit-Learn and pyLDAvis, with a section about how to create a report where every document is tagged with the assigned topics.