Probabilistic Topic Models
Mixture of Unigram Language Models.
UNDERSTANDING FROM THE VERY BASICS
While more and more texts are available online, we simply do not have the human power to read and study them. We need new computational tools to help organize, search, and understand these vast amounts of information. To this end, machine learning researchers have developed probabilistic topic modeling, a suite of algorithms that aim to discover and annotate large archives of documents with thematic information. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes. These algorithms do not require any prior annotations or labeling of the documents: the topics emerge from the analysis of the original texts. Topic modeling enables us to organize and summarize electronic archives at a scale that would be impossible by human annotation [1].
Let’s consider a document, a “text mining paper”, denoted by d. The document contains a plethora of vocabulary. Some words uniquely represent, or tell us about, the document’s content, like “text”, “association”, “clustering”, “computer”, while others are very common words like the determiners “the”, “a”, “an”, “this”, “that”, “those”, which, apart from their grammatical role, are not that important for characterizing our document. So our aim is to separate those unique words from the common words.
Let’s start by assigning each word a probability:

p(w | d) = count of w in d / total number of words in d

There are a number of probability distributions that could be used to generate the probability of each word in the document; here the probability is computed simply by dividing each word’s count by the total count of all words in the document.
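To make this concrete, here is a minimal Python sketch (the toy sentence standing in for document d is invented purely for illustration) that computes each word’s probability as its count divided by the total number of words:

```python
from collections import Counter

def word_probabilities(text):
    """Estimate p(w | d) as count(w, d) divided by the total number of words in d."""
    words = text.lower().split()  # naive whitespace tokenization
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

# Toy stand-in for the "text mining paper" d (invented sentence).
doc = "the text mining paper presents a text clustering and association method"
for word, prob in sorted(word_probabilities(doc).items(), key=lambda kv: -kv[1]):
    print(f"{word:12s} {prob:.3f}")
```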
SO HOW DO WE GET RID OF COMMON WORDS?
Well, let’s consider two distributions, name them the “background” distribution θ_B and the “topic” distribution θ_d, and imagine them as two bags. Each distribution contains both topic words and common words with their respective probabilities; the only difference is that one distribution assigns a higher probability to a particular word than the other does. For example, “text” is assigned a higher probability in the topic distribution than in the background distribution, while “the” gets a higher probability in the background distribution.
For simplicity, I have set the probability of choosing either distribution to be equally likely (0.5 each). However, these probabilities can be different, depending on how much weight we want to give each distribution.
How do we compute the probability of observing “the” and “text” in our text mining paper?

First, let’s calculate the probability for the word “the”. Under the mixture model it is

p(“the”) = p(θ_d) · p(“the” | θ_d) + p(θ_B) · p(“the” | θ_B) = 0.5 · p(“the” | θ_d) + 0.5 · p(“the” | θ_B)

This is read as: the probability of first choosing the topic distribution and then drawing “the” from it, plus the probability of first choosing the background distribution and then drawing “the” from it.

In exactly the same way, we calculate the probability for the word “text”.
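As a rough sketch of this calculation, here the two distributions are represented as dictionaries with invented probability values (none of these numbers come from the article), and the mixture probability of a single word is computed with the equal 0.5/0.5 weights:

```python
# Invented word probabilities under the two distributions (illustration only).
topic = {"text": 0.04, "mining": 0.035, "the": 0.000001, "a": 0.000001}
background = {"text": 0.000006, "mining": 0.000004, "the": 0.03, "a": 0.02}

P_TOPIC, P_BACKGROUND = 0.5, 0.5  # equal chance of picking either "bag"

def mixture_prob(word):
    """p(w) = p(topic) * p(w | topic) + p(background) * p(w | background)."""
    return P_TOPIC * topic.get(word, 0.0) + P_BACKGROUND * background.get(word, 0.0)

print("p('the')  =", mixture_prob("the"))   # mostly explained by the background bag
print("p('text') =", mixture_prob("text"))  # mostly explained by the topic bag
```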
BIGGER PICTURE OF THE MIXTURE MODEL
MIXTURE MODEL IN EQUATION FORM:

p(d) = [p(θ_d) · p(w_1 | θ_d) + p(θ_B) · p(w_1 | θ_B)] × … × [p(θ_d) · p(w_M | θ_d) + p(θ_B) · p(w_M | θ_B)]

Our aim is to find the combination of parameter values that best explains the document, so that we can discover the main topics that prevail in it. Earlier we calculated the probability of observing “the” in the document d. Similarly, we calculate the probability of observing each and every word (say the document has M words), and the likelihood of the whole document is the product of these M word probabilities.
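A small sketch of the document likelihood under this mixture model, again with invented word probabilities, and working in log space so the product over many words does not underflow:

```python
import math

P_TOPIC, P_BACKGROUND = 0.5, 0.5

# Invented word distributions, for illustration only.
topic = {"text": 0.04, "mining": 0.035, "the": 0.000001}
background = {"text": 0.000006, "mining": 0.000004, "the": 0.03}

def log_likelihood(words):
    """log p(d) = sum over the M words of log[ p(topic)p(w|topic) + p(bg)p(w|bg) ]."""
    total = 0.0
    for w in words:
        p_w = P_TOPIC * topic.get(w, 0.0) + P_BACKGROUND * background.get(w, 0.0)
        total += math.log(p_w)
    return total

doc = ["the", "text", "mining", "the", "text"]
print("log p(d) =", log_likelihood(doc))
```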
Now here is some interesting behavior to note about the mixture model.
Let’s suppose the entire document d contained just two words, “text” and “the”, whose probabilities under the topic distribution are not yet assigned; those are exactly the parameters we want to estimate. The likelihood of the document is then the product of two terms, one per word:

p(d) = [0.5 · p(“text” | θ_d) + 0.5 · p(“text” | θ_B)] × [0.5 · p(“the” | θ_d) + 0.5 · p(“the” | θ_B)]

Since we want to maximize this likelihood function, we look for the optimal values of the two unknown probabilities, subject to the condition that they must sum to one:

p(“text” | θ_d) + p(“the” | θ_d) = 1
Here’s a little mathematical fact:

When the sum of two variables is constant, their product is maximized when the two are equal (for example, if x + y = 10, then x·y is largest at x = y = 5).

Why all of a sudden a fact, and that too a mathematical one?? 😬

Well, we will be using this fact on our maximum likelihood estimate: once the constraint is applied, the sum of the two terms above is constant (it no longer depends on the topic probabilities), so the product is maximized when the two terms are equal.
BEHAVIOR 1:
When we set the two terms equal to each other in order to fulfill the constraint, we end up with a higher probability for the word “text” in the topic distribution. This is because the background distribution already assigns “the” a high probability (much higher than it assigns “text”), so the topic distribution compensates by giving “text” the larger share.

Hence the two distributions θ_d and θ_B collaborate to jointly maximize the likelihood, but at the same time they compete to explain each word. By fixing the background distribution so that it assigns high probabilities to the common words, we encourage the topic distribution to assign low probabilities to those common words. At the same time, the background distribution is less able to explain the content word “text” than the topic distribution is.
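Here is a small numerical check of this behavior. The background probabilities (0.1 for “text”, 0.9 for “the”) and the grid-search approach are my own assumptions for illustration; the sketch searches over p(“text” | θ_d), with p(“the” | θ_d) = 1 − p(“text” | θ_d), and finds that the likelihood peaks when the topic distribution puts most of its mass on “text”:

```python
# Assumed background probabilities: "the" is far more likely than "text".
BG_TEXT, BG_THE = 0.1, 0.9
LAMBDA = 0.5  # weight of the background distribution in the mixture

def likelihood(p_text):
    """Likelihood of a two-word document: one 'text' and one 'the'."""
    p_the = 1.0 - p_text  # constraint: the two topic probabilities sum to one
    term_text = (1 - LAMBDA) * p_text + LAMBDA * BG_TEXT
    term_the = (1 - LAMBDA) * p_the + LAMBDA * BG_THE
    return term_text * term_the

# Grid search over the single free parameter p("text" | topic).
grid = [i / 1000 for i in range(1001)]
best = max(grid, key=likelihood)
print(f"optimal p('text'|topic) = {best:.2f}, p('the'|topic) = {1 - best:.2f}")
# With these numbers the optimum is about 0.90 vs 0.10: the topic distribution
# gives the content word 'text' the larger share, exactly as argued above.
```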
BEHAVIOR 2:
What if the count of the word “the” increases in our document?

Then, if “the” now occurs n times (and “text” still occurs once), our maximum likelihood estimate would look like this:

p(d) = [0.5 · p(“text” | θ_d) + 0.5 · p(“text” | θ_B)] × [0.5 · p(“the” | θ_d) + 0.5 · p(“the” | θ_B)]^n
To maximize the ML estimate, the probability of choosing “the” from the topic distribution also increases, because “text” is now just one word among many in the document, and changing the probability of “text” no longer has much effect on the ML estimate compared to the n factors contributed by “the”.
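Continuing the same assumed toy setup (background probabilities 0.1 for “text” and 0.9 for “the”, equal mixture weights), this sketch repeats the grid search for increasing counts of “the” and shows the optimal p(“the” | θ_d) creeping upward:

```python
# Same assumed setup as before: background favors "the", equal mixture weights.
BG_TEXT, BG_THE = 0.1, 0.9
LAMBDA = 0.5

def likelihood(p_text, n_the):
    """Likelihood of a document with one 'text' and n_the occurrences of 'the'."""
    term_text = (1 - LAMBDA) * p_text + LAMBDA * BG_TEXT
    term_the = (1 - LAMBDA) * (1 - p_text) + LAMBDA * BG_THE
    return term_text * term_the ** n_the

grid = [i / 1000 for i in range(1001)]
for n_the in (1, 2, 4, 8):
    best = max(grid, key=lambda p: likelihood(p, n_the))
    print(f"count('the') = {n_the}: optimal p('the'|topic) = {1 - best:.2f}")
# The more often 'the' occurs, the more the topic distribution is pulled toward
# it, because the single occurrence of 'text' matters less and less.
```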
BEHAVIOR 3:
Now consider another scenario where the probabilities of the two distributions are not both 0.5, but instead the background distribution gets a larger weight, for example 0.7. The background distribution, which already gives “the” a high probability, now carries more weight in the mixture, so it explains the occurrences of “the” largely on its own. Hence it becomes even less important to assign the word “the” a high value in the topic distribution.
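Using the same assumed toy numbers one more time, this sketch compares a background weight of 0.5 with 0.7 and shows the optimal p(“the” | θ_d) dropping when the background carries more weight:

```python
# Same assumed background as before; only the mixture weights change.
BG_TEXT, BG_THE = 0.1, 0.9

def likelihood(p_text, lam_bg):
    """Likelihood of one 'text' and one 'the' with background weight lam_bg."""
    term_text = (1 - lam_bg) * p_text + lam_bg * BG_TEXT
    term_the = (1 - lam_bg) * (1 - p_text) + lam_bg * BG_THE
    return term_text * term_the

grid = [i / 1000 for i in range(1001)]
for lam_bg in (0.5, 0.7):
    best = max(grid, key=lambda p: likelihood(p, lam_bg))
    print(f"background weight = {lam_bg}: optimal p('the'|topic) = {1 - best:.2f}")
# With weight 0.7 the background explains 'the' almost entirely on its own, and
# in this toy setup the optimum pushes p('the'|topic) all the way down to zero.
```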
[1]: Blei, D. 2012. “Probabilistic Topic Models.” Communications of the ACM 55(4): 77–84. doi: 10.1145/2133806.2133826.
[2]: Qiaozhu Mei, Xuehua Shen, and ChengXiang Zhai. 2007. “Automatic Labeling of Multinomial Topic Models.” Proceedings of ACM KDD 2007. doi: 10.1145/1281192.1281246.
[3]: Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press.
That’s all for today. 💃🏻
Thank you for reading!