From LDA to Short Text Topic Modeling — Negative Sampling and Quantization Topic Model (NQTM)

Fanbomeng
Published in Stellia.ai
9 min read · Sep 30, 2021

Introduction

What is topic modeling? As an example: given a set of articles, what general topics do they belong to, and what are these topics? The machine learning (ML) models that answer these questions are topic models.

Topic modeling has been used in many applications and is widely discussed online, especially the statistics-based approaches, of which Latent Dirichlet Allocation (LDA) is a representative. These statistical models are widely used and have proven very successful in many fields.

Despite the success of LDA-type models in topic modeling, they have proven not very effective on short texts, while short texts play increasingly important roles nowadays: text messages, Twitter posts, movie or product reviews, etc.

NQTM is a very powerful short text topic model that can be trained with gradient descent like other ML models. But in order to really understand, and even modify, the code of this model, understanding the mechanics of LDA is very helpful.

In this blog we are going to focus on the following points:

  • How LDA really works in math and in programming
  • How topics can be extracted with unsupervised machine learning models: where and how does the model get topics, given that no labeled data are used?
  • How NQTM works
  • Extra code, not provided in the official git repository, that outputs the (text, topic) pairs

How LDA really works in math and in programming

LDA is an unsupervised model, so the training process is also the prediction process. A simplified way of thinking about this process: for each document,

  • LDA computes the probability of different topics with respect to this document. For example, if we only have three topics, (1, history), (2, art), (3, science), then Doc-A may have the output (0.2, 0.2, 0.6) after being processed with LDA, and topic three (science) is the one that holds the highest probability for Doc-A.
  • LDA also computes the matrix that holds the probability distribution P(word w | topic t). For each topic in this matrix (say row zero), it tells what the probability is that this topic generates each word. For example, the topic science very likely has high probabilities for words like physics, experiment, theory, etc.

Of course, there are more steps in the real pipeline even before passing the documents to LDA, for example tokenization, removing stop words, lemmatization, etc.
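To make this concrete, here is a minimal sketch (not from the original post) of these two outputs using gensim's LdaModel; the toy documents and parameter values are for illustration only:

import warnings; warnings.filterwarnings("ignore")
from gensim import corpora
from gensim.models import LdaModel

# Toy, already-preprocessed documents (tokenized, stop words removed).
texts = [["physics", "experiment", "theory"],
         ["painting", "museum", "art"],
         ["war", "empire", "history"]]

dictionary = corpora.Dictionary(texts)             # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]    # bag-of-words vectors

lda = LdaModel(corpus, num_topics=3, id2word=dictionary, passes=10)

print(lda.get_document_topics(corpus[0]))  # P(topic | Doc) for the first document
print(lda.show_topic(0, topn=3))           # P(word | topic) for topic 0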

Then the next question: in a real case, how is LDA actually set up?

The first question you may have is what “Dirichlet” in LDA exactly means.

The Dirichlet distribution is a statistical distribution used to model the topic distribution of a document; its mathematical definition is given in the Appendix.

Next, as mentioned above, the two main tasks people use LDA for are:

  • Get the top topics among all of the documents: P(topic | Doc)
  • For each document, the main topics that belong to it: P(topic | Doc) · P(word | topic)

These two quantities are also the final outputs of other topic modeling models.

A good way to understand how LDA works is to think about it in reverse: how to use the two distributions P(topic | Doc) and P(word | topic) to reconstruct the documents.

LDA is a generative probabilistic model. The basic assumption in LDA is that documents can be represented as random mixtures over latent topics, and each topic is characterized by a distribution over words. Here the distribution of a document over topics follows the Dirichlet distribution. In steps [2]:

LDA assumes the following generative process for documents

  1. Choose N ∼ Poisson(ξ).
  2. Choose θ ∼ Dir(α).
  3. For each of the N words wₙ: (a) choose a topic zₙ ∼ Multinomial(θ); (b) choose a word wₙ from p(wₙ ∣ zₙ, β), a multinomial probability conditioned on the topic zₙ.

If we put the above steps in math form, we get the following:
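For reference, the resulting joint distribution of the topic mixture θ, the topics z, and the words w given in [2] is:

p(θ, z, w ∣ α, β) = p(θ ∣ α) ∏ₙ p(zₙ ∣ θ) p(wₙ ∣ zₙ, β)

where the product runs over the N words of the document.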

For the parameters mentioned above, choose θ from the Dirichlet distribution with the given α and a selected number of topics K. The word probabilities are parameterized by a K x V matrix β, which needs to be determined later. N is independent of all the other data-generating variables; it is an ancillary variable.

The details are shown in A.3 of Reference [2]. By computing the derivatives of the KL divergence and setting them equal to zero, we get the following relationship between ϕ and γ:
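The updates from [2] (the original figure is not reproduced here) are:

ϕₙᵢ ∝ βᵢ,wₙ · exp( Ψ(γᵢ) − Ψ(Σⱼ γⱼ) )
γᵢ = αᵢ + Σₙ ϕₙᵢ

where Ψ is the digamma function, ϕₙᵢ is the variational probability that word n is assigned to topic i, and γ is the variational Dirichlet parameter.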

With the above relationship, the optimization of the variational parameters is done per document by iterating the following updates [2]:
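As an illustration (my own minimal sketch, not code from [2]), this per-document coordinate ascent can be written with numpy as follows, assuming beta is the K x V topic-word matrix, alpha the Dirichlet parameter, and doc_word_ids the word indices of a single document:

import numpy as np
from scipy.special import digamma

def variational_inference(doc_word_ids, alpha, beta, n_iter=50):
    """Coordinate ascent for one document, following the updates above."""
    K = beta.shape[0]
    N = len(doc_word_ids)
    phi = np.full((N, K), 1.0 / K)          # phi[n, i] = q(z_n = i)
    gamma = alpha + N / K * np.ones(K)      # variational Dirichlet parameter
    for _ in range(n_iter):
        # phi update: proportional to beta[i, w_n] * exp(digamma(gamma_i))
        log_phi = np.log(beta[:, doc_word_ids].T + 1e-12) + digamma(gamma)
        phi = np.exp(log_phi - log_phi.max(axis=1, keepdims=True))
        phi /= phi.sum(axis=1, keepdims=True)
        # gamma update: alpha plus expected topic counts
        gamma = alpha + phi.sum(axis=0)
    return phi, gamma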

So far, it should be clear how LDA works and how its optimization is performed in practice.

Negative Sampling and Quantization Topic Model (NQTM)

As mentioned at the very beginning, the motivation of NQTM is to deal with short texts. NQTM is an auto-encoding framework that addresses the unsupervised short text topic modeling problem. In this model, two essential steps are taken:

  • mapping the topic distribution into an appropriately defined embedding space;
  • a new negative sampling decoder to improve topic diversity.

The performance of NQTM is very impressive compared with other unsupervised topic modeling models [3].

The evaluation metrics used here are topic coherence (Cᵥ) and topic uniqueness (TU).

Topic coherence Cᵥ depends on co-occurrences of the topic words learned by the model in an external corpus, assuming that coherent words should co-occur within a certain distance [4]. It has been shown to perform better than other coherence metrics like the widely used NPMI and UMass. Details can be found in references [3] and [4].

TU is used for topic diversity evaluation. The score ranges from 1/K to 1, and a higher value means the generated topics are more diverse, with fewer duplicated words across topics.

Usually, higher TU goes with lower Cᵥ and higher Cᵥ goes with lower TU, because coherent words tend to repeat across topics while diverse topics repeat few words [3].
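As an illustration (my own sketch of the TU idea described above, not code from [3]), TU can be computed roughly as follows, where topics is a list of top-word lists, one per topic:

from collections import Counter

def topic_uniqueness(topics):
    """Average, over topics and their top words, of 1 / (number of topics'
    top-word lists the word appears in): 1 = all unique, 1/K = all shared."""
    counts = Counter(word for top_words in topics for word in top_words)
    per_topic = [sum(1.0 / counts[w] for w in top_words) / len(top_words)
                 for top_words in topics]
    return sum(per_topic) / len(per_topic)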

NQTM model structure and processing steps

In NQTM, MLPs (multilayer perceptrons) are used in both the encoder and the decoder. The model also assumes that the short text x is represented as a bag-of-words vector, from which the encoder produces a continuous topic distribution θₑ.

There are mainly three steps of the model:

  • Short text Encoder
  • Topic Distribution Quantization
  • Negative sampling Decoder
Fig: NQTM model structure, from [3]

For the short text encoder, a simple network structure is adopted and θₑ is calculated as follows:
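The encoder equation is given as an image in the original post; a common form for such an encoder, and my assumption here, is a softmax over the output of the encoder MLP:

θₑ = softmax( MLPₑ(x) )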

After the encoding step, the next step is topic distribution quantization. A discrete embedding space e = (e₁, e₂, …, eᴮ) ∈ Rᴷˣᴮ is set, where K is the number of topics and B is the size of the space. To encourage maximum distance between the embedding vectors, the embedding matrix is extended: the first K vectors are initialized as an identity matrix, and the additional vectors (eᴷ⁺¹, …, eᴮ) are initialized with uniform unit scaling.

This representation of the embedding matrix is optimized during the training.

The short text x is encoded to θₑ and then mapped to the embedding space as:
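The mapping equation is shown as an image in the original post; consistent with the description below, it is the standard nearest-neighbor quantization:

θ_q = eₖ, where k = argminⱼ ‖θₑ − eⱼ‖₂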

θₑ is replaced by the embedding vector eₖ that is closest to it, giving the quantized θ_q. Then θ_q is passed to the decoder, which is also an MLP.
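As an illustration (my own numpy sketch, not the official implementation), the quantization step and the embedding initialization described above could look like this, assuming theta_e has shape (K,) and the embedding matrix has shape (K, B):

import numpy as np

def quantize(theta_e, embeddings):
    """Replace theta_e with its nearest embedding vector (a column of the K x B matrix)."""
    # squared Euclidean distance from theta_e to every embedding vector
    dists = np.sum((embeddings - theta_e[:, None]) ** 2, axis=0)
    k = int(np.argmin(dists))
    return embeddings[:, k], k

def init_embeddings(K, B):
    """First K columns form an identity matrix; the remaining B - K columns
    use uniform unit scaling (the exact range here is an assumption)."""
    extra = np.random.uniform(-1.0 / K, 1.0 / K, size=(K, B - K))
    return np.concatenate([np.eye(K), extra], axis=1)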

In order to generate more diverse topics, a negative sampling scheme is formulated. Words with high probabilities in other topics but not assigned to the current short text are taken as negative samples. Specifically, if the top t topics are chosen, then the model removes these top t and samples one negative topic Zₙ from the remaining (K − t) topics with equal probability.

Then M words are generated from Zₙ; these are the words that the decoder should not include, since they are not from the top t topics.

The loss function has two components: the contribution from the top t topics (reconstruction) and the negative contribution.

Here x⁽ᶦ⁾ refers to the i-th short text in the corpus. Together with an l₂ regularization term, this gives the final form of the loss function.
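The loss equations are shown as images in the original post. As a rough illustration only (my own assumption of one plausible form, not necessarily the exact loss in [3]), such a two-part loss could look like this, where x_hat is the decoder's word probability vector and neg_word_ids are the M negative words:

import numpy as np

def nqtm_style_loss(x, x_hat, neg_word_ids, gamma=1.0):
    """Reconstruction of the observed bag-of-words x plus a penalty that pushes
    down the probabilities of the negative-sample words."""
    eps = 1e-12
    recon = -np.sum(x * np.log(x_hat + eps))                # positive (reconstruction) part
    neg = -np.sum(np.log(1.0 - x_hat[neg_word_ids] + eps))  # negative-sample part
    return recon + gamma * neg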

This is the background of the model. In a real application, the objective is to get the collection of topics and also the topics of each input short text. With the embedding space, NQTM first gets the top t topic embeddings for each short text input and then extracts the topic words associated with them. This is very similar to the two matrices LDA uses for generation.

NQTM usage example

The input text should be preprocessed before being passed to the model; the steps are the following:

Processing details:

  • tokenize each text and remove non-Latin characters and stop words by using NLTK
  • filter out short texts with length less than 2
  • remove words with document frequency less than 5
  • convert all letters to lower case
  • build the count vectors with sklearn (a sketch is given after the note below)

Note: more detailed procedures are in the file utils/preprocess.py
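As an illustration (my own sketch, not the repository's utils/preprocess.py), the steps above could look roughly like this:

import re
from nltk.corpus import stopwords           # requires nltk.download('stopwords')
from nltk.tokenize import word_tokenize     # requires nltk.download('punkt')
from sklearn.feature_extraction.text import CountVectorizer

stop_words = set(stopwords.words('english'))

def clean(text):
    tokens = word_tokenize(text.lower())    # lower-case and tokenize with NLTK
    return [t for t in tokens
            if re.fullmatch(r'[a-z]+', t) and t not in stop_words]  # keep Latin words, drop stop words

raw_texts = ["an example short text", "another example text here"]  # placeholder input
docs = [clean(t) for t in raw_texts]
docs = [d for d in docs if len(d) >= 2]                    # drop texts shorter than 2 tokens

vectorizer = CountVectorizer(min_df=1)                     # in the paper setting, min_df=5
bow = vectorizer.fit_transform(' '.join(d) for d in docs)  # bag-of-words count matrix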

Following the instructions on the official GitHub page (link), the model can be run, but it only outputs the collection of topic words for all of the documents, not (Doc, topic) pairs.

An extra function needs to be added to the package. In the file run_NQTM.py, add the following code; when called in the main function, it generates a file with the selected number of topics and the topic words associated with each doc.

def print_topic_words(beta, theta, feature_names, n_topic_selection=2, n_top_words=2):
    # beta: topic-word matrix, theta: doc-topic matrix, feature_names: vocabulary
    # (relies on `os` and `args` already imported/defined in run_NQTM.py)
    top_words = list()
    doc_topic_word = list()
    for i in range(len(beta)):
        top_words.append(" ".join([feature_names[j] for j in beta[i].argsort()[:-n_top_words - 1:-1]]))
    for i in range(len(theta)):
        doc_topic_word.append(" ".join([top_words[j] for j in theta[i].argsort()[:-n_topic_selection - 1:-1]]))
    with open(os.path.join(args.output_dir, 'topic_words_T{}_K{}_{}th'.format(n_top_words, args.topic_num, args.test_index)), 'w') as file:
        for line in doc_topic_word:
            file.write(line + '\n')
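It can then be called at the end of the main routine, for example (the variable names beta, theta, and voc are assumptions; the actual names of the topic-word matrix, doc-topic matrix, and vocabulary in run_NQTM.py may differ):

print_topic_words(beta, theta, voc, n_topic_selection=2, n_top_words=10)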

Appendix section:

What is the Dirichlet distribution?

In probability and statistics, the Dirichlet distribution, often denoted Dir(α), is a continuous multivariate probability distribution parameterized by a vector α. The mathematical definition is as follows:

The Dirichlet distribution of order K ≥ 2 with parameters α₁, …, α_K > 0 has a probability density function with respect to the Lebesgue measure on the Euclidean space Rᴷ⁻¹ given by:
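f(x₁, …, x_K; α₁, …, α_K) = (1 / B(α)) ∏ᵢ xᵢ^(αᵢ − 1)

where x₁, …, x_K ≥ 0 and Σᵢ xᵢ = 1.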

B(α) is the normalization factor (the multivariate Beta function, a ratio of Gamma functions). The key idea of the Dirichlet distribution is relatively simple, as shown above: by changing α, the shape of the distribution changes [1].

Fig: Dirichlet distributions for different α, from [1]

Reference

  1. Wikipedia, Dirichlet distribution, https://en.wikipedia.org/wiki/Dirichlet_distribution
  2. David M. Blei, Andrew Y. Ng, Michael I. Jordan, Latent Dirichlet Allocation, Journal of Machine Learning Research 3 (2003) 993–1022
  3. Xiaobao Wu, Chunping Li, Yan Zhu, Yishu Miao, Short Text Topic Modeling with Topic Distribution Quantization and Negative Sampling Decoder, Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1772–1782, Association for Computational Linguistics, 2020
  4. Michael Röder, Andreas Both, Alexander Hinneburg, Exploring the Space of Topic Coherence Measures, Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, pages 399–408, ACM, 2015

--

Fanbomeng
Stellia.ai

Data scientist at Professorbob.ai and Ph.D. in experimental particle physics