Topic Modeling: Context and Latent Semantic Indexing

Mohammad Derakhshan
3 min read · Apr 29, 2022


In the beginning, we analyzed text by considering single words. The idea was that each word carries meaning, and that this meaning can be approximated by the frequency of the word in a document. We assumed that if a query shares more terminology with a document, the two are closer in meaning. But now we know that even though a single word has meaning, sharing it does not imply closeness to the concept behind the query, because what we call meaning usually consists of multiple words. Besides, a single word may have various meanings, and different words can express the same concept!

So, there is no way to capture the meaning of a document just by looking at single words. If we think about how we understand meaning, we realize it is not obtained by looking at one word in isolation but by how different words are put together. We can still consider that each word has a meaning, but this meaning should be inferred from the context!

Topic modeling captures the dimension of meaning that depends not on a single word but on a combination of words in a given context. We can think of a topic as an invisible glue that holds the terms of a document together. Latent Semantic Indexing (LSI) is a way to see this glue! The idea of using “meaning” is that, if we can express the meaning of a document with a few concepts, we can represent the document by those concepts instead of its individual words and limit the query search to them. Of course, we lose interpretability in this case.

The backbone of LSI is a matrix factorization technique. Assume we have a matrix whose rows are the documents and whose columns are all the unique words in the corpus. If we multiply this matrix by its transpose, we get a square matrix: either term-by-term or document-by-document, depending on the order of the multiplication. The critical part is what the values of that matrix mean. In the term-by-term matrix, the elements are proportional to the relations among words, motivated by the documents in which they co-occur. In the document-by-document matrix, we find the connections between documents, inspired by the words they share. A small sketch of building these matrices is given below.
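To make this concrete, here is a minimal sketch (not code from the article; the three-sentence corpus is made up) that builds a document-term count matrix and derives the two square matrices described above:

```python
# Minimal sketch: a document-term count matrix for a toy corpus, plus the two
# square matrices obtained by multiplying it with its transpose.
import numpy as np

corpus = [
    "cat sits on the mat",
    "dog sits on the mat",
    "stocks fell on the market",
]

# All unique words in the corpus become the columns.
vocab = sorted({word for doc in corpus for word in doc.split()})

# One row per document, one column per word, raw term counts as values.
X = np.array([[doc.split().count(word) for word in vocab] for doc in corpus])

doc_by_doc = X @ X.T      # documents related through the words they share
term_by_term = X.T @ X    # words related through the documents they co-occur in

print(doc_by_doc)          # the two "mat" sentences overlap far more than the third
print(term_by_term.shape)  # |vocab| x |vocab|
```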

Before going further, we need a quick recap of eigenvalues and eigenvectors:

Image from IR course, Alfio Ferrara, University of Milan

In simple words, we can rebuild a matrix from the product of its eigenvalues and eigenvectors. But there is a question here: is the size of λ related to anything? The higher λ is, the more of the matrix’s information the corresponding eigenvector carries. So if, for example, we have three eigenvalues and three eigenvectors and get rid of the smallest pair, we still obtain a good approximation of the original matrix. In other terms, the eigenvalues are a synthetic representation of the information provided by a matrix.
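As a hedged numerical illustration, with a small made-up symmetric matrix standing in for a real term-by-term matrix, NumPy shows both the exact rebuild and the approximation obtained after dropping the smallest eigenvalue:

```python
# Minimal sketch: rebuild a symmetric matrix from its eigenpairs, then drop the
# smallest eigenvalue and see that the approximation stays close to the original.
import numpy as np

# A made-up symmetric matrix standing in for a term-by-term matrix.
C = np.array([[4.0, 2.0, 0.0],
              [2.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

# eigh is meant for symmetric matrices; eigenvalues come back in ascending order.
eigvals, eigvecs = np.linalg.eigh(C)

# Exact reconstruction: C = V diag(lambda) V^T (up to floating-point error).
C_full = eigvecs @ np.diag(eigvals) @ eigvecs.T

# Keep only the two largest eigenpairs: a rank-2 approximation of C.
top = eigvals.argsort()[::-1][:2]
C_approx = eigvecs[:, top] @ np.diag(eigvals[top]) @ eigvecs[:, top].T

print(np.allclose(C, C_full))      # True
print(np.abs(C - C_approx).max())  # small: little information was lost
```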

Now you can guess what we are going to do: find the eigenvalues of the term-by-term matrix, keep the top K of them, and use the corresponding eigenvectors as the latent variables we are looking for.

One note worth mentioning here is that, since the term-by-term matrix is symmetric, its eigenvectors are perpendicular to each other. This also means they are linearly independent.
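A tiny sketch of these last two points, again with a made-up symmetric matrix in place of the term-by-term matrix: keep the top-K eigenpairs as latent directions and check that they are orthogonal.

```python
# Minimal sketch: the top-K eigenvectors of a symmetric (term-by-term-like)
# matrix serve as latent directions, and they form an orthonormal set.
import numpy as np

C = np.array([[4.0, 2.0, 0.0],
              [2.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])    # symmetric, like any X^T X

K = 2
eigvals, V = np.linalg.eigh(C)       # eigenvalues in ascending order
top = eigvals.argsort()[::-1][:K]    # indices of the K largest eigenvalues
latent_dirs = V[:, top]              # each column is one latent direction

# Orthogonality check: perpendicular columns of unit length give the identity.
print(np.allclose(latent_dirs.T @ latent_dirs, np.eye(K)))  # True
```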

Another concept we need to explain here is Singular Value Decomposition (SVD). You can think of SVD as a data-reduction tool: you start with high-dimensional data and reduce it to the “K” features that are vital for describing it. SVD is also the basis of Principal Component Analysis (PCA), which is widely used for dimensionality reduction based on correlation. I won’t go into the full story behind SVD, but our interpretation of it for LSI is depicted in the picture below:

Image from IR course, Alfio Ferrara, University of Milan
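As a small preview, and only as a sketch under the assumption that we use scikit-learn’s TfidfVectorizer and TruncatedSVD (one common way to implement LSI; the follow-up article may well use different tools), SVD-based retrieval can look like this:

```python
# Preview sketch of LSI retrieval with a truncated SVD, assuming scikit-learn.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

corpus = [
    "the cat sits on the mat",
    "a dog sits on the mat",
    "stocks fell sharply on the market",
    "the market rallied as stocks rose",
]

# Documents x terms matrix (tf-idf weights instead of raw counts).
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

# Keep K latent dimensions instead of the full vocabulary.
K = 2
svd = TruncatedSVD(n_components=K, random_state=0)
docs_k = svd.fit_transform(X)                 # documents x K

# Project the query into the same latent space and rank documents by cosine similarity.
query_k = svd.transform(vectorizer.transform(["cat on a mat"]))
sims = (docs_k @ query_k.T).ravel() / (
    np.linalg.norm(docs_k, axis=1) * np.linalg.norm(query_k) + 1e-12
)
print(sims.argsort()[::-1])                   # document indices, most relevant first
```

One practical reason for using TruncatedSVD here is that it works directly on the sparse document-term matrix, which keeps this approach usable on a realistically sized corpus.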

That was the theory behind LSI and how it can be used. In the following article, we will do some coding and use LSI to retrieve relevant documents. Stay tuned!

Mohammad Derakhshan

Hi! I'm Mohammad, a master's student at the University of Milan. I am an Android expert who loves NLP! You can find me on LinkedIn if you'd like to connect.