Latent Semantic Indexing in Python

Eleonora Fontana
Betacom
Sep 28, 2020

Introduction

In the previous article (BOW + TF-IDF in Python for unsupervised learning task) we discussed the bag of words and tf-idf methods and used them to solve an unsupervised machine learning task.

The aim of this article is to improve the model defined using the BOW and tf-idf methods. To do so, we will introduce an indexing and retrieval method: Latent Semantic Indexing (LSI). It uses a mathematical technique called Singular Value Decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in a corpus (an unstructured collection of documents).

If you are already familiar with what LSI is and how it works, you can skip the first two sections and go directly to the example task.

What is LSI?

Latent Semantic Indexing is a common technique in the NLP field. It is used to analyze relationships between a set of documents and the terms they contain in order to produce a set of concepts related to the documents and terms.

Recall the vector space representation of documents and queries introduced in the previous article. This representation has a number of advantages:

  • uniform treatment of queries and documents as vectors,
  • induced score computation based on cosine similarity (see the sketch after this list),
  • ability to weight different terms differently,
  • applications as clustering and classification.
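
For concreteness, here is a minimal sketch of that cosine-similarity scoring, using numpy and made-up vector values rather than a real tf-idf matrix:

import numpy as np

# Toy document and query vectors in the same term space (values are made up).
doc = np.array([0.2, 0.0, 0.7, 0.1])
query = np.array([0.0, 0.0, 1.0, 0.0])

# Cosine similarity: dot product divided by the product of the vector norms.
score = np.dot(doc, query) / (np.linalg.norm(doc) * np.linalg.norm(query))
print(score)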

Why do we need LSI then?

The vector space representation is unable to cope with two classical problems that arise in natural languages: synonymy and polysemy.

Synonymy refers to a case where two different words (say “present” and “gift”) have the same meaning. The vector space representation fails to capture the relationship between synonymous terms such as “present” and “gift” since each of them is a separate dimension in the vector space. Suppose we have a document d = “Alex gave me a gift even if I didn’t buy her a present” and a query q = “gift”. The computed similarity between q and d will underestimate the true similarity that a human would detect.

On the other hand, polysemy refers to the case where a term such as “solution” has multiple meanings. Let d be the document “I wrote the solution of problem 5 on this paper”, then the computed similarity between d and the query q = “solution” overestimates the similarity that a human would perceive. We could use the co-occurrences of terms (whether, for instance, “solution” occurs in a document containing “problem” versus in a document containing “chemistry”) to capture the latent semantic associations of terms and alleviate these problems.

How does LSI work?

LSI is based on the distributional hypothesis which states that words that are close in meaning will occur in similar pieces of text.

The starting point is the TF-IDF representation matrix A of the distribution of the words within the set of documents. It is an m by n matrix where m is the number of unique words and n is the corpus cardinality (the number of documents). The element aᵢⱼ represents the tf-idf weight of word i in document j.

A mathematical technique called Singular Value Decomposition (SVD) is then applied to the matrix A in order to reduce the dimensionality of the data.

There are several reasons behind this step:

  • the original matrix A is presumed to be too large to handle computationally;
  • reducing the dimensionality helps remove some of the noise inherently present in the original matrix;
  • the matrix A is sparse, so it is possible to significantly reduce the cardinality and at the same time preserve much of the information.

The SVD computes the term and document vector spaces by approximating the term-frequency matrix A as follows:

A = T S Dᵀ

where T is the m by r term-concept vector matrix, S is the r by r diagonal matrix of singular values, and D is the n by r concept-document vector matrix, such that

TᵀT = I and DᵀD = I

i.e. the columns of T and D are orthonormal.
The next step is to truncate the SVD and keep only the k ≪ r largest diagonal entries in the singular value matrix S, where k is typically on the order of 100 to 300 dimensions. This effectively reduces the term matrix T to size m by k and the document matrix D to size n by k.

The combination of the SVD operation and this reduction reduces noise and other undesirable artifacts of the original space of A while preserving the most important semantic information in the text. The reduced set of matrices is often denoted with a modified formula such as:

A ≈ Aₖ = Tₖ Sₖ Dₖᵀ

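To make the decomposition and truncation concrete, here is a minimal numpy sketch on a made-up term-document matrix (this is only an illustration of the math, not how gensim implements LSI internally):

import numpy as np

# A toy term-document matrix: m = 5 terms, n = 4 documents (made-up tf-idf weights).
A = np.array([[0.8, 0.0, 0.1, 0.0],
              [0.0, 0.5, 0.0, 0.4],
              [0.3, 0.0, 0.9, 0.0],
              [0.0, 0.7, 0.0, 0.6],
              [0.2, 0.0, 0.3, 0.1]])

# Full SVD: A = T @ np.diag(s) @ D.T (numpy returns Dᵀ directly as the third output).
T, s, Dt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values (truncation).
k = 2
T_k, s_k, Dt_k = T[:, :k], s[:k], Dt[:k, :]

# Low-rank approximation A_k of the original matrix.
A_k = T_k @ np.diag(s_k) @ Dt_k
print(np.round(A_k, 2))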
The LSI technique can be implemented in Python using the gensim.models.LsiModel class. For the full documentation please check this page.

Problem description

The task we will solve is the same as the one from the previous article.

We will perform document similarity between movie plots and a given query.

We will use the Wikipedia Movie Plots Dataset, which is available at this page and consists of ~35,000 movie plots.

Our solution

Let’s start by installing the latest version of gensim and importing all the packages we need.

!pip install --upgrade gensim
import pandas as pd
import gensim
from gensim.parsing.preprocessing import preprocess_documents

We can now load the dataset and store the plots in the text_corpus variable.

In order to avoid RAM saturation, we will only use movies with release year ≥ 2000.

# Keep only movies released in 2000 or later to limit memory usage.
df = pd.read_csv('wiki_movie_plots_deduped.csv', sep=',')
df = df[df['Release Year'] >= 2000]
text_corpus = df['Plot'].values

The next step is to preprocess the corpus. Please refer to this article for a full explanation of this operation.

processed_corpus = preprocess_documents(text_corpus)
dictionary = gensim.corpora.Dictionary(processed_corpus)
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]

We will now load the tfidf model from the gensim library and use it to represent the corpus in the new vector space.

tfidf = gensim.models.TfidfModel(bow_corpus, smartirs='npu')
corpus_tfidf = tfidf[bow_corpus]

The LSI model will now be initialized using gensim.models.LsiModel and the corpus will be indexed, in preparation for similarity queries.

We will use the following parameters:

  • corpus = stream of document vectors or sparse matrix of shape num_terms (m) by num_documents (n),
  • num_topics = number of requested factors (the latent dimensions k introduced in the previous section).

lsi = gensim.models.LsiModel(corpus_tfidf, num_topics=200)
index = gensim.similarities.MatrixSimilarity(lsi[corpus_tfidf])
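
If you would like to peek at the latent dimensions the model has learned, you can print its strongest topics. The sketch below assumes the LsiModel was built with id2word=dictionary; otherwise the topics are shown as term ids rather than words:

# Show the 5 strongest topics, 8 terms each (requires id2word=dictionary at build time).
for topic in lsi.print_topics(num_topics=5, num_words=8):
    print(topic)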

Let’s now write the plot of the movie we would like to get from the similarity query. Suppose new_doc is the string containing the movie plot. The code to find the 10 most similar movies to it is the following:

new_doc = gensim.parsing.preprocessing.preprocess_string(new_doc)
new_vec = dictionary.doc2bow(new_doc)
vec_bow_tfidf = tfidf[new_vec]
vec_lsi = lsi[vec_bow_tfidf]
sims = index[vec_lsi]

for s in sorted(enumerate(sims), key=lambda item: -item[1])[:10]:
    print(f"{df['Title'].iloc[s[0]]} : {str(s[1])}")

We ran the code above for the following movies:

  • Inception: “infiltrate minds and extract information through a shared dream world. different levels of dreams. they want to implant an idea”,
  • Wreck-It Ralph: “In the arcade at night the video game characters leave their games. The protagonist is a girl from a candy racing game who glitches”,
  • Kill Bill: “Blonde bride goes to Japan and kills many people”,
  • Star Wars: “knight lightsaber starship”.

The results obtained were not good:

  • Inception was not in the top ten matches;
  • Wreck-It Ralph was the fifth match found;
  • the Kill Bill movies were in the first and second positions;
  • the Star Wars movies were in the 6th, 37th, 41st, 49th, 83rd, 86th and 698th positions.

We therefore decided to modify the num_topics parameter of the LSI model. The results are summarized in the following table, where each cell represents the movie position obtained with the respective num_topics value.

As a consequence, we decided to set num_topics = 1000 and run the query for a new movie: “Billy Elliot”.

We used the plot “Boy studies ballet in secret. His father wants him to go to the gym and box. They raise money for an audition in London” and got the following results:

Conclusions

Thanks to Latent Semantic Indexing we were able to improve the model based on the bag-of-words and tf-idf representations and address the synonymy and polysemy problems.

We also saw how the best results for our task were obtained with num_topics equal to 1000, but note that there are techniques to tune this hyperparameter. One of them is based on the Topic Coherence measure and will be discussed in a later article.
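
As a rough preview of what that tuning could look like, here is a minimal sketch using gensim's CoherenceModel; the candidate values for num_topics and the 'c_v' measure are only illustrative choices, not a tuned recipe:

from gensim.models import CoherenceModel

# Try a few candidate values for num_topics and compare their coherence scores.
for k in [200, 500, 1000]:
    lsi_k = gensim.models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=k)
    cm = CoherenceModel(model=lsi_k, texts=processed_corpus,
                        dictionary=dictionary, coherence='c_v')
    print(k, cm.get_coherence())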
