Hyperparameter tuning — Topic Coherence and LSI model

Eleonora Fontana
Betacom
Oct 26, 2020

Introduction

In this article we will begin a discussion of hyperparameter tuning, referring to the problem and models described in our previous articles, available at Betacom — Medium.

The models we will analyze in terms of hyperparameter tuning are the following:

  • term frequency — inverse document frequency (tf-idf),
  • Latent Semantic Indexing (LSI),
  • doc2vec.

We will begin with the theoretical discussion of a fascinating aspect that impacts LSI. The other two models will be discussed later.

LSI hyperparameter

The LSI method, as already described in Latent Semantic Indexing in Python | Betacom, is defined using the num_topics parameter. It corresponds to the number of requested factors (latent dimensions) and represents the number of topics in the given corpus.

model = gensim.models.LsiModel(corpus, num_topics=k)

In linguistics, the topic of a sentence is what is being talked about. Choosing too few topics will produce results that are overly broad, while choosing too many will result in the “over-clustering” of a corpus into many small, highly similar topics. For example, if we consider the Wikipedia entries we can identify topics such as mathematics, history, physics, philosophy and geography. If instead we limit ourselves to mathematics, then the topics are algebra, analysis, statistics, etc.

The num_topics parameter can be tuned using the Topic Coherence measure. If you are already familiar with it, feel free to skip the next section.

Topic Coherence

Topic Coherence is a metric that aims to emulate human judgment in order to determine the number of topics within a given corpus, i.e. the num_topics parameter that defines the LSI model.

A set of documents is said to be coherent if the documents support each other. Thus, a coherent corpus can be interpreted in a context that covers all or most of the documents. An example of such a corpus is the following:

  • “the game is a team sport”
  • “the game is played with a ball”
  • “the game demands great physical efforts”.

Topic Coherence measures score a single topic by measuring the degree of semantic similarity between high-scoring words in that topic. Different coherence measures exist, each using a different definition of semantic similarity and/or a different way of combining the pairwise scores. In this article we will focus on two of the most common coherence measures: UMass and UCI, which we will call c_umass and c_uci.

Both c_umass and c_uci are based on the same high-level idea: the topic coherence is the sum of the degrees of semantic similarity (scores) between frequent word pairs. The definition is the following:

coherence = Σᵢ<ⱼ score(wᵢ, wⱼ)

where w₁, …, wₙ are the words used to describe the topic, usually the top n words by frequency. The difference between the two coherence types is the definition of score(wᵢ, wⱼ).
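The sum above can be sketched in a few lines of Python. The word pairs and their similarity scores here are hypothetical, purely to show the shape of the computation:

```python
from itertools import combinations

def topic_coherence(top_words, score):
    """Sum the pairwise score over all (wi, wj) pairs of the topic's top words."""
    return sum(score(wi, wj) for wi, wj in combinations(top_words, 2))

# hypothetical pairwise similarity scores, for illustration only
scores = {("game", "ball"): 0.8, ("game", "team"): 0.6, ("ball", "team"): 0.4}
c = topic_coherence(["game", "ball", "team"], lambda a, b: scores[(a, b)])
```

Any pairwise score function can be plugged in, which is exactly how c_umass and c_uci differ.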

C_umass is based on document co-occurrence counts, as it uses the following pairwise score function:

score(wᵢ, wⱼ) = log( (D(wᵢ, wⱼ) + ε) / D(wⱼ) )

where D(wᵢ) is the count of documents containing the word wᵢ and D(wᵢ, wⱼ) is the count of documents containing both words wᵢ and wⱼ. The ε term is added to avoid evaluating log(0).
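As a sketch of this score function on a toy corpus (the documents and ε = 1 are illustrative assumptions, not values from the article's dataset):

```python
import math

def umass_score(w_i, w_j, docs, eps=1.0):
    """UMass pairwise score: log((D(wi, wj) + eps) / D(wj)),
    with D(.) counting the documents that contain the given word(s)."""
    d_j = sum(1 for doc in docs if w_j in doc)
    d_ij = sum(1 for doc in docs if w_i in doc and w_j in doc)
    return math.log((d_ij + eps) / d_j)

# toy corpus: each document is a set of tokens
docs = [{"game", "team", "sport"}, {"game", "ball"}, {"game", "effort"}]
print(umass_score("game", "ball", docs))  # log((1 + 1) / 1) ≈ 0.693
```

Because the score only needs document frequencies from the training corpus itself, no external reference corpus is required, which is why UMass is cheap to evaluate.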

C_uci uses the Pointwise Mutual Information as pairwise score function:

score(wᵢ, wⱼ) = PMI(wᵢ, wⱼ) = log( p(wᵢ, wⱼ) / (p(wᵢ) · p(wⱼ)) )

where p(wᵢ) represents the probability of seeing wᵢ in a random document, and p(wᵢ, wⱼ) the probability of seeing wᵢ and wⱼ co-occurring in a random document.
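A minimal sketch of this PMI score, estimating the probabilities from document frequencies of a toy corpus (in practice an ε term is usually added to guard against zero co-occurrence counts; it is omitted here for simplicity):

```python
import math

def pmi_score(w_i, w_j, docs):
    """PMI of two words, with probabilities estimated as document frequencies.
    Assumes the pair co-occurs at least once (no epsilon smoothing)."""
    n = len(docs)
    p_i = sum(1 for doc in docs if w_i in doc) / n
    p_j = sum(1 for doc in docs if w_j in doc) / n
    p_ij = sum(1 for doc in docs if w_i in doc and w_j in doc) / n
    return math.log(p_ij / (p_i * p_j))

docs = [{"game", "team"}, {"game", "ball"}, {"ball", "team"}]
print(pmi_score("game", "ball", docs))
```

A positive PMI means the two words co-occur more often than independence would predict; a negative PMI (as in this toy example) means they co-occur less often.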

In the next section we will define the topic coherence in Python using the gensim library. For full documentation please check this page.

Topic Coherence in Python

As stated in the gensim documentation, UMass is the fastest method to evaluate topic coherence. We will therefore use it to compute the topic coherence of different LSI models in order to choose the best num_topics value.

The best topic coherence is achieved when few documents contain a pair wᵢ, wⱼ, since each wᵢ then represents a different topic. Looking at the UMass topic coherence definition, this means the best topic coherence is achieved when D(wᵢ) ≫ D(wᵢ, wⱼ) for each i, j, i.e. the best value corresponds to the minimum of the topic coherence.

The models will be based on the movie-plot task described in the previous articles. We will use the Wikipedia Movie Plots Dataset, which is available at this page and consists of ~35,000 movies.

Let’s start installing the latest version of gensim and import all the packages we need.

!pip install --upgrade gensim
from gensim.parsing.preprocessing import preprocess_documents
from gensim.corpora import Dictionary
from gensim.models import TfidfModel, LsiModel
from gensim.models.coherencemodel import CoherenceModel
import matplotlib.pyplot as plt
import pandas as pd

We can now load the dataset and store the plots into the corpus variable. In order to avoid RAM saturation, we will only use movies with release year ≥ 2000.

df = pd.read_csv('wiki_movie_plots_deduped.csv', sep=',')
df = df[df['Release Year'] >= 2000]
text_corpus = df['Plot'].values

The next step is to preprocess the corpus. Please refer to this article for full explanation of this operation.

processed_corpus = preprocess_documents(text_corpus)
dictionary = Dictionary(processed_corpus)
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]

We will now load the tf-idf model from the gensim library and use it to represent the corpus in the new vector space.

tfidf = TfidfModel(bow_corpus, smartirs='npu')
corpus_tfidf = tfidf[bow_corpus]

Let’s now define a function to compute the topic coherence for a given num_topics value and apply it to the num_topics values used in Latent Semantic Indexing in Python | Betacom.

def compute_coherence_UMass(corpus, dictionary, k):
    lsi_model = LsiModel(corpus=corpus, num_topics=k)
    coherence = CoherenceModel(model=lsi_model,
                               corpus=corpus,
                               dictionary=dictionary,
                               coherence='u_mass')
    return coherence.get_coherence()

coherenceList_UMass = []
numTopicsList = [20, 100, 200, 300, 400, 500, 800, 1000, 1500]
for k in numTopicsList:
    c_UMass = compute_coherence_UMass(corpus_tfidf, dictionary, k)
    coherenceList_UMass.append(c_UMass)

We can plot the results in order to determine which num_topics value corresponds to the minimum topic coherence.
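One way to draw that plot is the sketch below, which assumes the numTopicsList and coherenceList_UMass variables computed above; the coherence values shown here are placeholders so the snippet runs on its own, not the article's actual results:

```python
import matplotlib.pyplot as plt

# placeholder values standing in for the loop's output, for illustration only
numTopicsList = [20, 100, 200, 300, 400, 500, 800, 1000, 1500]
coherenceList_UMass = [-0.6, -0.9, -1.1, -1.3, -1.5, -1.7, -2.0, -2.3, -2.1]

plt.plot(numTopicsList, coherenceList_UMass, marker='o')
plt.xlabel('num_topics')
plt.ylabel('UMass topic coherence')
plt.show()

# the best num_topics is the one at the minimum of the curve
best_k = numTopicsList[coherenceList_UMass.index(min(coherenceList_UMass))]
```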

As the figure shows, the best num_topics value is 1000, since it corresponds to the minimum topic coherence value.

Conclusions

In our previous article we assigned the value 1000 to the num_topics parameter based on the fact that it gave the best average result for our queries, a rather empirical approach. Here we saw that this value was indeed correct, since it corresponds to the minimum of the Topic Coherence measure.

Tuning the parameter using topic coherence is definitely faster and more accurate than just looking at query results.
