Tf-idf and doc2vec hyperparameter tuning

Eleonora Fontana
Betacom
Nov 9, 2020

Introduction

In this article we will continue the discussion about hyperparameter tuning, referring to the problem and models described in our previous studies, which are available at Betacom — Medium.

In particular, we will analyze the tf-idf and doc2vec models. The method we will use to choose the best parameters for the models will be explained in the first section. In the last two sections we will apply the method to the two models and discuss the results.

The tuning method

In the previous article we discussed hyperparameter tuning for the LSI model using the Topic Coherence measure. It is a metric that aims to emulate human judgment in order to determine the number of topics within a given corpus, i.e. the num_topics parameter that defines the LSI model.

Unfortunately, there is no analytical method to determine the best parameters for the tf-idf and doc2vec models. We will therefore define a semi-objective way to measure model performance.

With an unsupervised learning model, such as our tf-idf and doc2vec implementations, defining a metric is often a problem. The key point is that any metric we choose cannot be computed from human-labelled data, and it will depend on the application scenario rather than on the model itself. Our metric will therefore be based on our fictional application goal of a movie title retrieval engine.

First of all, we will choose a dataset on which to test the models as their parameters vary. For each model, the results obtained on the chosen dataset will then be evaluated: given the top n results, if a correct result is in position p, then n-p+1 is added to the overall score of the model. Let's look at an example to understand how the score is computed.

The study of the models will be based on the movie-plot task described in the previous articles. We will use the Wikipedia Movie Plots Dataset, which is available at this page and consists of ~35,000 movies. Let's see how to compute the score of a given model.

Suppose the test dataset is composed solely of the following plot: “School of magic, fight a Dark Wizard, quidditch matches”. Of course the movie we are looking for is one from the Harry Potter saga. Let’s assume we ran the query and got the following top 5 results:

  1. Harry Potter and the Philosopher’s Stone
  2. Harry Potter and the Prisoner of Azkaban
  3. Django Unchained
  4. Harry Potter and the Goblet of Fire
  5. Harry Potter and the Deathly Hallows — Part 2

The correct results are in the 1st, 2nd, 4th and 5th positions. In our case n is equal to 5, since we took the top 5 results, so the model score is computed as:

(5 - 1 + 1) + (5 - 2 + 1) + (5 - 4 + 1) + (5 - 5 + 1) = 5 + 4 + 2 + 1 = 12

Once the score for each model has been calculated, we will choose the hyperparameters corresponding to the highest score.
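To make the rule concrete, here is a minimal sketch of the scoring logic for a single query. The query_score helper is illustrative only and is not part of the evaluation functions defined later in the article.

import re

def query_score(ranked_titles, title_regex, top=5):
    # add n - p + 1 for every correct title among the top results
    # (position is 0-based, so the increment is top - position)
    pattern = re.compile(title_regex)
    score = 0
    for position, title in enumerate(ranked_titles[:top]):
        if pattern.match(title):
            score += top - position
    return score

results = ['Harry Potter and the Philosopher’s Stone',
           'Harry Potter and the Prisoner of Azkaban',
           'Django Unchained',
           'Harry Potter and the Goblet of Fire',
           'Harry Potter and the Deathly Hallows — Part 2']
print(query_score(results, '^Harry Potter(.*)$'))  # 5 + 4 + 2 + 1 = 12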

The tuning method in Python

Let’s see how to implement in Python the method described above.

First of all, we need to install and import all the packages we will use.

!pip install --upgrade gensim
import re
import gensim
import itertools
import pandas as pd
from gensim.models import TfidfModel
from gensim.similarities import MatrixSimilarity
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.parsing.preprocessing import preprocess_documents
from gensim.parsing.preprocessing import preprocess_string

We can now load the dataset and store the plots into the corpus variable. In order to avoid RAM saturation, we will only use movies with release year ≥ 2000. The corpus is also preprocessed in order to be fed to the models.

df = pd.read_csv('wiki_movie_plots_deduped.csv', sep=',')
df = df[df['Release Year'] >= 2000]
corpus = df['Plot'].values
processed_corpus = preprocess_documents(corpus)
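As a quick sanity check (purely illustrative; the exact numbers and tokens depend on the dataset version you download), we can inspect the size of the filtered corpus and the first few tokens of a plot:

print(len(processed_corpus))     # number of movies with Release Year >= 2000
print(processed_corpus[0][:10])  # first 10 tokens of the first preprocessed plot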

The dataset we will use to evaluate model performance is the following. Please note that we use regular expressions for the movie titles, since some of the plots could match more than one movie from a given saga.

moviesVal = [
    {'plot': 'infiltrate in minds and extract information through a shared dream world. different levels of dreams',
     'titleRegex': '^Inception$'},
    {'plot': 'In the arcade at night the videogame characters leave their games. The protagonist is a girl from a candy racing game who glitches',
     'titleRegex': '^Wreck-It Ralph$'},
    {'plot': 'Blonde bride goes to Japan and kills many people',
     'titleRegex': '^Kill Bill Volume (1|2)$'},
    {'plot': 'Jedi knight lightsaber starship',
     'titleRegex': '^Star Wars: (.*)$'},
    {'plot': "Boy studies ballet in secret. His father wants him to go to the gym and boxe. They raise money for audition in London",
     'titleRegex': "^Billy Elliot(.*)$"}
]

The parameter we will work on for the tf-idf model is smartirs. SMART (System for the Mechanical Analysis and Retrieval of Text) is an information retrieval system whose notation is used as a mnemonic scheme for denoting tf-idf weighting variants in the vector space model. The mnemonic for representing a combination of weights takes the form XYZ, where the letters describe how the document vector is weighted. The three letters correspond, respectively, to the following concepts.

  1. Term frequency weighting. Recall that the term frequency is the normalized count of terms in a given document. This value can be set to:
    • b - binary,
    • t or n - raw,
    • a - augmented,
    • l - logarithm,
    • d - double logarithm,
    • L - log average.
  2. Document frequency weighting. Recall that the document frequency is the number of documents in a corpus that contain a given term. This value can be set to:
    • x or n - none,
    • f - idf,
    • t - zero-corrected idf,
    • p - probabilistic idf.
  3. Document normalization. Each document is normalized so that all document vectors are turned into unit vectors. In doing so, we eliminate all information on the length of the original document; this masks some subtleties about longer documents. First, longer documents will — as a result of containing more terms — have higher term frequency values. Second, longer documents contain more distinct terms. The document normalization can be set to:
    • x or n - none,
    • c - cosine,
    • u - pivoted unique,
    • b - pivoted character length.

A quite exhaustive description can be found on the Wikipedia page.
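As an example, here is a minimal, self-contained sketch of how a single smartirs string, e.g. 'npu', is passed to gensim's TfidfModel (the same dictionary and bow_corpus are built inside the evaluation function below, so this snippet is illustrative only):

# 'npu' = raw term frequency (n), probabilistic idf (p), pivoted unique normalization (u)
dictionary = gensim.corpora.Dictionary(processed_corpus)
bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
tfidf_npu = TfidfModel(corpus=bow_corpus, dictionary=dictionary, smartirs='npu')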

Let’s now define a list with all the possible combinations for the smartirs parameter.

termfreq = ['b', 'n', 'a', 'l', 'd', 'L']
docfreq = ['n', 'f', 'p']
docnorm = ['n', 'c', 'u', 'b']
smartirsList = [''.join(comb) for comb in
                itertools.product(termfreq, docfreq, docnorm)]
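This gives 6 × 3 × 4 = 72 candidate weighting schemes. A quick check (output shown for illustration):

print(len(smartirsList))  # 72
print(smartirsList[:3])   # ['bnn', 'bnc', 'bnu']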

The evaluation function for the tf-idf model will then be the following.

def evaluation_tfidf(dataList, params, processed_corpus, top=10):
    # preprocess corpus
    dictionary = gensim.corpora.Dictionary(processed_corpus)
    bow_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
    # preprocess the test dataset
    for movie in dataList:
        tmp = preprocess_string(movie['plot'])
        movie['plot_bow'] = dictionary.doc2bow(tmp)
        movie['regex'] = re.compile(movie['titleRegex'])
    score = {}
    for param in params:
        index = None  # avoid RAM saturation
        score[param] = 0
        try:
            tfidf = TfidfModel(corpus=bow_corpus,
                               dictionary=dictionary,
                               smartirs=param)
            index = MatrixSimilarity(tfidf[bow_corpus])
            for movie in dataList:
                new_vec = movie['plot_bow']
                vec_bow_tfidf = tfidf[new_vec]
                sims = index[vec_bow_tfidf]
                topSims = sorted(enumerate(sims),
                                 key=lambda item: -item[1])[:top]
                for i in range(len(topSims)):
                    if movie['regex'].match(df['Title'].iloc[topSims[i][0]]):
                        score[param] = score[param] + (top - i)
        except Exception as error:
            print(f'Cannot evaluate model with smartirs={param} '
                  f'because of error: {error}')
            continue
    return score

The parameters for the doc2vec model that we will vary are the following.

  • dm: it defines the training algorithm. If dm=1, PV-DM is used. Otherwise, PV-DBOW is employed.
  • vector_size: dimensionality of the feature vectors.
  • window: the maximum distance between the current and predicted word within a sentence.
  • hs: if 1, hierarchical softmax will be used for model training; if set to 0, and negative is non-zero, negative sampling will be used.

dm = [1, 0]
vector_size = [10, 20, 50, 70, 100, 150, 200]
window = [1, 2, 3, 4, 5]
hs = [1, 0]
paramsList = [{'dm': item[0],
               'vector_size': item[1],
               'window': item[2],
               'hs': item[3]}
              for item in itertools.product(dm, vector_size, window, hs)]
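This grid contains 2 × 7 × 5 × 2 = 140 combinations, i.e. 140 doc2vec models to train. A quick check:

print(len(paramsList))  # 140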

The evaluation function for the doc2vec model will be the following.

def evaluation_doc2vec(dataList, params, processed_corpus, top=10):
    # preprocess corpus
    tagged_corpus = [TaggedDocument(d, [i]) for i, d in
                     enumerate(processed_corpus)]
    # preprocess the test dataset
    for movie in dataList:
        movie['plot_preproc'] = preprocess_string(movie['plot'])
        movie['regex'] = re.compile(movie['titleRegex'])
    scoreList = []
    for param in params:
        param['score'] = 0
        model = None  # avoid RAM saturation
        try:
            model = Doc2Vec(tagged_corpus,
                            dm=param['dm'],
                            vector_size=param['vector_size'],
                            window=param['window'],
                            min_count=1,
                            epochs=10,
                            hs=param['hs'])
            for movie in dataList:
                new_doc = movie['plot_preproc']
                test_doc_vector = model.infer_vector(new_doc)
                sims = model.docvecs.most_similar(positive=[test_doc_vector],
                                                  topn=top)
                topSims = sims[:top]
                for i in range(len(topSims)):
                    if movie['regex'].match(df['Title'].iloc[topSims[i][0]]):
                        param['score'] = param['score'] + (top - i)
            scoreList.append(param)
        except Exception as error:
            print(f'Cannot evaluate model with parameters {param} '
                  f'because of error: {error}')
            continue
    return scoreList

Model evaluation

Let’s now evaluate the tf-idf model and print the parameters and their score.

score_tfidf = evaluation_tfidf(moviesVal,
                               smartirsList,
                               processed_corpus)
# sort the scores and print them
score_tfidf = {k: v for k, v in sorted(score_tfidf.items(),
                                       key=lambda item: -item[1])}
for s in score_tfidf:
    print(s, score_tfidf[s])

The results are shown in the following table.

As you can see, the model with the 'npu' smartirs that we used in our first article (BOW + TF-IDF in Python for unsupervised learning task) is the third best model. The highest scores are very close to each other, which means the model we chose is among the top scorers. The more examples we use, the more significant the results are: if we used only one movie to evaluate the models, the results would be misleading. Let's try it.

newMovie = [
    {'plot': "Boy studies ballet in secret. His father wants him to go to the gym and boxe. They raise money for audition in London",
     'titleRegex': "^Billy Elliot(.*)$"}
]
newMovieScore = evaluation_tfidf(newMovie,
                                 smartirsList,
                                 processed_corpus)
newsortedScore = {k: v for k, v in sorted(newMovieScore.items(),
                                          key=lambda item: -item[1])}
for s in newsortedScore:
    print(s, newsortedScore[s])

The results show that using only one movie is not a good way to evaluate model performance: the larger the test dataset, the more accurate the evaluation of the models.

Let’s now evaluate the doc2vec performance.

score_doc2vec = evaluation_doc2vec(moviesVal,
                                   paramsList,
                                   processed_corpus)
score_doc2vec = pd.DataFrame(score_doc2vec)
score_doc2vec = score_doc2vec.sort_values(by=['score'],
                                          ascending=False)
print(score_doc2vec.head(15))

These scores show that the best parameter values are:

  • dm = 0,
  • vector_size between 70 and 100,
  • window ≥ 3,
  • hs = 1.

In order to get more accurate values, we can repeat the hyperparameter tuning with dm=0 and hs=1 fixed, varying vector_size and window over a smaller interval than before. Please also note that, for time reasons, we trained the models for only 10 epochs; increasing this value to 50 or 100 will return more reliable results.
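A minimal sketch of such a second round is shown below; the ranges are assumptions for illustration and should be adapted to your own first-round scores.

dm = [0]
vector_size = [70, 80, 90, 100]
window = [3, 4, 5]
hs = [1]
refinedParams = [{'dm': d, 'vector_size': v, 'window': w, 'hs': h}
                 for d, v, w, h in itertools.product(dm, vector_size, window, hs)]
refined_scores = evaluation_doc2vec(moviesVal, refinedParams, processed_corpus)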

Conclusions

Performing the method several times, each time narrowing the interval of parameter values, will also return more accurate results than performing it only once.

Feel free to modify the code in order to improve the method accuracy and retrieve more reliable results.
