Training a Doc2Vec Model for Document Classification

Alperen Çakın
May 19, 2019


Introduction

Word embeddings are a relatively recent way of representing words in a low-dimensional space. They provide vector representations of words that capture their semantics and syntax.

In this story, the use of doc2vec vectors together with logistic regression to classify documents is discussed. This story is derived from a report for an NLP assignment given by Necva Bölücü.

Reading the Input File

The given dataset of movie plots is read using the built-in csv module. After omitting the first line, which describes the columns, the genres of the first 2000 movies are saved as the training set and the remaining genres are saved as the test set for classifying plots. All of the plots are kept in order to create document vectors later.
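A minimal sketch of this reading step; the file name and the column order are assumptions, not taken from the original assignment:

```python
import csv

TRAIN_SIZE = 2000  # the first 2000 movies form the training set

def read_dataset(path):
    """Read genres and plots from the CSV file, skipping the header row.

    The column order (genre first, plot second) is an assumption;
    adjust the indices to match the actual dataset.
    """
    genres, plots = [], []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader)  # omit the first line that describes the columns
        for row in reader:
            genres.append(row[0])
            plots.append(row[1])
    return genres, plots
```

The train/test split then falls out of plain slicing: `genres[:TRAIN_SIZE]` for training and `genres[TRAIN_SIZE:]` for testing.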

Cleaning up the Movie Plots

All of the movie plots are cleaned up before training a model to create document vectors, using the following steps:

  1. Plots are tokenized using the word_tokenize() function of the nltk module.
  2. Stopwords are removed from each plot using a slightly customized stopword list. (See the references section for details.)
  3. Punctuation and numbers are removed from the plots.

After these steps, the plots are converted into TaggedDocuments to be used when creating a model.

Training a Model

A Doc2Vec model is instantiated using the following parameters:

  • Vector size: 300
  • Epochs: 50
  • Window size: 5
  • Minimum Frequency: 5
  • Training Algorithm: Distributed bag of words

Each of those values is determined by trial and error to obtain a higher accuracy score.

After instantiation, the model is trained on the tagged documents created from the movie plots (using the .build_vocab() and .train() methods of the model) and saved to a file (using the .save() method of the model) for future use.

Logistic Regression Classifier

A logistic regression classifier (from the sklearn module) is instantiated using the following parameters:

  • solver: lbfgs
  • multi_class: auto
  • max_iter: 1000
  • tol: 0.5

Except for the tol parameter, those parameters are changed from their defaults because of warnings raised by the classifier. Each of those values is determined by trial and error to obtain a higher accuracy score.

The logistic regression classifier is trained (using the .fit() method of the classifier) on the movie plot vectors obtained from the previously trained model and the correct genres of those plots.
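A sketch of the classifier setup and training; `multi_class="auto"` is left out of the call because it is the default behaviour and the parameter is deprecated in recent scikit-learn versions:

```python
from sklearn.linear_model import LogisticRegression

def train_classifier(train_vectors, train_genres):
    """Fit a logistic regression classifier on the document vectors."""
    clf = LogisticRegression(
        solver="lbfgs",
        max_iter=1000,
        tol=0.5,  # loose tolerance, tuned by trial and error in the post
    )
    clf.fit(train_vectors, train_genres)
    return clf
```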

Calculating the Accuracy

After the training operation, the genres of the test vectors are predicted (using the .predict() method of the classifier) and the accuracy of the model is calculated according to the following formula:

accuracy = (number of correct predictions / number of test documents) × 100
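The accuracy computation can be sketched as:

```python
def accuracy(predicted, actual):
    """Percentage of test documents whose genre was predicted correctly."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual) * 100
```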

Results

Using the formula mentioned in the above section, 7 of the created models have been sampled and evaluated. As can be seen from the above plot, their accuracy values vary. This is because Doc2Vec uses randomization while creating document vectors. To give a more reliable result, the mean of the obtained accuracy values has been calculated, as shown in the figure below.

Apart from those samples, a maximum accuracy of 54.46% was obtained by creating the model multiple times.

Source Code

Source code for the whole assignment, including this story and the other one, can be found on GitHub:

References

Dataset:

Doc2Vec documentation:

Logistic regression documentation:

Logistic regression demonstration:

List of English stopwords:

Cleaning up text tutorial:
