Training a Doc2Vec Model for Document Classification

Alperen Çakın
May 19, 2019


Introduction

Word embeddings are a relatively recent way of representing words in a low-dimensional space. They provide vector representations of words that capture their semantics and syntax.

In this story, the use of doc2vec vectors together with logistic regression to classify documents is discussed. This story is derived from a report for an NLP assignment given by Necva Bölücü.

Reading the Input File

The given dataset of movie plots is read using the built-in csv module. After omitting the first line, which describes the columns, the genres of the first 2000 movies are saved as the training set and the remaining genres are saved as the test set for classifying plots. All of the plots are kept in order to create document vectors later.
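A minimal sketch of this reading step; the file name and the column order are assumptions, not taken from the original assignment:

```python
import csv

TRAIN_SIZE = 2000  # the first 2000 movies form the training set

def read_dataset(path):
    """Read genres and plots from the CSV file, skipping the header row.

    The column order (genre first, plot second) is an assumption;
    adjust the indices to match the actual dataset.
    """
    genres, plots = [], []
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        next(reader)  # omit the first line that describes the columns
        for row in reader:
            genres.append(row[0])
            plots.append(row[1])
    return genres, plots
```

The train/test split then falls out of plain slicing: `genres[:TRAIN_SIZE]` for training and `genres[TRAIN_SIZE:]` for testing.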

Cleaning up the Movie Plots

All of the movie plots are cleaned up before training a model to create document vectors, using the following steps:

  1. Plots are tokenized using the word_tokenize() function of the nltk module.
  2. Stopwords are removed from each plot using a slightly customized stopword list. (See the references section for details.)
  3. Punctuation and numbers are removed from the plots.

After these steps, the plots are converted into TaggedDocuments to be used when creating a model.

Training a Model

A Doc2Vec model is instantiated using the following parameters:

  • Vector size: 300
  • Epochs: 50
  • Window size: 5
  • Minimum Frequency: 5
  • Training Algorithm: Distributed bag of words

Each of those values is determined by trial and error to obtain a higher accuracy score.

After instantiation, the model is trained on the tagged documents created from the movie plots (using the .build_vocab() and .train() methods of the model) and saved to a file (using the .save() method of the model) for future use.

Logistic Regression Classifier

A logistic regression classifier (from the sklearn module) is instantiated using the following parameters:

  • solver: lbfgs
  • multi_class: auto
  • max_iter: 1000
  • tol: 0.5

Except for the tol parameter, those parameters are changed from their defaults because of warnings raised by the classifier. Each of those values is determined by trial and error to obtain a higher accuracy score.

The logistic regression classifier is trained (using the .fit() method of the classifier) on the movie plot vectors obtained from the previously trained model and the correct genres of those plots.
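A sketch of the classifier setup and training; `multi_class="auto"` is left out of the call because it is the default behaviour and the parameter is deprecated in recent scikit-learn versions:

```python
from sklearn.linear_model import LogisticRegression

def train_classifier(train_vectors, train_genres):
    """Fit a logistic regression classifier on the document vectors."""
    clf = LogisticRegression(
        solver="lbfgs",
        max_iter=1000,
        tol=0.5,  # loose tolerance, tuned by trial and error in the post
    )
    clf.fit(train_vectors, train_genres)
    return clf
```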

Calculating the Accuracy

After the training operation, the genres of the test vectors are predicted (using the .predict() method of the classifier) and the accuracy of the model is calculated according to the following formula:

accuracy = (number of correct predictions / number of test documents) × 100
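The accuracy computation can be sketched as:

```python
def accuracy(predicted, actual):
    """Percentage of test documents whose genre was predicted correctly."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual) * 100
```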

Results

Using the formula mentioned in the above section, 7 of the created models have been sampled and evaluated. As can be seen from the above plot, their accuracy values vary. This is because Doc2Vec uses randomization while creating document vectors. To give a more reliable result, the mean of the obtained accuracy values has been calculated, as shown in the figure below.

Apart from those samples, a maximum accuracy of 54.46% was obtained by creating the model multiple times.

Source Code

Source code for the whole assignment, including this story and the other one, can be found on GitHub:

References

Dataset:

Doc2Vec documentation:

Logistic regression documentation:

Logistic regression demonstration:

List of English stopwords:

Cleaning up text tutorial:
