Multi-Label Text Classification with StarSpace in R

Benjamin · Analytics Vidhya · Feb 20, 2020

In this post, we will predict movie genres based on the text describing the plot of the movie.

The data comes from the Wikipedia page of each movie. We use the ruimtehol R package, which implements methods developed by the Facebook AI Research team (StarSpace, TagSpace). You can find my code for this project here.

When and why use TagSpace? In this case, we have text data (film plots) and an abundance of labels (genres) that have not exactly been applied consistently.

A look at TagSpace ‘under the hood’

TagSpace was designed to use the free text of Facebook posts to predict the hashtags associated with those posts. We can extend their use case to ours: we have text in the plot descriptions, and the genres are a lot like hashtags.

StarSpace is similar to TagSpace, but is more generalised. Whereas TagSpace allows us to create document embeddings from word embeddings, StarSpace allows us to create embeddings for anything. For example, a website might have an embedding generated from the text on the website and a user might have an embedding generated from the websites they visit.

Let’s get started!

You can download the data manually or use the following code.
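A sketch of that download, assuming the data is the CMU Movie Summary Corpus (Wikipedia plot summaries with Freebase genre labels); the URL and file layout below are assumptions:

```r
# Download and unpack the corpus (assumed location and layout).
url <- "http://www.cs.cmu.edu/~ark/personas/data/MovieSummaries.tar.gz"
download.file(url, destfile = "MovieSummaries.tar.gz")
untar("MovieSummaries.tar.gz")

# plot_summaries.txt holds one movie per row: Wikipedia id, then plot text.
plots <- read.delim("MovieSummaries/plot_summaries.txt",
                    header = FALSE, quote = "",
                    col.names = c("movie_id", "plot"))
```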

Load the necessary libraries. The ruimtehol package provides the framework for us to use the StarSpace algorithms in R. It was created by Jan Wijffels who writes that ruimtehol is the translation of ‘star space’ into West Flemish.
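A minimal library chunk might look like this (packages other than ruimtehol are assumptions about a typical text workflow):

```r
library(ruimtehol)   # StarSpace / TagSpace models in R
library(dplyr)       # assumed here for joining and wrangling the data
library(tm)          # assumed here for stop words
library(SnowballC)   # assumed here for Porter stemming
```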

There are a number of preprocessing steps before we can get to modelling. We need to join the plot description data to the genre data and clean them both up. Here is the kind of text cleaning function used. For more details on cleaning and joining the data, or other parts of the project, see my GitHub repo here.
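A minimal sketch, assuming the tm and SnowballC packages (not the author's exact code):

```r
# Lowercase, replace punctuation and digits with spaces, drop English
# stop words, and Porter-stem each remaining token.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z ]", " ", x)
  vapply(strsplit(x, "\\s+"), function(words) {
    words <- words[nchar(words) > 0 & !words %in% tm::stopwords("en")]
    paste(SnowballC::wordStem(words, language = "english"), collapse = " ")
  }, character(1))
}
```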

We use some popular text processing techniques like stemming and removing “stop words” (common words like “the”, “a” and “to”). There are other common techniques such as lemmatisation that could be used instead of stemming.

The film “Back to the Future” has the labelled genres science_fiction, adventure, comedy and family_film. The first line of the plot is “Seventeen-year-old Marty McFly lives with his bleak, unambitious family in Hill Valley, California.” After processing, it looks like this: “seventeen marti mcfly live bleak unambiti famili hill vallei california”.

A little bit of exploratory analysis never hurts, and often helps.

Genres

We can see the most popular genre labels are drama, comedy and romantic_film. Among the least popular are a misspelling of comedy and the genre romantic_thriller. If you’re wondering, the one film labelled romantic_thriller is a Bollywood film called “Bloody Isshq”. Because some genres appear so few times, in this experiment we will only predict the top 50 genres.
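Trimming to the top 50 genres takes only a few lines (a sketch; movie_genres and its columns are hypothetical names):

```r
# Count label frequency and keep only the 50 most common genres.
# movie_genres is assumed to have one row per (movie, genre) pair.
library(dplyr)

genre_counts <- count(movie_genres, genre, sort = TRUE)
top_50 <- head(genre_counts$genre, 50)

movie_genres <- filter(movie_genres, genre %in% top_50)
```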

The labels aren’t perfect. Consider genres related to romance.

The film “The Princess Bride” has the labels comedy_of_manners, comedydrama, drama, comedy, romance_film, family_film and teen. It is labelled romance_film and drama, but not romantic_drama. The TagSpace algorithm is especially good at dealing with this sort of fuzzy classification problem.

Words in Plot Description (after removing stop words)

“kill” appears to be the most common word in movie plot descriptions. Infer what you will about Human Nature and/or Movie Culture.

There are many words that only appear a few times in the entire dataset. We will need to take this into consideration when training the model.

Modelling

As we want to classify text, we will use the embed_tagspace function. Let’s go through the parameters; a sketch of the full call follows the list.

  • x is the cleaned text the model will use as features. It should be a character vector where each element is one document.
  • y is the labels the model will try to predict. It should be a list where each element is a vector of that document’s genres.
  • dim is the dimension of the embeddings generated for the text. Each plot description will be represented as a vector of length 20.
  • epoch is the maximum number of epochs to train the model for. Internally, the embed_tagspace function creates a training set and a development set; when performance on the development set doesn’t improve for 10 epochs, training stops. Our model usually stopped after around 15 epochs.
  • lr is the learning rate for the neural network.
  • loss is the loss function. Another option instead of ‘softmax’ is ‘hinge’.
  • negSearchLimit is the number of negative labels sampled.
  • ws is the window size over which the model looks at the text.
  • minCount is the minimum number of times a word must occur in the entire corpus for the model to create an embedding for it and use it in prediction. Words that occur fewer times than this are invisible to the model.
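Putting the parameters together, the call might look like the sketch below; the hyper-parameter values and column names are assumptions consistent with the descriptions above, not the author’s exact settings:

```r
model <- embed_tagspace(
  x = train$clean_plot,       # character vector of cleaned plot text
  y = train$genres,           # list: one vector of genres per movie
  dim = 20,                   # 20-dimensional embeddings, as described above
  epoch = 40,                 # upper bound; early stopping usually ends sooner
  lr = 0.01,
  loss = "softmax",           # or "hinge"
  negSearchLimit = 10,
  ws = 5,
  minCount = 5,
  early_stopping = 0.8,       # share of data used for training vs. validation
  validationPatience = 10     # stop after 10 epochs with no improvement
)
```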

Evaluating the model

The output of the model is a similarity score between the embedding of the text and the embedding of each genre label. The score lies between -1 and 1, where a higher score means more similarity. There are many ways to measure the accuracy of the model. Here we take the model output, filter it to only the genres that movie was originally labelled with, and average the similarity scores. This way we aren’t penalising the model for the imperfect labelling. Another option would be to look at precision and recall.
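A sketch of that measurement; the shape of the prediction object is an assumption, so check str(predict(model, ...)) for the column names in your version of ruimtehol:

```r
# Mean similarity over a movie's true genres (prediction column
# names are assumptions, not confirmed from the original post).
score_movie <- function(model, plot_text, true_genres, k = 50) {
  pred <- predict(model, plot_text, k = k)[[1]]$prediction
  mean(pred$similarity[pred$label %in% true_genres])
}

# Hypothetical usage on one test movie:
# score_movie(model, test$clean_plot[1], test$genres[[1]])
```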

No overfitting for us.

Let’s have a look at where our model does well and not so well. The clean_genres column holds the true genres; the top_5_genres column holds the top five predicted genres for each plot.
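Such a table can be built by stacking the top-5 predictions for each plot (a sketch with hypothetical column names):

```r
# Collapse the top 5 predicted genres per plot into one column.
test$top_5_genres <- vapply(test$clean_plot, function(txt) {
  pred <- predict(model, txt, k = 5)[[1]]$prediction
  paste(pred$label, collapse = ", ")
}, character(1))
```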

Next Steps

To improve accuracy, it might be worth trying different hyper-parameters for the model: increasing the embedding size, decreasing minCount, trying hinge loss instead of softmax, increasing the window size (ws), or trying different values for negSearchLimit.
