Movie Tag Prediction

Hemanth Yernagula · Published in Analytics Vidhya · Feb 15, 2020

— — Romantic — — Horror — — Fiction — —

Well, is there any way to get the genre of a movie if only its name is given?

No, but if you provide a summary of the movie, we can build a machine learning model to predict its genre. Oh really? Then how do we do that?

The summary of a movie is valuable for building machine learning models that predict the movie's tags, which in turn help recommender systems retrieve similar movies.

Okay, let me tell you. You should be clear about a few concepts before learning how to build a movie tag prediction model. This article assumes that the reader is comfortable with the topics below before moving further.

  1. NLP featurization techniques.
  2. Machine learning algorithms like logistic regression.
  3. Python.

Contents

  1. Data Overview
  2. Cleaning & Preprocessing Data
  3. Featurizing Data
  4. Applying Models

Data Overview: The data used to build this model is taken from Kaggle and looks like what is shown in Figure 1 (a small loading sketch follows the column descriptions below).

Figure 1

Column Explanation:

  1. imdb_id: IMDB movie ID.
  2. title: Movie name.
  3. plot_synopsis: Summary of the movie.
  4. tags: Tags of the movie.
  5. split: Indicates whether the row belongs to the train, test, or validation set.
  6. synopsis_source: Source from which the movie's summary was collected.
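
A minimal loading sketch with pandas; the file name mpst_full_data.csv is an assumption about how the Kaggle download is named, so adjust it to your local copy.

import pandas as pd

# File name is an assumption about the Kaggle download; adjust as needed
data = pd.read_csv("mpst_full_data.csv")

print(data.shape)
print(data.columns.tolist())
# Expected columns: imdb_id, title, plot_synopsis, tags, split, synopsis_source
print(data.head())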

As you can see, the target column is ‘tags’, and each row can hold several values (tags), which means this is a multilabel problem. Multilabel means each data point can carry a set of labels rather than a single one; for example, one movie may have many tags such as violence, cult, gothic, cruelty, sadist, feel-good, revenge, inspiring, romantic, stupid.

The performance metric I have chosen for this problem is the micro-averaged F1-score.

Micro-Averaged F1-Score (Mean F Score): The F1 score can be interpreted as a weighted average of precision and recall, where an F1 score reaches its best value at 1 and its worst at 0. The relative contributions of precision and recall to the F1 score are equal. The formula for the F1 score is:

F1 = 2 * (precision * recall) / (precision + recall)

In the multi-class and multi-label case, the micro-average computes the F1 score globally, by counting the total true positives, false negatives and false positives across all classes.
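
A small sketch of how this metric can be computed with scikit-learn on toy multilabel data (the label matrices here are made up for illustration only):

from sklearn.metrics import f1_score
import numpy as np

# Toy multilabel indicator matrices: rows = movies, columns = tags
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 1],
                   [1, 0, 0]])

# Micro-averaging pools true/false positives and false negatives over all tags
print(f1_score(y_true, y_pred, average='micro'))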

Cleaning & Preprocessing Data: In total there are 70 unique tags, and each movie may have one or more of them. The distribution of the tags is shown below.

Figure 2

There are a few tags like “sci-fi” that the cleaning step would split into two separate tags, so we replace “sci-fi” with “science_fiction”. Also, the tag for the first histogram bar is missing; did something go wrong? The problem is that even a <space> was counted as a tag, so let us remove the extra spaces from the data (a small sketch of this cleanup follows).
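
A minimal cleanup sketch, assuming the tags column holds comma-separated strings (the exact separator is an assumption based on the dataset screenshots):

def clean_tags(tag_string):
    # "sci-fi" would otherwise be broken into two tags, so rename it first
    tag_string = str(tag_string).replace("sci-fi", "science_fiction")
    # Split on commas, strip surrounding whitespace, and drop empty entries
    # (an empty entry is exactly what showed up as the "<space>" tag)
    tags = [t.strip() for t in tag_string.split(",")]
    return ", ".join(t for t in tags if t)

data["tags"] = data["tags"].apply(clean_tags)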

After making the above changes, the distribution of the tags is as follows.

Figure 3

Key takeaways from the plot:

Most of the tags appear fewer than 1,000 times.

Only eight tags appear more than 1,000 times.

Murder is the tag that appears most often, almost 6,000 times.

Let's look at word cloud for the tags

Figure 4

Key takeaways from the plot:

The most frequent tags are murder and violence.

The next most important tags are flashback, romantic, revenge, cult and comedy.

I observed that there are some special characters in the data, such as characters from other languages. I translated these into English using the Google Translate API.

Finding the special characters

Figure 5

Replacing those special characters with their English translations

Figure 6

Some of the replaced words

Figure 7
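
A hedged sketch of the find-and-translate step shown in Figures 5 to 7, continuing from the loading sketch above. The googletrans package is only one possible client for the translation API, so treat that part as an assumption:

import re

NON_ASCII = re.compile(r"[^\x00-\x7F]")

# Rows whose synopsis contains non-ASCII characters (text in other scripts)
mask = data["plot_synopsis"].apply(lambda s: bool(NON_ASCII.search(str(s))))
print(data.loc[mask, ["title", "plot_synopsis"]].head())

# Translation step (uncomment if googletrans is installed):
# from googletrans import Translator
# translator = Translator()
# def to_english(text):
#     if NON_ASCII.search(text):
#         return translator.translate(text, dest="en").text
#     return text
# data["plot_synopsis"] = data["plot_synopsis"].apply(to_english)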

We are done with cleaning the data; featurizing and applying models are what is left 🙌🏻

The data is split based on the split column in the dataset.
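
A short sketch of that split, assuming the column uses the values ‘train’, ‘val’ and ‘test’:

# Split according to the dataset's own split column
train = data[data["split"] == "train"]
val = data[data["split"] == "val"]
test = data[data["split"] == "test"]

print(train.shape, val.shape, test.shape)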

Figure 8

Featurization Techniques:

  1. Featurizing Text
  2. Featurizing Target Labels

1. Featurizing Text:

The featurization techniques used for the text are listed below (a scikit-learn sketch follows the list):

  1. TFIDF unigrams
  2. TFIDF bigrams
  3. TFIDF trigrams
  4. TFIDF char 3-grams
  5. TFIDF char 4-grams
  6. TFIDF unigrams + bigrams + trigrams
  7. TFIDF char 3-grams + char 4-grams
  8. TFIDF unigrams + bigrams + trigrams + char 3-grams + char 4-grams
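
A minimal sketch of two of these settings (word 1–3 grams and char 3–4 grams) and their combination, the last item in the list. Parameters such as min_df are assumptions, not the values used in the original experiments:

from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack

# Word-level unigrams, bigrams and trigrams in a single vectorizer
word_tfidf = TfidfVectorizer(ngram_range=(1, 3), min_df=5)
X_train_word = word_tfidf.fit_transform(train["plot_synopsis"])
X_test_word = word_tfidf.transform(test["plot_synopsis"])

# Character-level 3- and 4-grams
char_tfidf = TfidfVectorizer(analyzer="char", ngram_range=(3, 4), min_df=5)
X_train_char = char_tfidf.fit_transform(train["plot_synopsis"])
X_test_char = char_tfidf.transform(test["plot_synopsis"])

# Uni+Bi+Tri+Char3+Char4: stack word and character features side by side
X_train = hstack([X_train_word, X_train_char]).tocsr()
X_test = hstack([X_test_word, X_test_char]).tocsr()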

2. Featurizing Target Labels:

As we know, there are 71 class labels (tags) in total and each movie can receive any number of them. A Bag of Words technique is applied to these tags, and each data point's target label is represented by this bag-of-words vector, as in the sketch below.
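
A minimal sketch of that target featurization with a binary CountVectorizer, assuming the cleaned tags are still comma-separated strings:

from sklearn.feature_extraction.text import CountVectorizer

# Binary bag of words over the tag strings; each column stands for one tag
tag_vectorizer = CountVectorizer(
    tokenizer=lambda text: [t.strip() for t in text.split(",") if t.strip()],
    binary=True,
    lowercase=False,
)
y_train = tag_vectorizer.fit_transform(train["tags"]).toarray()
y_test = tag_vectorizer.transform(test["tags"]).toarray()

print(y_train.shape)  # (number of training movies, number of unique tags)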

After applying models to the above features, the observations below were made; a sketch of the model setup follows this paragraph.
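
A hedged sketch of a one-vs-rest setup for these experiments. The exact hyperparameter handling is an assumption; in particular, mapping the reported alpha = 0.001 for plain logistic regression to C = 1/alpha is my guess, not something stated in the original experiments:

from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import f1_score

# One-vs-rest logistic regression: one binary classifier per tag
# C = 1000 assumes the reported alpha = 0.001 maps to C = 1/alpha (an assumption)
logreg = OneVsRestClassifier(LogisticRegression(C=1000, max_iter=1000))
logreg.fit(X_train, y_train)
print("LogReg micro-F1:", f1_score(y_test, logreg.predict(X_test), average="micro"))

# SGD-trained logistic regression with alpha = 0.1, as reported below
# (use loss="log" on older scikit-learn versions)
sgd = OneVsRestClassifier(SGDClassifier(loss="log_loss", alpha=0.1, penalty="l2"))
sgd.fit(X_train, y_train)
print("SGD micro-F1:", f1_score(y_test, sgd.predict(X_test), average="micro"))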

Conclusion

1. The maximum F1-score obtained is 37.805% ≈ 38%.
2. Almost all the vectorizations reach an F1-score above 30%, except the trigram vector.
3. Plain logistic regression gets the highest F1-score compared with SGD logistic regression.
4. The best alpha value is 0.001 for logistic regression and 0.1 for SGD logistic regression on the Uni+Bi+Tri+Char3+Char4 vector.
