MPST: Movie Plot Synopses with Tags

Ashishchoudhary
10 min read · Aug 26, 2022


Overview

Social tagging of movies reveals a wide range of heterogeneous information about them, such as genre, plot structure, soundtrack, metadata, and visual and emotional experience. Such information is valuable for building systems that tag movies automatically. Automatic tagging systems can help recommendation engines improve the retrieval of similar movies, and help viewers know what to expect from a movie in advance.

PROBLEM STATEMENT :

What is the problem?

We need a program that assigns appropriate tags (genres) to a movie, given its ‘Movie title’, ‘Plot synopsis’ and ‘Synopsis source’. These tags help viewers form an intuition about the movie in advance.

TASK : Predict tags (multiple labels) given narrative elements such as the plot synopsis.

EXPERIENCE : A training corpus in which each example is annotated with multiple labels.

PERFORMANCE : ‘Micro F1 score’ and ‘accuracy’ will be used as performance metrics. The F1 score conveys results better than accuracy for imbalanced classification tasks.

Solution benefits :

Genre is important for audiences because it lets them know what kind of film they are going to see and what they can expect from it. For example, if the genre is horror, the audience knows they are going to see a scary film and should expect to feel scared.

REAL WORLD / BUSINESS OBJECTIVES AND CONSTRAINTS :

  • Predict as many tags as possible with high precision and recall.
  • Incorrect tags could impact customer experience.
  • No strict latency constraints.

Data source:

The data is downloaded from a publicly available source, Kaggle.

Source link :

Multilabel v/s Multiclass classification :

In multi-label classification, each instance in the training set is associated with a set of labels instead of a single label, and the task is to predict the label sets of unseen instances.
There is a difference between multi-class classification and multi-label classification. In a multi-class problem the classes or labels are mutually exclusive, i.e. it is assumed that each instance can be assigned to only one label. E.g. an animal can be either a dog or a cat, but not both. In a multi-label problem, multiple labels may be assigned to one instance. E.g. a movie can belong to the comedy genre as well as the detective genre.
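The label-set idea can be made concrete with scikit-learn’s MultiLabelBinarizer, which turns tag sets into the binary indicator matrix a multi-label model trains on (the movies and tags below are illustrative, not from the dataset):

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each movie carries a *set* of tags, not a single class.
tag_sets = [
    {"comedy", "romantic"},
    {"murder", "revenge", "flashback"},
    {"cult"},
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(tag_sets)  # binary indicator matrix, one column per tag

print(mlb.classes_)  # all distinct tags, sorted
print(Y.shape)       # (3 movies, 6 distinct tags)
```

Each row of `Y` can contain several 1s at once, which is exactly what separates multi-label from multi-class encoding.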

Dataset and its understanding :

Exploratory data analysis (EDA)

Categorical Features :

Distribution in train, test and validation


The distribution of data is as follows :

  1. train : 64%
  2. test : 20%
  3. cv : 16%
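The 64/20/16 split can be reproduced with two chained calls to scikit-learn’s train_test_split (a sketch on dummy records; the random_state is arbitrary). Carving 20% off for test and then 20% of the remainder for validation yields 0.8 × 0.8 = 64% train, 0.8 × 0.2 = 16% cv, and 20% test:

```python
from sklearn.model_selection import train_test_split

data = list(range(1000))  # stand-in for the movie records

# First carve off 20% for test, then 20% of the remainder for validation.
train_val, test = train_test_split(data, test_size=0.20, random_state=42)
train, cv = train_test_split(train_val, test_size=0.20, random_state=42)

print(len(train), len(cv), len(test))  # 640 160 200
```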

Text features :

  1. Tags
  2. Title
  3. Plot synopsis

Let’s start by looking at the common words present in each of the text features. For this, I will use the document-term matrix created earlier and plot these words as word clouds. A word cloud is a visual representation of the word frequencies in a document: more frequent words are drawn larger than less frequent ones.
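Before drawing word clouds, the same frequency information can be inspected with a plain Counter (a tiny sketch on made-up text):

```python
from collections import Counter
import re

synopsis = ("The killer returns to the scene of the murder. "
            "The detective suspects revenge is the motive.")

# Lowercase and tokenize, then count occurrences of each word.
words = re.findall(r"[a-z']+", synopsis.lower())
freq = Counter(words)

print(freq.most_common(3))  # 'the' dominates, as the post observes below
```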

Some of the most frequent tags are :

* Murder, violence, flashback, romantic, cult, revenge

Here we can clearly observe that most of the frequent words in titles, like ‘The’, ‘of’, ‘and’, ‘in’, ‘to’, etc., are stop words.

The word ‘The’ appears an exceptionally large number of times.

Likewise, the most frequent words in plot_synopsis are ‘The’, ‘of’, ‘and’, ‘in’, ‘to’, etc.

Distribution of tag frequency :

The distribution of tag frequency seems to follow ‘Pareto Distribution’.

  • Around 80% of outcomes result from 20% of all causes.
  • The most and least frequent tags may not be that useful for prediction.
  • Preprocessing is needed to isolate the most useful words.

Box plot for tag frequency :

A boxplot is a standardized way of displaying the distribution of data based on a five number summary (“minimum”, first quartile (Q1), median, third quartile (Q3), and “maximum”).

Here we can observe that the points are distributed in the region up to frequency ≈ 2000.

Preprocessing :

Finding duplicate rows :

To find duplicates on a specific column, we can simply call the duplicated() method on that column. Similarly, duplicated entries can be dropped with drop_duplicates().
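A minimal pandas sketch of both calls (the toy DataFrame is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "title": ["Psycho", "Psycho", "Alien"],
    "plot_synopsis": ["A motel owner", "A motel owner", "A ship crew"],
})

# Boolean mask of repeated rows (the first occurrence is not marked):
mask = df.duplicated(subset="title")

# Keep only the first occurrence of each title:
deduped = df.drop_duplicates(subset="title")

print(mask.tolist())  # [False, True, False]
print(len(deduped))   # 2
```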

Preprocessing on plot synopsis :

Generally, text data contains a lot of noise either in the form of symbols or in the form of punctuations and stop words. Therefore, it becomes necessary to clean the text, not just for making it more understandable but also for getting better insights.

  • Removing titles like Dr., Mr., Mrs., Miss, Master, etc.
  • Removing stop words.
  • Removing Special Characters.
  • Stem all the words.
  • Encoding all persons names as ‘person’.
  • Performing sentiment analysis.
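A minimal regex-based sketch of the cleaning steps above (the stop-word list is tiny and illustrative, and stemming, person-name encoding, and sentiment analysis are left out):

```python
import re

STOPWORDS = {"the", "of", "and", "in", "to", "a", "is"}  # tiny illustrative set
# Mrs must precede Mr in the alternation so "Mrs." is matched whole.
TITLES = r"\b(?:Dr|Mrs|Mr|Miss|Master)\.?\s*"

def clean_synopsis(text):
    text = re.sub(TITLES, "", text)           # remove honorific titles
    text = re.sub(r"[^a-zA-Z\s]", " ", text)  # remove special characters
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    return " ".join(words)

print(clean_synopsis("Dr. Lecter escapes in the dead of night!"))
# "lecter escapes dead night"
```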

The dataset after preprocessing looks like :

Vectorization :

Machines cannot understand characters and words, so when dealing with text data we need to represent it as numbers. Count vectorization is one method to convert text to numerical data.

Since this is multi-label classification, the output labels need to be one-hot (binary) encoded. We have used the Bag-of-Words technique via scikit-learn for this.
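A Bag-of-Words sketch using scikit-learn’s CountVectorizer (the two toy plots are made up; note that the default tokenizer already drops single-character words like “a”):

```python
from sklearn.feature_extraction.text import CountVectorizer

plots = [
    "a detective investigates a murder",
    "a murder sparks a revenge plot",
]

vec = CountVectorizer()
X = vec.fit_transform(plots)  # sparse document-term count matrix

print(sorted(vec.vocabulary_))  # the learned vocabulary
print(X.toarray())              # one row of counts per document
```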

Taking the tag vector of one example as an illustration :

Encoding :

Encoding maps each word to an integer index so that the sequences can be fed to the first hidden layer of the network. To make all rows equal in length, padding is used. Encoding is done for both train and test data.
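Padding can be done with e.g. Keras’s pad_sequences; a plain-Python equivalent of its default ‘pre’ padding and truncation behaviour looks like this (a sketch, not the post’s actual code):

```python
def pad_sequences_simple(seqs, maxlen, value=0):
    """Pre-pad / pre-truncate integer sequences to a fixed length,
    mimicking the Keras pad_sequences defaults."""
    out = []
    for s in seqs:
        s = list(s)[-maxlen:]                        # keep the last maxlen tokens
        out.append([value] * (maxlen - len(s)) + s)  # pad zeros at the front
    return out

encoded = [[4, 10, 2], [7]]
print(pad_sequences_simple(encoded, maxlen=4))
# [[0, 4, 10, 2], [0, 0, 0, 7]]
```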

Modeling :

Using LSTM :

LSTM stands for Long Short-Term Memory. An LSTM is a type of recurrent neural network, but it handles memory better than a traditional RNN. Because they are good at memorizing long-range patterns, LSTMs perform considerably better. I trained the model as follows:

In multi-label classification, instead of softmax() we use sigmoid() to obtain per-label probabilities.
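The difference is easy to see numerically (a small NumPy sketch; the logits are made up):

```python
import numpy as np

logits = np.array([2.0, 1.0, -1.0])  # raw output scores for three tags

# softmax couples the scores into one distribution that sums to 1,
# so it can only express "pick one tag".
softmax = np.exp(logits) / np.exp(logits).sum()

# sigmoid scores every tag independently, so several tags can
# exceed the decision threshold at once.
sigmoid = 1.0 / (1.0 + np.exp(-logits))

print(round(softmax.sum(), 6))     # 1.0
print(int((sigmoid > 0.5).sum()))  # 2 tags pass a 0.5 threshold
```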

After fitting the parameters, I set the number of epochs to 5, executed the code, and got the following results:

Below are the TensorBoard plots of :

  1. Behavior of accuracy with epoch number.
  2. Behavior of loss with epoch number.

In the above model we have used the following layers :

* Embedding_layer >> LSTM >> Drop_out >> LSTM >> Drop_out >> LSTM

* Validation accuracy = 16.19 %

Trend of F1_score with respect to threshold values :

* The F1 score is observed to decrease as the threshold value rises.

* Best F1_score = 32.28%
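Threshold tuning itself can be sketched with scikit-learn’s micro-averaged f1_score (the probabilities and ground-truth labels below are made up for illustration):

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical sigmoid outputs for 4 movies x 3 tags, plus ground truth:
probs = np.array([[0.9, 0.4, 0.2],
                  [0.7, 0.6, 0.1],
                  [0.3, 0.8, 0.55],
                  [0.2, 0.1, 0.9]])
y_true = np.array([[1, 0, 0],
                   [1, 1, 0],
                   [0, 1, 1],
                   [0, 0, 1]])

# Sweep thresholds and keep the one with the best micro F1:
scores = [(t, f1_score(y_true, (probs >= t).astype(int), average="micro"))
          for t in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)]
best_t, best_f1 = max(scores, key=lambda p: p[1])
print(best_t, best_f1)
```

Low thresholds over-predict tags (high recall, low precision) and high thresholds under-predict them, which is why the F1 score varies with the threshold.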

Using CNN :

A CNN is a neural network whose convolutional layers distinguish it from other architectures. Because convolutions operate over all features of a matrix, CNNs are well suited to matrix-shaped data such as embedded text sequences.

After fitting the parameters, I set the number of epochs to 5, executed the code, and got the following results:

Below are the TensorBoard plots of :

  1. Behavior of accuracy with epoch number.
  2. Behavior of loss with epoch number.

In the above model we have used the following layers :

* Embedding_layer >> Drop_out >> Conv1D >> GlobalMaxPool >> Dense_layer

* Validation accuracy = 6.69 %

Trend of F1_score with respect to threshold values :

* The F1 score is observed to be constant (= 0.0836) for initial threshold values.

* Best F1_score = 8.36%

Using LSTM + CNN :

Let’s experiment by combining the above two models:

After fitting the parameters, I set the number of epochs to 5, executed the code, and got the following results:

Below are the TensorBoard plots of :

  1. Behavior of accuracy with epoch number.
  2. Behavior of loss with epoch number.

In the above model we have used the following layers :

* Embedding_layer >> Drop_out >> LSTM >> Conv1D >> GlobalMaxPool >> Dense_layer
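The layer chain above can be sketched in Keras roughly as follows (a minimal sketch; the vocabulary size, sequence length, tag count, and all hyperparameters are illustrative assumptions, not the post’s actual values):

```python
from tensorflow.keras import layers, models

VOCAB_SIZE, MAXLEN, NUM_TAGS = 20000, 300, 71  # illustrative sizes

# Embedding >> Dropout >> LSTM >> Conv1D >> GlobalMaxPool >> Dense
model = models.Sequential([
    layers.Input(shape=(MAXLEN,)),
    layers.Embedding(VOCAB_SIZE, 128),
    layers.Dropout(0.3),
    layers.LSTM(64, return_sequences=True),  # keep the full sequence for Conv1D
    layers.Conv1D(64, kernel_size=3, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(NUM_TAGS, activation="sigmoid"),  # sigmoid: one probability per tag
])
model.compile(optimizer="adam", loss="binary_crossentropy")
print(model.output_shape)  # (None, NUM_TAGS)
```

The sigmoid output with binary cross-entropy loss is what makes the network multi-label: each tag gets its own independent probability.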

* Validation accuracy = 16.51 %

Trend of F1_score with respect to threshold values :

* The F1 score is observed to rise initially with increasing thresholds and then fall at higher thresholds.

* Best F1_score = 35.90%

Using BERT Model :

In Oct 2018, Google released a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. BERT builds upon recent work in pre-training contextual representations — including Semi-supervised Sequence Learning, Generative Pre-Training, ELMo, and ULMFit. However, unlike these previous models, BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus (Wikipedia).

We will use the base model: ‘uncased_L-12_H-768_A-12’
BERT_MODEL_HUB = "https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1"

It uses L=12 hidden layers (i.e., Transformer blocks), a hidden size of H=768, and A=12 attention heads.

For tokenization I have used tokenization.FullTokenizer()

Finally, the output of the pretrained BERT model is used as the input layer of model 4.

For this model I selected epochs = 30 to get better results.

Below are the TensorBoard plots of :

  1. Behavior of accuracy with epoch number.
  2. Behavior of loss with epoch number.

* In the above model we have used a pretrained BERT model.

* Validation accuracy is observed to rise with the number of epochs.

Trend of accuracy with respect to threshold values :

* Accuracy is observed to be almost constant, with slight variations.

* Validation accuracy = 16.71 %

Conclusion :

* From the table above, it is clear that all four models perform similarly, except the CNN.

* Accuracies of all the models are significantly low.

* Model with best accuracy score is ‘BERT’.

* There is scope for improvement here.

Future work :

  • Accuracy scores are not up to the mark, so there is scope for improvement.
  • Results can be improved by using feature engineering techniques.

References :

https://www.kaggle.com/datasets/cryptexcode/mpst-movie-plot-synopses-with-tags

GitHub Link :

LinkedIn Profile:

https://www.linkedin.com/in/ashish-choudhary-073a00148
