MPST: Movie Plot Synopsis with Tags

Sai Teja Psk · Published in Analytics Vidhya · 6 min read · Dec 22, 2019
Movie Tags

Hello guys, here we will see how to deal with a multi-label classification problem. For that, we will use the MPST dataset from Kaggle, where the task is to predict a movie's tags from its plot synopsis. Before going into the details, have a look at the data. I will give you the data source link so you can practice: click here to see the data source.

By looking at the data, several questions come to mind: "How do I deal with this?", "Which machine learning algorithm should I use?", "Which metric should I use?" If you are new to machine learning, these questions will definitely come up. My suggestion: when you start working with a dataset, first study the research paper about that dataset. It is a very good habit when beginning a case study. Click here to download the research paper for the MPST dataset.

I hope you read the above research paper; if not, please read it. Trust me, it is very important. From my personal experience: when I started this case study I skipped the research paper, applied a lot of algorithms, and it took many days to find out which one was suitable. After reading the research paper, I gained a lot of useful information.

Now, let's dive into our case study, MPST: Movie Plot Synopsis with Tags. The goal is to predict a movie's tags from its plot synopsis (a short story or summary of the movie).

Let's start with EDA (Exploratory Data Analysis) and see what the data looks like after reading mpst__data.csv.

mpst__data.csv

Now we will do some analysis on the Tags feature after cleaning the data (checking for NaN values and dropping duplicate entries). I will share the complete code document at the end of this blog, so you can check it there :)

Analysis of Tags feature:

Checking the number of tags per movie: below, we can see the tag count for each movie.

Tags count for each movie

Checking the maximum, minimum, and average number of tags per movie.
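The per-movie tag statistics above can be sketched with pandas. This is a minimal sketch assuming a DataFrame like `pure_df` with a comma-separated `tags` column, as in the MPST CSV; the sample rows here are made up for illustration:

```python
import pandas as pd

# Hypothetical sample mirroring the MPST "tags" column
pure_df = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "tags": ["murder, violence", "romantic", "cult, flashback, violence"],
})

# Number of tags per movie: split each comma-separated string and count
tag_counts = pure_df["tags"].apply(lambda t: len(t.split(",")))

print("max tags per movie:", tag_counts.max())
print("min tags per movie:", tag_counts.min())
print("avg tags per movie:", round(tag_counts.mean(), 2))
```

On the full dataset the same three lines give the maximum, minimum, and average tag counts reported in the analysis.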

Tags analysis

Let's see the unique tags in a word cloud visualization (the size of a word reflects its weight: the bigger the word, the more often that tag occurs).

If we observe the above word cloud, the five most frequent tags are murder, violence, flashback, romantic, and cult.

Let's see the same in frequency format.

Observing the frequency plot, we get the same kind of result: the top 5 tags are murder, violence, flashback, romantic, and cult.
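The frequency counts behind this plot can be computed with a `Counter`. A minimal sketch with made-up tag lists standing in for the split `tags` column:

```python
from collections import Counter

# Hypothetical tag lists per movie (in the real data these come from
# splitting the "tags" column of the MPST CSV)
movie_tags = [
    ["murder", "violence"],
    ["murder", "flashback"],
    ["romantic", "murder"],
]

# Flatten and count tag occurrences across all movies
tag_freq = Counter(tag for tags in movie_tags for tag in tags)

# most_common() returns (tag, count) pairs, most frequent first
print(tag_freq.most_common(3))
```

Feeding `tag_freq` to a bar plot (or to a word cloud's `generate_from_frequencies`) reproduces the visualizations above.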
How do you know which analysis steps to apply?

By practice: start with this dataset, try similar datasets, and apply what you learn from this blog.

Yes, if you are new to machine learning you may wonder how I know all these steps. I learned them from previous assignments in my course, and I applied that experience to this case study.

Data Pre-processing and Vectorization:

Now let's see data pre-processing. Before that, how do we select the feature X? We already know Y is the tags. Looking at the data, we want to predict movie tags; the title feature alone does not carry enough information for accurate predictions, whereas plot_synopsis contains much more information that can help us predict the tags more accurately. So we select plot_synopsis as our X.

Now we will apply data pre-processing and text-to-numerical vectorization (BoW, TF-IDF, AvgW2V, TFIDF-W2V, CHAR-3, CHAR-4) on the plot_synopsis feature. Click this link to see the code implementation.
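The word-level and character-level TF-IDF featurizations mentioned above can be sketched with scikit-learn. The toy synopses below are made up; on the real data you would fit on the pre-processed plot_synopsis column:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy synopses standing in for the pre-processed plot_synopsis column
synopses = [
    "a detective investigates a murder in a small town",
    "two strangers fall in love during a train journey",
]

# Word-level uni/bi/tri-grams (the TF-IDF "uni, bi, trigrams" features)
tfidf_word = TfidfVectorizer(ngram_range=(1, 3))
x_word = tfidf_word.fit_transform(synopses)

# Character-level 3- and 4-grams (the CHAR-3 / CHAR-4 features)
tfidf_char = TfidfVectorizer(analyzer="char", ngram_range=(3, 4))
x_char = tfidf_char.fit_transform(synopses)

print(x_word.shape, x_char.shape)
```

Both transforms return sparse matrices with one row per movie, which is what lets us later stack the different featurizations side by side.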

Converting tags to multi-label setting:

We already have the pre-processed X feature; now we have to convert our Y class labels to one-hot-encoded (binary indicator) form, as shown below:

# binary=True gives a binary (0/1) vectorizer
vectorizer = CountVectorizer(tokenizer=lambda x: x.split(), binary=True)
multilabel_y = vectorizer.fit_transform(pure_df['pre_pro_tags'])
multilabel_y.shape
# we have 71 unique tags, so each movie gets a 71-dimensional binary label vector
(14752, 71)
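To see what this binarization does, here is the same idea on a tiny, made-up set of pre-processed tag strings; the vectorizer's `vocabulary_` maps each column back to a tag name:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical pre-processed tag strings (space-separated,
# as the lambda tokenizer expects)
tag_strings = ["murder violence", "romantic", "murder flashback"]

vectorizer = CountVectorizer(tokenizer=lambda x: x.split(), binary=True)
multilabel_y = vectorizer.fit_transform(tag_strings)

print(multilabel_y.shape)              # (3, 4): 3 movies, 4 unique tags
print(sorted(vectorizer.vocabulary_))  # column-to-tag mapping
```

On the real data this is exactly how the (14752, 71) label matrix is produced: 14,752 movies, one binary column per unique tag.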

Modeling:

Now that our data is ready, it is time to apply models. Before doing so, check the research paper for what the authors implemented, and also check which algorithm and metric are suitable for multi-label classification. Following the research paper, we apply OneVsRestClassifier with Logistic Regression on TF-IDF vectorization, and on the combination of all the TF-IDF vectorizers: uni-, bi-, and tri-grams plus the c3 and c4 (character n-gram) featurizations.

alpha = [0.001, 0.01, 0.1, 0.5, 0.9, 1, 1.5, 10, 100, 1000]
# penalty = ['l1', 'l2']
params = {'estimator__C': alpha}
clf_estimator_6 = OneVsRestClassifier(LogisticRegression(class_weight='balanced', penalty='l2', n_jobs=-1), n_jobs=-1)
# we use RandomizedSearchCV for hyper-parameter tuning over C
RS_clf_6 = RandomizedSearchCV(estimator=clf_estimator_6, param_distributions=params, n_iter=10, cv=5, scoring='f1_micro', n_jobs=-1, verbose=10)
RS_clf_6.fit(x_train_uni_bi_tri, y_train_6)
print('Best estimator: ', RS_clf_6.best_estimator_)
print('Best Cross Validation Score: ', RS_clf_6.best_score_)

Instead of applying every vectorization and algorithm, we follow the research paper's implementation: Logistic Regression with OneVsRest, because this is a multi-label problem, and we stick to TF-IDF vectorization with the micro-averaged F1 score as the evaluation metric, which is apt for multi-label classification. Applying all this gives results like the ones below.
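The `f1_micro` scoring used above pools true positives, false positives, and false negatives over all tag columns before computing F1, so frequent tags carry more weight. A minimal sketch with made-up label matrices:

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical true and predicted multi-label matrices
# (rows = movies, columns = tags), in the binary form produced earlier
y_true = np.array([[1, 0, 1],
                   [0, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0]])

# micro-F1: TP=2, FP=0, FN=1 pooled over all columns
# -> precision=1.0, recall=2/3, F1=0.8
print(f1_score(y_true, y_pred, average="micro"))
```

This is the same metric `RandomizedSearchCV` optimizes via `scoring='f1_micro'`.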

Results:

Results close to the research paper: 0.37 as the highest F1 score.

We can also apply other featurizations like LDA topic modeling, but the results will not meet our expectations.

Applying models on Top-3 & 5 Tags:

Instead of predicting all the tags per movie, what if we predict only the top-3 or top-5 tags? We have already seen in the tag analysis that most movies have 3–5 tags.

# top-3 tags
# Following the research paper, we first analyze the top 3 tags
vectorizer = CountVectorizer(tokenizer=lambda x: x.split(), binary=True, max_features=3)  # simply pass max_features = 3 or 5
multilabel_y3 = vectorizer.fit_transform(pure_df['pre_pro_tags'])
multilabel_y3.shape
# max_features keeps only the 3 (or 5) most frequent tags in the corpus
(14752, 3)
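Note that `max_features=3` restricts the label space to the three most frequent tags corpus-wide. A related idea at prediction time is to rank each movie's per-tag probabilities and keep the top k; a sketch with made-up probabilities (the tag names and scores are illustrative, not from the trained model):

```python
import numpy as np

# Hypothetical per-tag probabilities, e.g. from
# OneVsRestClassifier.predict_proba, for two movies over five tags
tags = np.array(["murder", "violence", "flashback", "romantic", "cult"])
proba = np.array([[0.9, 0.7, 0.2, 0.1, 0.6],
                  [0.1, 0.2, 0.8, 0.9, 0.3]])

# Indices of the 3 highest-probability tags per movie
top3 = np.argsort(proba, axis=1)[:, ::-1][:, :3]

# e.g. [['murder', 'violence', 'cult'], ['romantic', 'flashback', 'cult']]
print([list(tags[row]) for row in top3])
```

This way every movie still gets exactly three predicted tags, but they are chosen per movie rather than globally.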

Let's see the top-3 and top-5 results:

We got 0.58 as the highest F1 score for predicting the top-3 tags.

With simple vectorizations, we got results close to the research paper. We can also hstack the LDA features with the combined TFIDF_U_B_T_C3_C4 features, which gives 0.48 as the highest F1 score over all 71 tags.
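Stacking the LDA topic features next to the sparse TF-IDF features can be done with `scipy.sparse.hstack`. A minimal sketch with toy stand-in matrices (the shapes here are made up):

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy stand-ins: TF-IDF features (sparse) and LDA topic proportions (dense)
x_tfidf = csr_matrix(np.random.rand(4, 10))  # 4 movies, 10 TF-IDF features
x_lda = np.random.rand(4, 5)                 # 4 movies, 5 LDA topics

# Stack column-wise so each movie row carries both feature sets
x_combined = hstack([x_tfidf, csr_matrix(x_lda)])

print(x_combined.shape)  # (4, 15)
```

Keeping everything sparse avoids densifying the large TF-IDF matrix before feeding the combined features to the classifier.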

Check out my GitHub link for the fully documented code. I suggest you try running my code on Paperspace GPUs, which compute much faster.

If you have any doubts, feel free to comment below; I'll definitely get back to your query :)

I hope you guys like my way of explanation.

keep loving ML :)

Thank you.

References:

  1. https://www.appliedaicourse.com/
  2. https://www.aclweb.org/anthology/L18-1274/
  3. https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/
  4. https://stackoverflow.com/questions/49856775/understanding-character-level-feature-extraction-using-tfidfvectorizer
