Movie Tags Prediction Using Machine Learning Models.

Sandeep Kumar Panda · Published in Analytics Vidhya · May 25, 2020
Source : https://www.researchgate.net/figure/Color-online-Word-Clouds-for-a-Childrens-and-b-Romantic-Movies_fig3_329106615

Hello guys!!!

In this blog post, we will talk about solving a Multi-Label Classification problem.

Predicting tags for movies helps us find out information like the genre, plot structure, metadata, and emotional experience of a movie. That information can be used to build automated systems that assign tags to movies.

In this case study, we will focus on building an automated engine which can extract tags from movie plot synopsis data. A plot synopsis is nothing but a detailed or partial summary of a movie. Note that a particular movie might have a single tag or more than one tag. This is where multi-label classification comes into play.

Since we are dealing with a multi-label classification problem, let's first discuss the difference between multi-class and multi-label problems.

Multiclass classification means a classification task with more than two classes; e.g., classifying a set of movies by genre, which may be Action, Comedy, or Adventure. Multiclass classification makes the assumption that each sample is assigned to one and only one label: a movie can be either Action or Comedy, but not both at the same time.

On the other hand, multilabel classification assigns to each movie a set of target labels; e.g., a movie may be tagged Action, Comedy, Adventure or Action, Comedy, Horror, Thriller. A movie might carry any number of these labels at the same time, or none of them.

You can read more about multi-label and multi-class classification with detailed examples here.

On the left, we have a binary classification setting where an email can be classified as either spam or not spam. The picture in the middle shows a multi-class classification setting, where an animal can belong to one and only one class. At the extreme right, we have a multi-label classification setting where objects in the image can belong to one or multiple classes — ‘Cat’ and ‘Bird’. Source: https://www.microsoft.com/en-us/research/uploads/prod/2017/12/40250.jpg
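To make the multi-label setting concrete, here is a minimal sketch (not from the original post) of how tag lists can be encoded as a binary indicator matrix using scikit-learn's MultiLabelBinarizer, with made-up tags:

from sklearn.preprocessing import MultiLabelBinarizer

# Made-up tag lists for three movies
movie_tags = [
    ["action", "comedy"],
    ["horror", "thriller", "action"],
    ["comedy"],
]

mlb = MultiLabelBinarizer()
y = mlb.fit_transform(movie_tags)  # one binary indicator column per tag
print(mlb.classes_)  # ['action' 'comedy' 'horror' 'thriller']
print(y)
# [[1 1 0 0]
#  [1 0 1 1]
#  [0 1 0 0]]

Each row is a movie and each column a tag, so a row may contain several 1s (multi-label), unlike multi-class where exactly one entry would be 1.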

Data:

The data we will use in this blog is sourced from Kaggle. It consists of approximately 14K movies and 71 unique tags.

Each data point has the following attributes.

  1. IMDB_ID: A unique identifier, the id of the movie assigned by IMDB.
  2. Title: The movie name.
  3. Plot_Synopsis: The movie plot summary.
  4. Tags: All the tags assigned to the movie.
  5. Split: Position of the movie in the standard data split, indicating whether the data point belongs to the train, test, or validation set.
  6. Synopsis_Source: Source from which the plot synopsis of each movie was collected, either IMDB or Wikipedia.

Exploratory Data Analysis of the Tags feature.

1. a. Loading the data and displaying the first 2 rows:

import pandas as pd

data = pd.read_csv("mpst_full_data.csv")
print("Number of data points: ", data.shape[0])
data.head(2)
Figure 1 : Top 2 rows of dataset

b. Let’s find out the number of data points and attributes present in the dataset:

data.shape
(14828, 6)

Here we can see a total of 14828 data points/rows, each with 6 attributes/columns.

2. Creating a SQL db file from the given CSV file and deleting the duplicate entries present in the dataset.

a. Creating a db file:

# Learn SQL: https://www.w3schools.com/sql/default.asp
import os
import sqlite3
from datetime import datetime
from sqlalchemy import create_engine

start = datetime.now()
if not os.path.isfile('mpst.db'):
    disk_engine = create_engine('sqlite:///mpst.db')
    chunksize = 15000
    j = 0
    index_start = 1
    # Write the CSV into the SQLite database in chunks
    for df in pd.read_csv('mpst_full_data.csv', chunksize=chunksize, iterator=True, encoding='utf-8'):
        df.index += index_start
        j += 1
        df.to_sql('mpst_full_data', disk_engine, if_exists='append')
        index_start = df.index[-1] + 1
else:
    print("Database Already Exists.")
print("Time taken to run this cell :", datetime.now() - start)

b. Deleting the duplicate entries present in the dataset:

con = sqlite3.connect('mpst.db')
data_no_dup = pd.read_sql_query('SELECT title,plot_synopsis,tags,split,synopsis_source,COUNT(*) as cnt_dup FROM mpst_full_data GROUP BY title', con)
con.close()

c. Let’s create another column called “tag_count”, which will hold the number of tags for each movie.

data_no_dup["tag_count"] = data_no_dup["tags"].apply(lambda text: len(str(text).split(", ")))
data_no_dup.head()
Figure 2 : Modified dataset with a new column “tag_count”

d. Let’s find out the number of data points and attributes present in the dataset:

data_no_dup.shape
(13757, 7)

Here we can see a total of 13757 data points, each with 7 attributes (the previous 6 plus the newly created tag_count column).

3. Distribution of train, validation and test data points

import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(data_no_dup['split'])
plt.show()
Figure 3 : Distribution of Data Points

4. Checking the Distribution of Data Sources

sns.countplot(data_no_dup['synopsis_source'])
plt.show()
Figure 4 : Distribution of movies from various sources

5. Checking the distribution of tag(s) per movie:

plt.figure(figsize=(20,5))
plt.plot(data_no_dup["tag_count"])
plt.xlabel('movies')
plt.ylabel("no.of tags per movie")
plt.show()
Figure 5 : Tag(s) per movies

6. Finding out the exact numbers:

data_no_dup["tag_count"].value_counts()
Figure 6 : How many movies contain how many tags [Left Column: No. of Tags, Right Column: No. of Movies]

7. Counting the number of unique tags present in the dataset:

To do this, we follow the BoW (Bag of Words) technique by applying the CountVectorizer method.

What is BoW?

It is the simplest representation of text documents: it counts how many times a particular word occurs in a particular document.

To simplify it further: it collects the words and their occurrence frequencies and puts them in a bag, hence the name Bag of Words (BoW).

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(preprocessor=lambda x: x, tokenizer=lambda x: str(x).split(", "))
tag_vect = vectorizer.fit_transform(data_no_dup["tags"])
print("Number of data points :", tag_vect.shape[0])
print("Number of unique tags :", tag_vect.shape[1])
Number of data points : 13757
Number of unique tags : 71
# Zipping tags and tag counts into one list
tags = vectorizer.get_feature_names()
freqs = tag_vect.sum(axis=0).A1
result = list(zip(tags, freqs))

8. Wordcloud for most occurring tags :

Figure 7 : Wordcloud for most occurring tags

From the above image we can conclude that tags such as “murder”, “flashback”, “violence”, “romantic”, “psychedelic”, and “cult” occur most often.

Tags like “good versus evil”, “entertaining”, and “suspenseful” occur slightly less often.
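The plotting code is not shown in the post; a minimal sketch that could produce such a cloud from the (tag, frequency) pairs in result, assuming the wordcloud package is installed:

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Build a tag -> frequency mapping from the (tag, freq) pairs computed above
tag_freq = dict(result)

wc = WordCloud(width=1200, height=600, background_color='white')
wc.generate_from_frequencies(tag_freq)

plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()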

Data Cleaning

In this section we will look at the techniques used for data pre-processing:

  1. Removing any HTML tags present in the dataset.
  2. De-contracting words (e.g., won’t = will not).
  3. Converting every word to lowercase.
  4. Removing stopwords.
  5. Lemmatizing words (words in third person are changed to first person, and verbs in past and future tense are changed to present tense).
  6. Lastly, Snowball stemming.

import re

def decontracted(phrase):
    # specific
    phrase = re.sub(r"won't", "will not", phrase)
    phrase = re.sub(r"can\'t", "can not", phrase)
    # general
    phrase = re.sub(r"n\'t", " not", phrase)
    phrase = re.sub(r"\'re", " are", phrase)
    phrase = re.sub(r"\'s", " is", phrase)
    phrase = re.sub(r"\'d", " would", phrase)
    phrase = re.sub(r"\'ll", " will", phrase)
    phrase = re.sub(r"\'t", " not", phrase)
    phrase = re.sub(r"\'ve", " have", phrase)
    phrase = re.sub(r"\'m", " am", phrase)
    return phrase

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from bs4 import BeautifulSoup
from tqdm import tqdm

stopwords = set(stopwords.words('english'))
sno = nltk.stem.SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

preprocessed_synop = []
for sentance in tqdm(data_no_dup['plot_synopsis'].values):
    sentance = re.sub(r"http\S+", "", sentance)            # remove URLs
    sentance = BeautifulSoup(sentance, 'lxml').get_text()  # strip HTML tags
    sentance = decontracted(sentance)
    sentance = re.sub("\S*\d\S*", "", sentance).strip()    # remove words containing digits
    sentance = re.sub('[^A-Za-z]+', ' ', sentance)         # keep letters only
    stemmed_sentence = []
    for e in sentance.split():
        if e.lower() not in stopwords:
            # lemmatize, then stem each word
            s = (sno.stem(lemmatizer.lemmatize(e.lower()))).encode('utf8')
            stemmed_sentence.append(s)
    sentance = b' '.join(stemmed_sentence)
    preprocessed_synop.append(sentance)

# Add a column with the cleaned text after pre-processing
data_no_dup['CleanedSynopsis'] = preprocessed_synop
data_no_dup['CleanedSynopsis'] = data_no_dup['CleanedSynopsis'].str.decode("utf-8")

Original Text:

A 6th grader named Griffin Bing decides to gather their entire grade in a sleepover protest in an old house about to be demolished after their plan for using a new space in their town was thrown out because of their youth. However, only Griffin and his best friend Ben Slovak show up. Griffin discovers a Babe Ruth baseball card that, unbeknownst to him, is worth huge amounts of money. Excited that the card could help his family, which is struggling financially, Griffin takes it to the local collectibles dealer, S. Wendell Palomino. S. Wendell tells the boys that the card is an old counterfeit of a valuable one, worth only one hundred dollars. A dejected Griffin later chances upon Palomino on television, stating that the card he stole was worth at least a million dollars. Enraged, Griffin and Ben try to steal it back from Swindle's shop, only to find that it has gone, and they have to break into Swindle's house. Now, in order to get the card back, Griffin must gather a team of local students with unique skills to break into Palomino's heavily guarded home to retrieve the card before the big auction where Swindle plans to sell the card. The team consists of seven people (including Ben and Griffin): Savannah the Dog whisperer, to get past Swindle's massive, violent Guard Dog Luthor; Logan the actor, to distract Swindle's eagle-eyed neighbor who spends his days watching the entire street's goings-on; Antonia "Pitch" Benson the "born to climb" girl, to scale the skylight in Swindle's house; Darren Vader who the others had no choice but to add to the team, for he threatened to rat them out (But Darren proved to be useful pulling people up the skylight); Melissa the unsociable computer genius, who was used to break into Swindle's UltraTech alarm system. The tension is piled with an unexpected visit from the auctioneer, yet another even more menacing guard dog, and a betrayal from the person who begged to be in the group. The book was followed by multiple sequels, titled Zoobreak, Framed!, Showoff, Hideout, Jackpot and Unleashed.

Cleaned Text:

grader name griffin bing decid gather entir grade sleepov protest old hous demolish plan use new space town thrown youth howev griffin best friend ben slovak show griffin discov babe ruth basebal card unbeknownst worth huge amount money excit card could help famili struggl financi griffin take local collect dealer wendel palomino wendel tell boy card old counterfeit valuabl one worth one hundr dollar deject griffin later chanc upon palomino televis state card stole worth least million dollar enrag griffin ben tri steal back swindl shop find gone break swindl hous order get card back griffin must gather team local student uniqu skill break palomino heavili guard home retriev card big auction swindl plan sell card team consist seven peopl includ ben griffin savannah dog whisper get past swindl massiv violent guard dog luthor logan actor distract swindl eagl eye neighbor spend day watch entir street go antonia pitch benson born climb girl scale skylight swindl hous darren vader other choic add team threaten rat darren prove use pull peopl skylight melissa unsoci comput genius use break swindl ultratech alarm system tension pile unexpect visit auction yet anoth even menac guard dog betray person beg group book follow multipl sequel titl zoobreak frame showoff hideout jackpot unleash

Machine Learning Approach for Predicting Movie Tags

First, we read the data from the CSV file and split it into train and test using the split column given in the dataset.

Reading the data

data_with_all_tags = pd.read_csv("/content/drive/My Drive/ML/data_with_all_tags.csv")
data_with_all_tags.head()

Splitting the Data into Train and Test

conn = sqlite3.connect('data.db')
data_with_all_tags.to_sql('data', conn, if_exists='replace', index=False)
train = pd.read_sql("Select * From data where split = 'train' OR split='val'",conn)
test = pd.read_sql("Select * From data where split = 'test'",conn)
conn.close()

Let’s discuss some of the terms we will use in the models.

OneVsRestClassifier : This strategy consists of fitting one classifier per class. For each classifier, the class is fitted against all the other classes.

CountVectorizer : Converts a collection of text documents to a matrix of token counts.

TfidfVectorizer : Assigns more weight to less frequent words. In simple words, TFIDF is a product of Term Frequency (TF) and Inverse Document Frequency (IDF).

TF = (Number of times term t appears in a document) / (Total number of terms in the document)

IDF = log(Total number of documents / Number of documents with term t in it)
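As a quick illustration with made-up numbers: if “murder” appears 3 times in a 100-word synopsis, TF = 3/100 = 0.03; if 100 out of 10,000 synopses contain “murder”, IDF = log(10,000/100) = log(100) ≈ 4.6 (natural log), giving a TF-IDF weight of about 0.03 × 4.6 ≈ 0.14.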

Metrics:

True Positive (TP) = points predicted as positive that are actually positive.

False Positive (FP) = points predicted as positive that are actually negative.

True Negative (TN) = points predicted as negative that are actually negative.

False Negative (FN) = points predicted as negative that are actually positive.

Precision is the fraction of true positives among the sum of true positives and false positives.

Precision = TP/(TP+FP)

Recall is the fraction of true positives among the sum of true positives and false negatives.

Recall = TP/(TP+FN)

F1-score is the harmonic mean of precision and recall.

F1 = 2*(Precision*Recall)/(Precision+Recall)

The metric we are going to use is Micro-f1, which is the harmonic mean of micro-Precision and micro-Recall.

Micro-precision is the ratio of the sum of true positives across all classes to the sum of true positives and false positives across all classes:

micro-precision = (TP1+TP2+…)/((TP1+TP2+…)+(FP1+FP2+…))
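In scikit-learn, these micro-averaged metrics can be computed directly; a small sketch with made-up label matrices to show how the pooling works:

from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up true and predicted multi-label matrices (3 movies x 3 tags)
y_true = [[1, 0, 1],
          [0, 1, 0],
          [1, 1, 0]]
y_pred = [[1, 0, 0],
          [0, 1, 1],
          [1, 0, 0]]

# Micro-averaging pools TP/FP/FN across all tags before taking the ratios
print(precision_score(y_true, y_pred, average='micro'))  # 3/(3+1) = 0.75
print(recall_score(y_true, y_pred, average='micro'))     # 3/(3+2) = 0.60
print(f1_score(y_true, y_pred, average='micro'))         # ≈ 0.667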

For hyperparameter tuning, I used GridSearchCV with 5-fold cross-validation.
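A sketch of how such a grid search might be wired up with OneVsRest (the parameter grid here is a made-up example, not the one used in the project):

from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

# Hypothetical grid over the base estimator's regularization strength
param_grid = {'estimator__C': [0.01, 0.1, 1, 10]}

clf = OneVsRestClassifier(LogisticRegression(class_weight='balanced'))
grid = GridSearchCV(clf, param_grid, cv=5, scoring='f1_micro', n_jobs=-1)
# grid.fit(X_train_multilabel, y_train_multilabel)
# print(grid.best_params_, grid.best_score_)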

Machine Learning Models:

Using All Tags:

  1. Using TFIDF Vectorizer :

In the baseline model building section, we tried numerous models: MultinomialNB, Logistic Regression, SGD Classifier with log loss, and SGD Classifier with hinge loss. In all cases we want to maximize the micro-averaged F1 score. The baseline model which gave us the maximum micro-averaged F1 score is Logistic Regression (0.2601).

Let’s look at a simple code snippet we have used for OneVsRest with Logistic Regression. Throughout the experiment, I have used the same code structure. So, if you want to change any model, just replace LogisticRegression with any model and things will work.

lr = LogisticRegression(class_weight='balanced')
clf = OneVsRestClassifier(lr)
clf.fit(X_train_multilabel, y_train_multilabel)

prediction16 = clf.predict(X_test_multilabel)
precision16 = precision_score(y_test_multilabel, prediction16, average='micro')
recall16 = recall_score(y_test_multilabel, prediction16, average='micro')
f1_score16 = 2*((precision16 * recall16)/(precision16 + recall16))
print("precision16: {:.4f}, recall16: {:.4f}, F1-measure: {:.4f}".format(precision16, recall16, f1_score16))

Output of above code block:

precision16: 0.1673, recall16: 0.5839, F1-measure: 0.2601
Figure 8: Score comparisons of different models of Baseline Model

Let’s compare the actual tags and the model-predicted tags:

Figure 9: Comparison between Actual tags & Model Predicted Tags

2. Using AVGW2V :

As in the previous discussion, here also LogisticRegression gave the highest F1 score (0.214).
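The featurization code is not shown in the post; a minimal sketch of average Word2Vec (AVGW2V), assuming a gensim Word2Vec model (gensim 4.x parameter names) trained on the cleaned synopses:

import numpy as np
from gensim.models import Word2Vec

# Tokenize the cleaned synopses (assumes train['CleanedSynopsis'] holds the cleaned text)
train_tokens = [str(s).split() for s in train['CleanedSynopsis'].values]

# Train Word2Vec on the training corpus
w2v = Word2Vec(train_tokens, vector_size=100, window=5, min_count=5, workers=4)

def avg_w2v(tokens, model, dim=100):
    # Average the vectors of all in-vocabulary words in one synopsis
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

X_train_avgw2v = np.vstack([avg_w2v(t, w2v) for t in train_tokens])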

Figure 10: Score comparison of AVG W2V models

Let’s see the tag predicted by the model :

Figure 11: Tag comparison between actual tags and predicted tags

3. Using LSTM-CNN Model :

Figure 12: LSTM-CNN model
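The exact architecture is shown in Figure 12; as a representative sketch only (the sizes here are assumptions), a Keras LSTM-CNN for multi-label tagging could look like this:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, LSTM, Dense

# Assumed sizes; the real values depend on the tokenizer and padding used
vocab_size, max_len, num_tags = 50000, 500, 71

model = Sequential([
    Embedding(vocab_size, 100, input_length=max_len),
    Conv1D(64, 5, activation='relu'),       # local n-gram features
    MaxPooling1D(4),
    LSTM(100),                              # sequence modelling over pooled features
    Dense(num_tags, activation='sigmoid'),  # one independent probability per tag
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

Note the sigmoid output layer and binary cross-entropy loss: each tag gets its own independent probability, which is what makes the network multi-label rather than multi-class.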

Let’s find out the accuracy :

test_loss, test_acc = model.evaluate(X_test, y_test_multilabel, verbose=2)
print('\nTest accuracy:', test_acc)
Test accuracy: 0.13583815097808838

So we are not getting good accuracy here either.

Tag Comparison:

Figure 13: Tag Comparison

Modelling with Top 3 Tags

In the EDA section, we saw that on average a movie has three tags. So let’s build a model which predicts the top three tags. We used the same set of featurizations, but this time the number of tags was set to 3:

cnt_vectorizer = CountVectorizer(tokenizer=tokenize, max_features=3, binary=True).fit(y_train)
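Presumably the fitted vectorizer is then used to binarize the tag strings of both splits; a small sketch (the names y_train and y_test are assumed to hold the comma-separated tag strings):

# Assumed: y_train/y_test hold the tag strings of the train/test splits
y_train_multilabel = cnt_vectorizer.transform(y_train)
y_test_multilabel = cnt_vectorizer.transform(y_test)
print("Tags kept:", cnt_vectorizer.get_feature_names())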
  1. Using TFIDF Vectorizer :

Here, SGDClassifier with log loss gave us the highest F1 score, i.e., 0.586.

Figure 14 : Score Comparison of Top 3 tags

Here we can see a significant improvement in the scores.

Tag Comparison :

Figure 15 : Comparison between actual tag and model predicted tags

2. Using AVGW2V :

Figure 16 : Score Comparisons

Here we can see that LogisticRegression gave the highest F1 score (0.562).

Tag Comparisons :

Figure 17 : Tag Comparisons

3. LSTM-CNN Model :

Figure 18 : LSTM-CNN Model for Top 3 tags

Here we got an accuracy of 0.283.

Tag Comparison :

Figure 19 : Actual Tags VS Predicted Tags

Modelling with Top 5 Tags

Here, the code change is pretty simple. We only have to change the value of “max_features” from 3 to 5; the rest of the code snippet remains the same.

cnt_vectorizer = CountVectorizer(tokenizer=tokenize, max_features=5, binary=True).fit(y_train)
  1. Using TFIDF Vectorizer :
Figure 20 : Accuracy Scores

Here also, LogisticRegression gave the highest F1 score (0.535).

Tag Comparison :

Figure 21: Actual Tags vs Predicted Tags

2. Using AVGW2V :

Figure 22 : Accuracy Scores using AVG W2V

Here the SGDClassifier with hinge loss gave the highest F1 score.

Tag Comparison :

Figure 23 : Actual Tags vs Predicted Tags

3. LSTM-CNN Model :

Figure 24 : LSTM-CNN Model

Here we got an accuracy of 0.213.

Tag Comparisons :

Figure 25 : Actual Tags vs Predicted Tags

Modelling with TOP 30 Tags

In this section we manually select the top 30 most frequent tags out of all 71. To do that, we first read the dataset and then vectorize the tags using the BoW technique to count how often each tag occurs.

data = pd.read_csv("data_with_all_tags.csv")vectorizer = CountVectorizer(preprocessor=lambda x: x, tokenizer = lambda x: str(x).split(", ") )
tag_vect = vectorizer.fit_transform(data["tags"])
tags = vectorizer.get_feature_names()
freqs = tag_vect.sum(axis=0).A1
result = list(zip(tags, freqs))
print((result))

Here is the output of the above code cell:

Figure 26 : All the Tags with their occurrences

Now we create a dataframe from this and sort it in descending order of tag occurrences. After that, we select the top 30 tags by frequency.

tag_counts = pd.DataFrame(result, columns=['tag', 'tag_counts'])
tag_counts_sorted = tag_counts.sort_values(['tag_counts'], ascending=False)
tag_counts = tag_counts_sorted['tag'][:30]
print(tag_counts)
print(tag_counts)

Let’s see the top 30 tags:

Figure 27 : TOP 30 Tags

Now we compare these top 30 tags with the tags of each movie and delete every tag other than these 30.

Then we delete the rows left with no tags after this filtering, as sketched below.
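A minimal sketch of this filtering step (assuming data holds the full dataframe and tag_counts the 30 tag names selected above):

top30 = set(tag_counts)  # the 30 most frequent tag names

def keep_top30(tags):
    # Keep only the tags that are among the top 30
    return ", ".join(t for t in str(tags).split(", ") if t in top30)

data["tags"] = data["tags"].apply(keep_top30)
# Drop the movies left with no tags at all
data = data[data["tags"].str.len() > 0]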

Before Comparison :

Figure 28 : Original Tags per Movie

Here we can see Length: 13757, which means we have 13757 rows (movies).

After Comparison :

Figure 29 : Tags per Movie after the Comparison

Here we can notice the length has been reduced to 13010, which means all the movies left with no tags were removed.

Now the same process is followed as for the previous models to get the F1 scores.

  1. AVGW2V :
Figure 30 : Accuracy Scores

LogisticRegression gave the highest F1 score (0.32).

Tag Comparison :

Figure 31 : Actual Tags vs Predicted Tags

2. LSTM-CNN :

Figure 32 : LSTM-CNN Model

Here we got an accuracy of 0.043.

Tag Comparison :

Figure 33 : Actual Tags vs Predicted Tags

Modelling with TOP 5 Tags (Manual One-Hot Encoding)

Here we follow the same steps as in the previous section (top 30), but instead of 30 tags we do it for the top 5 tags.

In this section we won’t use sklearn’s CountVectorizer method; instead, we one-hot encode the tags manually to get somewhat higher accuracy.

After implementing one-hot encoding of the tags manually, it should look like this:

Figure 34 : OneHotEncoding of Tags
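A minimal sketch of such manual one-hot encoding (the tag list here is hypothetical, based on the most frequent tags from the EDA):

# Hypothetical top-5 tag list
top5 = ['murder', 'violence', 'flashback', 'romantic', 'cult']

for tag in top5:
    # One binary column per tag: 1 if the movie carries the tag, else 0
    data[tag] = data['tags'].apply(lambda t, tag=tag: 1 if tag in str(t).split(', ') else 0)

data[top5].head()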

After this, the rest of the process is the same.

  1. AVGW2V:
Figure 35 : Accuracy Scores

Here we can see that LogisticRegression gave the maximum F1 score (0.59), not just among these models but among all the other models we tried.

Tag Comparison :

Figure 36 : Actual Tags vs Predicted Tags

In the tag predictions too, we can see the difference between this model and every other model.

2. LSTM Model :

Figure 37 : LSTM Model

Here we got an accuracy of 0.66, which is quite significant compared to the other LSTM models or ML models.

Tag Prediction :

Figure 38 : Actual Tags vs Predicted Tags

Topic Modelling

In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents.

Topic modeling is a frequently used text-mining tool for discovery of hidden semantic structures in a text body.

Latent Dirichlet Allocation (LDA) is a popular algorithm for topic modeling with an excellent implementation in Python’s Gensim package.

Let’s get started

Let’s read the pre-processed dataset

dataframe = pd.read_csv("data_with_all_tags.csv")

Now let’s extract only those features we need

data = dataframe[['title', 'plot_synopsis', 'tags', 'split', 'CleanedSynopsis']]
data.shape
(13757, 5)

Let’s split each synopsis into words to create the data corpus

data["synopsis_words"] = data["CleanedSynopsis"].apply(lambda x: x.split())data_words=[]
for sent in data["synopsis_words"].values:
data_words.append(sent)

Let’s create the Dictionary and Corpus required for LDA

import gensim
from gensim import corpora

id2word = corpora.Dictionary(data_words)
corpus = [id2word.doc2bow(text) for text in data_words]

Building the Topic Model:

We have everything required to train the LDA model. In addition to the corpus and dictionary, you need to provide the number of topics as well.

# Build LDA model
lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=10,
                                       random_state=100,
                                       chunksize=10,
                                       passes=10,
                                       alpha='symmetric',
                                       iterations=100,
                                       per_word_topics=True,
                                       workers=7)

View the topics in LDA model:

The above LDA model is built with 10 different topics where each topic is a combination of keywords and each keyword contributes a certain weightage to the topic.

You can see the keywords for each topic and the weightage(importance) of each keyword using lda_model.print_topics() as shown next.

from pprint import pprint

pprint(lda_model.print_topics())

[(0,
'0.009*"kill" + 0.006*"father" + 0.006*"get" + 0.005*"love" + 0.004*"famili" '
'+ 0.004*"brother" + 0.004*"son" + 0.004*"meet" + 0.004*"take" + '
'0.004*"fight"'),
(1,
'0.014*"tell" + 0.012*"go" + 0.011*"see" + 0.011*"say" + 0.011*"back" + '
'0.010*"get" + 0.007*"ask" + 0.007*"find" + 0.007*"look" + 0.007*"room"'),
(2,
'0.015*"kill" + 0.011*"polic" + 0.007*"car" + 0.006*"shoot" + 0.006*"offic" '
'+ 0.006*"frank" + 0.006*"man" + 0.006*"john" + 0.006*"find" + 0.005*"gun"'),
(3,
'0.008*"georg" + 0.008*"get" + 0.008*"billi" + 0.007*"jim" + 0.007*"ray" + '
'0.006*"scott" + 0.006*"go" + 0.005*"rachel" + 0.005*"find" + 0.005*"bella"'),
(4,
'0.009*"hous" + 0.008*"find" + 0.007*"mother" + 0.007*"mari" + '
'0.007*"father" + 0.006*"child" + 0.006*"kill" + 0.006*"home" + '
'0.005*"woman" + 0.005*"famili"'),
(5,
'0.016*"go" + 0.015*"tell" + 0.014*"get" + 0.012*"say" + 0.012*"sam" + '
'0.011*"david" + 0.009*"mike" + 0.009*"charli" + 0.008*"ask" + '
'0.008*"sarah"'),
(6,
'0.008*"human" + 0.007*"power" + 0.006*"world" + 0.005*"use" + 0.005*"earth" '
'+ 0.005*"destroy" + 0.004*"find" + 0.004*"one" + 0.004*"alien" + '
'0.004*"reveal"'),
(7,
'0.006*"king" + 0.006*"kill" + 0.005*"return" + 0.005*"vampir" + '
'0.004*"villag" + 0.004*"take" + 0.004*"arriv" + 0.004*"one" + '
'0.003*"father" + 0.003*"son"'),
(8,
'0.008*"kill" + 0.006*"war" + 0.006*"attack" + 0.006*"soldier" + '
'0.005*"ship" + 0.005*"order" + 0.005*"forc" + 0.005*"men" + 0.005*"group" + '
'0.005*"escap"'),
(9,
'0.006*"new" + 0.005*"love" + 0.005*"life" + 0.005*"time" + 0.005*"one" + '
'0.005*"friend" + 0.005*"make" + 0.005*"day" + 0.004*"go" + 0.004*"film"')]

How to interpret this?

Topic 0 is represented as ‘0.009*”kill” + 0.006*”father” + 0.006*”get” + 0.005*”love” + 0.004*”famili” ‘ ‘+ 0.004*”brother” + 0.004*”son” + 0.004*”meet” + 0.004*”take” + ‘ ‘0.004*”fight”’

It means the top 10 keywords that contribute to this topic are “kill”, “father”, “get”, and so on, and the weight of “kill” in topic 0 is 0.009.

The weights reflect how important a keyword is to that topic.

Finding the dominant topic in each sentence:

One of the practical applications of topic modeling is to determine what topic a given document is about.

To find that, we find the topic number that has the highest percentage contribution in that document.

The format_topics_sentences() function below nicely aggregates this information in a presentable table.

data_list = dataframe.CleanedSynopsis.values.tolist()

def format_topics_sentences(ldamodel=None, corpus=corpus, texts=data_list):
    # Init output
    sent_topics_df = pd.DataFrame()
    # Get main topic in each document
    for i, row_list in enumerate(ldamodel[corpus]):
        row = row_list[0] if ldamodel.per_word_topics else row_list
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        # Get the Dominant topic, Perc Contribution and Keywords for each document
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_df = sent_topics_df.append(
                    pd.Series([int(topic_num), round(prop_topic, 4), topic_keywords]),
                    ignore_index=True)
            else:
                break
    sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']
    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return sent_topics_df

df_topic_sents_keywords = format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data_words)

# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']
Figure 39 : Dominant Topics

Let’s save these dominant topics to a CSV file and then concatenate them with our original data.

df_dominant_topic.to_csv("dominant_topics.csv")
df_topic = pd.read_csv("dominant_topics.csv")
combined_df = pd.concat([data, df_topic], axis=1)

Let’s split the dataset into train and test and apply the vectorizer as we did previously to get the F1 score.

data_test=combined_df.loc[(combined_df['split'] == 'test')]
data_train=combined_df.loc[(combined_df['split'] == 'val') | (combined_df['split'] == 'train')]

After applying the TFIDF vectorizer, we got the following scores.

Figure 40 : Accuracy Scores

Let’s Implement our model using Flask

First, write the code in Flask to create an API. Here is a nice tutorial for it.
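A bare-bones sketch of such a Flask app (the pickle file names and template are assumptions, not the project’s actual artifacts):

from flask import Flask, request, render_template
import joblib

app = Flask(__name__)
# Load the fitted vectorizer, OneVsRest model and tag binarizer (hypothetical file names)
vectorizer = joblib.load('tfidf_vectorizer.pkl')
model = joblib.load('ovr_logreg.pkl')
binarizer = joblib.load('tag_binarizer.pkl')

@app.route('/')
def index():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    synopsis = request.form['synopsis']
    X = vectorizer.transform([synopsis])
    tags = binarizer.inverse_transform(model.predict(X))
    return render_template('index.html', tags=tags)

if __name__ == '__main__':
    app.run(debug=True)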

Let’s run it.

Figure 41 : Index Page

After opening the Index page, enter a Plot Synopsis to predict the tag

Figure 42 : Predicted Tag

So, our model predicted a tag for the synopsis we gave, i.e., “Flashback”.

Conclusion :

  1. The maximum micro-averaged F1 score we obtained in the entire project is 0.59.
  2. We are generally used to seeing accuracy above 90%, but we did not have a big dataset; still, we got a decent F1 score.
  3. For Topic Modelling, the highest F1 score we got is 0.374, for LogisticRegression.

References :

  1. Research Paper: https://arxiv.org/pdf/1802.07858.pdf
  2. Conceptual Help: Applied AI Course
  3. Dataset: https://www.kaggle.com/cryptexcode/mpst-movie-plot-synopses-with-tags
  4. GitHub Link: https://github.com/sandeeppanda22/Movie-Tag-Prediction
  5. LinkedIn Profile: https://www.linkedin.com/in/sandeepkumarpanda/
