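Mining the Similarity Between Movies from Their Plot Summaries with Natural Language Processing (NLP)

Using NLP to analyze plot-summary text and quantify how similar movies are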

Sharko Shen

Published in Data Science & LeetCode for Kindergarten

Apr 25, 2020

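Introduction & Goal:

Movies come in countless varieties. Established genre labels already exist, but here we take the plot summaries of each movie from IMDb and Wikipedia and group the movies with unsupervised learning, a branch of machine learning that finds structure in data that carries no labels.

Step 1: Load the data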

# Import modules
import numpy as np
import pandas as pd
import nltk

# Set seed for reproducibility
np.random.seed(5)

# Read in IMDb and Wikipedia movie data (both in same file)
movies_df = pd.read_csv('datasets/movies.csv')

print("Number of movies loaded: %s " % (len(movies_df)))

# Display the data
movies_df

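Note: the dataset contains 100 movies. Each row holds the movie title, its genre, and the plot text from Wikipedia and from IMDb.

Step 2: Combine the Wikipedia and IMDb text

Both columns describe the plot, but they differ in wording and in the details they cover. Merging them gives a more complete plot summary for each movie.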

# Combine wiki_plot and imdb_plot into a single column
movies_df['plot'] = movies_df['wiki_plot'].astype(str) + "\n" + \
                 movies_df['imdb_plot'].astype(str)

# Inspect the new DataFrame
movies_df.head()
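One detail worth checking when concatenating text columns is missing values. The article does not say whether movies.csv contains any, so the following is only a sketch on a hypothetical two-row frame (`toy_df` and its contents are made up for illustration). Depending on the pandas version, `astype(str)` may turn a missing value into literal text such as "nan", whereas `fillna("")` leaves that part of the combined plot empty.

```python
import pandas as pd

# Hypothetical two-row frame; the second movie has no IMDb plot
toy_df = pd.DataFrame({
    "title": ["Movie A", "Movie B"],
    "wiki_plot": ["A robot saves the city.", "A family runs a business."],
    "imdb_plot": ["The city is in danger.", None],
})

# Replace missing plots with empty strings before concatenating
toy_df["plot"] = toy_df["wiki_plot"].fillna("") + "\n" + toy_df["imdb_plot"].fillna("")

print(toy_df["plot"].tolist())
```

Whether this variant is preferable depends on the data; if every movie has both plots, the two approaches give the same text.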

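Step 3: Tokenization

Tokenization splits a passage or sentence into individual words (tokens). Text analysis works at the level of a vocabulary: a vocabulary cannot look up a whole sentence, but it can look up a single word.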

import re

# Tokenize a paragraph into sentences and store in sent_tokenized
sent_tokenized = [sent for sent in nltk.sent_tokenize("""
                        Today (May 19, 2016) is his only daughter's wedding. 
                        Vito Corleone is the Godfather.
                        """)]

# Word Tokenize first sentence from sent_tokenized, save as words_tokenized
words_tokenized = [word for word in nltk.word_tokenize(sent_tokenized[0])]

# Remove tokens that do not contain any letters from words_tokenized
filtered = [word for word in words_tokenized if re.search('[a-zA-Z]', word)]

# Display filtered words to observe words after tokenization
filtered

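Explanation:

1. As a demonstration, take this short plot description:

Today (May 19, 2016) is his only daughter's wedding.
Vito Corleone is the Godfather.

2. The nltk function sent_tokenize splits the text into sentences. The result is a list of two sentences, because the sentence tokenizer splits at sentence-ending punctuation such as the period. The result:

["Today (May 19, 2016) is his only daughter's wedding.", 'Vito Corleone is the Godfather.']

3. Now we take the element at index 0 of the list, the first sentence (Today…wedding.), and split it into individual words.

['Today', '(', 'May', '19', ',', '2016', ')', 'is', 'his', 'only', 'daughter', "'s", 'wedding', '.']

The sentence is split correctly, but it contains many symbols and numbers we do not need and have to remove.

4. We use re (regular expressions) to filter the tokens, keeping only those that contain at least one letter: re.search('[a-zA-Z]', word). The tokens that remain are: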

['Today', 'May', 'is', 'his', 'only', 'daughter', "'s", 'wedding']
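The filtering in point 4 can be tried without nltk by starting from the token list shown in point 3. This is a small sketch, not a replacement for the tokenizer; the list is copied from the output above.

```python
import re

# Tokens as shown in point 3 (output of nltk.word_tokenize on the first sentence)
words_tokenized = ['Today', '(', 'May', '19', ',', '2016', ')', 'is', 'his',
                   'only', 'daughter', "'s", 'wedding', '.']

# Keep a token only if it contains at least one ASCII letter
filtered = [word for word in words_tokenized if re.search('[a-zA-Z]', word)]

print(filtered)
# ['Today', 'May', 'is', 'his', 'only', 'daughter', "'s", 'wedding']
```

Note that the pattern only requires one letter somewhere in the token, so a mixed token such as "19th" would be kept, while pure numbers and punctuation are dropped.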

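Step 4: Stemming

Words appear in different forms (noun, verb, adjective, adverb, and so on), yet many of these forms are variations of one word with one meaning. Stemming reduces such variants to a common stem.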

# Import the SnowballStemmer to perform stemming
from nltk.stem.snowball import SnowballStemmer

# Create an English language SnowballStemmer object
stemmer = SnowballStemmer("english")

# Print filtered to observe words without stemming
print("Without stemming: ", filtered)

# Stem the words from filtered and store in stemmed_words
stemmed_words = [stemmer.stem(word) for word in filtered]

# Print the stemmed_words to observe words after stemming
print("After stemming:   ", stemmed_words)

Compare the results:

Without stemming:  ['Today', 'May', 'is', 'his', 'only', 'daughter', "'s", 'wedding']
After stemming:    ['today', 'may', 'is', 'his', 'onli', 'daughter', "'s", 'wed']

Step 5: Wrap the two steps above in a function

# Define a function to perform both stemming and tokenization
def tokenize_and_stem(text):

    # Tokenize by sentence, then by word
    tokens = [word for sentence in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sentence)]

    # Filter out raw tokens to remove noise
    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]

    # Stem the filtered_tokens
    stems = [stemmer.stem(token) for token in filtered_tokens]

    return stems

words_stemmed = tokenize_and_stem("Today (May 19, 2016) is his only daughter's wedding.")
print(words_stemmed)

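Step 6: Create the TfidfVectorizer

To let the computer work with text we must turn words into numbers. One approach, CountVectorizer, counts how often each word occurs and uses the count as a measure of importance. The weakness is that very common words such as "the" occur many times yet carry little meaning. The TF-IDF Vectorizer improves on this. For example, in a technology-themed movie the word "robot" may occur many times while appearing in only a small share of all 100 movies; TF-IDF therefore gives it a high weight for that movie, which matches our intuition that "robot" is an important word there.

In short, TF-IDF is a good model for judging how important a word is. We use the TF-IDF Vectorizer to turn text into numbers, a process called text vectorization.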
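To make the "the" versus "robot" intuition concrete, here is a hand-rolled sketch in plain Python. The three-document corpus is hypothetical, and the weighting uses the smoothed form idf = ln((1 + n) / (1 + df)) + 1, which is the default formula documented for scikit-learn; scikit-learn additionally normalizes each document vector to unit length, a step this sketch leaves out.

```python
import math
from collections import Counter

# Hypothetical toy corpus: three tiny "plots", already tokenized
docs = [
    ["the", "robot", "saves", "the", "city", "robot"],
    ["the", "family", "runs", "the", "business"],
    ["the", "soldier", "leaves", "the", "war"],
]

n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return {term: tf * idf} for one document, using a smoothed idf."""
    tf = Counter(doc)
    return {
        term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
        for term, count in tf.items()
    }

weights = tfidf(docs[0])
for term, weight in sorted(weights.items(), key=lambda item: -item[1]):
    print(f"{term:6s} {weight:.3f}")
```

In the first document "the" and "robot" both occur twice, but "the" appears in every document while "robot" appears in only one, so "robot" ends up with the larger weight.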

# Import TfidfVectorizer to create TF-IDF vectors
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer

# Instantiate TfidfVectorizer object with stopwords and tokenizer
# parameters for efficient processing of text
tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=200000,
                                 min_df=0.2, stop_words='english',
                                 use_idf=True, tokenizer=tokenize_and_stem,
                                 ngram_range=(1,3))

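Here we build a TF-IDF model and set its parameters.

max_df: given as a float, it is the largest share of documents a term may appear in (here, terms found in more than 80% of the plots are dropped); given as an integer, it is the largest number of documents.

min_df: the counterpart of max_df; a term must appear in at least this share of documents (here 20%) to be kept.

max_features: the maximum size of the vocabulary. If the limit is reached, the terms with the highest frequency across the corpus are kept. The vocabulary maps each term to a column position, and the value stored in that column is the term's weight.

stop_words: removes words that carry little meaning, such as "the", "i", "there", "will", and "what".

tokenizer: the tokenize_and_stem function defined above.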
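The effect of min_df and max_df can be sketched in plain Python as a filter on document frequency. The corpus and the helper name `filter_vocabulary` are hypothetical, and the thresholds differ from the article's 0.2 and 0.8 only so that the tiny example shows both cut-offs at work.

```python
from collections import Counter

# Hypothetical toy corpus of five tokenized documents
docs = [
    ["the", "robot", "war"],
    ["the", "family", "war"],
    ["the", "family", "wedding"],
    ["the", "robot", "city"],
    ["the", "soldier", "war"],
]

def filter_vocabulary(docs, min_df, max_df):
    """Keep terms whose document-frequency ratio lies in [min_df, max_df]."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return sorted(
        term for term, count in df.items()
        if min_df <= count / n_docs <= max_df
    )

vocab = filter_vocabulary(docs, min_df=0.3, max_df=0.8)
print(vocab)  # ['family', 'robot', 'war']
```

"the" is dropped because it appears in every document, and "wedding", "city", and "soldier" are dropped because each appears in only one of the five.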

Step 7: Fit the model to the text data and transform it

# Fit and transform the tfidf_vectorizer with the "plot" of each movie
# to create a vector representation of the plot summaries
tfidf_matrix = tfidf_vectorizer.fit_transform([x for x in movies_df["plot"]])

print(tfidf_matrix.shape)

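The result is a TF-IDF matrix of shape (100, 564).

That is, there are 100 movies and the vocabulary holds 564 terms, so each movie's plot summary is represented as a vector of 564 features.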

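Step 8: KMeans clustering:

KMeans is an unsupervised learning algorithm that groups data into clusters. As the name suggests, we choose the number of clusters K, and the algorithm repeatedly assigns each data point to the cluster whose mean (centroid) is nearest, then recomputes the means.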
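The assign-then-update loop can be written in a few lines of NumPy. This is a minimal sketch of the basic procedure (often called Lloyd's algorithm) on hypothetical 2-D points, not scikit-learn's implementation, which among other things uses a smarter initialization; the function name, the `n_iter` and `seed` parameters, and the data are all assumptions for illustration.

```python
import numpy as np

def kmeans(points, k, n_iter=20, seed=5):
    """Minimal k-means sketch: returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Hypothetical 2-D data: two well-separated groups of three points
points = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                   [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans(points, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is that the first three points share one label and the last three share the other.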

# Import k-means to perform clusters
from sklearn.cluster import KMeans

# Create a KMeans object with 5 clusters and save as km
km = KMeans(n_clusters=5)

# Fit the k-means object with tfidf_matrix
km.fit(tfidf_matrix)

clusters = km.labels_.tolist()

# Create a column cluster to denote the generated cluster for each movie
movies_df["cluster"] = clusters

# Display number of films per cluster (clusters from 0 to 4)
movies_df['cluster'].value_counts()

In the result, 58 of the 100 movies were assigned to the second cluster, 17 to the fourth, and so on.

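Step 9: Calculate similarity distance

Next we compute how similar the plots are to one another. Picture two vectors in a two-dimensional plane as two arrows: the cosine of the angle between them measures how closely they point in the same direction, which is their similarity. Subtracting it from 1 gives a distance: distance = 1 - cosine similarity.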

# Import cosine_similarity to calculate similarity of movie plots
from sklearn.metrics.pairwise import cosine_similarity

# Calculate the similarity distance
similarity_distance = 1 - cosine_similarity(tfidf_matrix)
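The same quantity can be computed by hand with NumPy, which shows what the library call does: scale every row to length 1, take dot products, and subtract from 1. The sketch below uses three hypothetical weight vectors and assumes no row is all zeros; `cosine_distance_matrix` is an illustrative helper, not part of scikit-learn.

```python
import numpy as np

def cosine_distance_matrix(X):
    """Pairwise 1 - cos(angle) between the rows of X (rows must be non-zero)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / norms              # scale every row to length 1
    similarity = unit @ unit.T    # dot products of unit vectors are cosines
    return 1.0 - similarity

# Hypothetical 3-term weight vectors for three documents
X = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 4.0, 0.0],   # same direction as row 0, twice as long
    [0.0, 0.0, 3.0],   # shares no terms with rows 0 and 1
])

distance = cosine_distance_matrix(X)
print(np.round(distance, 3))
```

Rows 0 and 1 point in the same direction, so their distance is 0 even though their lengths differ; row 2 shares no terms with them, so its distance to both is 1.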

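Step 10: Another clustering method (hierarchical clustering): draw a dendrogram based on similarity

The previous step computed the similarity distance between every pair of movies; now we plot it.

Import the tools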

# Import matplotlib.pyplot for plotting graphs
import matplotlib.pyplot as plt

# Configure matplotlib to display the output inline
%matplotlib inline

# Import modules necessary to plot dendrogram
from scipy.cluster.hierarchy import linkage, dendrogram

Plot

# Create mergings matrix
# (note: SciPy's linkage treats a 2-D array as observation vectors, so each
# row of similarity_distance acts as that movie's feature vector here)
mergings = linkage(similarity_distance, method='complete')

# Plot the dendrogram, using title as label column
dendrogram_ = dendrogram(mergings,
               labels=[x for x in movies_df["title"]],
               leaf_rotation=90,
               leaf_font_size=16,
)

# Adjust the plot
fig = plt.gcf()
_ = [lbl.set_color('r') for lbl in plt.gca().get_xmajorticklabels()]
fig.set_size_inches(108, 21)

# Show the plotted dendrogram
plt.show()

Put simply, the movies are merged step by step, two groups at a time, according to how similar they are.
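The pairwise merging can be sketched in plain Python. This is a naive version of agglomerative clustering with complete linkage (the distance between two groups is the largest distance between any of their members), written for readability rather than speed; the four labels and their distance matrix are hypothetical, and SciPy's linkage is the tool to use on real data.

```python
from itertools import combinations

def complete_linkage(distance, labels):
    """Naive agglomerative clustering; returns the merges in order."""
    def gap(pair):
        # Complete linkage: largest pairwise distance between two groups
        a, b = pair
        return max(distance[i][j] for i in a for j in b)

    clusters = [(i,) for i in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        # Merge the two groups that are currently closest
        a, b = min(combinations(clusters, 2), key=gap)
        merges.append(([labels[i] for i in a], [labels[j] for j in b], gap((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

# Hypothetical distances between four movies
labels = ["A", "B", "C", "D"]
distance = [
    [0.0, 0.1, 0.8, 0.9],
    [0.1, 0.0, 0.7, 0.85],
    [0.8, 0.7, 0.0, 0.2],
    [0.9, 0.85, 0.2, 0.0],
]

for left, right, height in complete_linkage(distance, labels):
    print(left, "+", right, "at", height)
```

A and B merge first (distance 0.1), then C and D (0.2), and finally the two pairs join at 0.9. In a dendrogram these merge distances are the heights at which the branches meet.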

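Conclusion:

1. A simple example: which movie is most similar to Braveheart?

2. This project first processed the text and converted it into vectors with TF-IDF, then grouped the movies with two methods, KMeans and similarity distance.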
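A question like the one in point 1 can be answered by looking up the movie's row in the distance matrix and taking the smallest entry other than the movie itself. The sketch below uses hypothetical titles and distances; `most_similar` is an illustrative helper, not part of the original project.

```python
import numpy as np

def most_similar(title, titles, distance):
    """Return the title whose plot is closest to `title` (excluding itself)."""
    idx = titles.index(title)
    row = np.asarray(distance[idx], dtype=float).copy()
    row[idx] = np.inf          # never return the query movie itself
    return titles[int(row.argmin())]

# Hypothetical titles and distances, for illustration only
titles = ["Movie A", "Movie B", "Movie C"]
distance = np.array([
    [0.0, 0.3, 0.9],
    [0.3, 0.0, 0.7],
    [0.9, 0.7, 0.0],
])

print(most_similar("Movie A", titles, distance))  # Movie B
```

With the article's data the call would presumably look like most_similar("Braveheart", list(movies_df["title"]), similarity_distance), assuming the title is spelled exactly that way in the dataset.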

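In the future, customers' comments or frequent searches could perhaps be analyzed in the same way to infer their preferences and target movie advertising.

References: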
DataCamp — Project: Find Movie Similarity from Plot Summaries https://learn.datacamp.com/projects/648 created by Anubhav Singh Founder at The Code Foundation