Build a machine learning classifier to tell who's tweeting: Trump or Trudeau?
Using machine learning to determine whether a tweet was posted by Trump or Trudeau
Goal:
Today we walk through the steps of building a machine learning model and using it to tell which tweets were posted by Trump and which by Trudeau.
Step 1: Import the required modules
The models used this time are MultinomialNB and LinearSVC.
# Set seed for reproducibility
import random
random.seed(53)

# Import all we need from sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn import metrics
Step 2: Load the data
The data consists of tweets from November 2017. Split it into training data (67%) and testing data (33%).
import pandas as pd

# Load data
tweet_df = pd.read_csv('datasets/tweets.csv')

# Create target
y = tweet_df['author']

# Split training and testing data
X_train, X_test, y_train, y_test = train_test_split(
    tweet_df['status'], y, random_state=53, test_size=.33)
Step 3: Vectorize the text
Next, convert the text into numbers.
# Initialize count vectorizer
count_vectorizer = CountVectorizer(stop_words='english', max_df=0.9, min_df=0.05)

# Create count train and test variables
count_train = count_vectorizer.fit_transform(X_train)
count_test = count_vectorizer.transform(X_test)

# Initialize tfidf vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.9, min_df=0.05)

# Create tfidf train and test variables
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)
Two vectorizers are used here, CountVectorizer and TfidfVectorizer, both of which were introduced in detail in an earlier article.
CountVectorizer judges how important each word is simply by counting how many times it appears. The problem is that very common words such as "the" appear many times without carrying much meaning. TfidfVectorizer is the improved tool: for example, in one movie about technology the word "robot" appears many times, yet it shows up in only a small fraction of all 100 movies, so we know "robot" is an important word for that particular movie.
We then transform the text with both methods, as in the small sketch below.
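To make the difference concrete, here is a small toy sketch (my own example, not part of the original project; it assumes a recent scikit-learn where get_feature_names_out is available) showing how TF-IDF downweights a word that appears in every document while raw counts do not:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Tiny corpus: "movie" appears in every document, "robot" only in the first one
docs = ["robot movie with a robot hero",
        "romantic movie",
        "historical movie"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)
print(count_vec.get_feature_names_out())
print(counts.toarray())          # raw counts: "robot" = 2 in doc 0, "movie" = 1 everywhere

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)
print(tfidf.toarray().round(2))  # "robot" now gets a much larger weight in doc 0 than "movie"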
Step 4: Fit a Naive Bayes model
Feed the two sets of transformed features into a Multinomial Naive Bayes model and see how each one performs.
# Create a MultinomialNB model on the TF-IDF features
tfidf_nb = MultinomialNB()
tfidf_nb.fit(tfidf_train, y_train)

# Run predict on your TF-IDF test data to get your predictions
tfidf_nb_pred = tfidf_nb.predict(tfidf_test)

# Calculate the accuracy of your predictions
tfidf_nb_score = metrics.accuracy_score(y_test, tfidf_nb_pred)

# Create a MultinomialNB model on the count features
count_nb = MultinomialNB()
count_nb.fit(count_train, y_train)

# Run predict on your count test data to get your predictions
count_nb_pred = count_nb.predict(count_test)

# Calculate the accuracy of your predictions
count_nb_score = metrics.accuracy_score(y_test, count_nb_pred)

print('NaiveBayes Tfidf Score: ', tfidf_nb_score)
print('NaiveBayes Count Score: ', count_nb_score)
NaiveBayes Tfidf Score: 0.803030303030303
NaiveBayes Count Score: 0.7954545454545454
Feeding the TF-IDF-vectorized text into the model gives slightly better results.
Step 5: Confusion Matrix
A confusion matrix tells us how many tweets each model classified incorrectly.
%matplotlib inline
from datasets.helper_functions import plot_confusion_matrix

# Calculate the confusion matrices for the tfidf_nb and count_nb models
tfidf_nb_cm = metrics.confusion_matrix(y_test, tfidf_nb_pred, labels=['Donald J. Trump', 'Justin Trudeau'])
count_nb_cm = metrics.confusion_matrix(y_test, count_nb_pred, labels=['Donald J. Trump', 'Justin Trudeau'])

# Plot the tfidf_nb_cm confusion matrix
plot_confusion_matrix(tfidf_nb_cm, classes=['Donald J. Trump', 'Justin Trudeau'], title="TF-IDF NB Confusion Matrix")

# Plot the count_nb_cm confusion matrix without overwriting the first plot
plot_confusion_matrix(count_nb_cm, classes=['Donald J. Trump', 'Justin Trudeau'], title="Count NB Confusion Matrix", figure=1)
As shown in the figures:
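The plot_confusion_matrix helper comes from the project's datasets folder rather than from scikit-learn. As a rough alternative sketch (my addition, not the project's code), the same matrix can also be inspected as a labeled table:

import pandas as pd

labels = ['Donald J. Trump', 'Justin Trudeau']

# Rows are the true author, columns the predicted author;
# the off-diagonal cells count the misclassified tweets
print(pd.DataFrame(tfidf_nb_cm,
                   index=['true: ' + l for l in labels],
                   columns=['pred: ' + l for l in labels]))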
Step 6: Try a linear SVM instead
# Create a LinearSVC model
tfidf_svc = LinearSVC()
tfidf_svc.fit(tfidf_train, y_train)

# Run predict on your tfidf test data to get your predictions
tfidf_svc_pred = tfidf_svc.predict(tfidf_test)

# Calculate your accuracy using the metrics module
tfidf_svc_score = metrics.accuracy_score(y_test, tfidf_svc_pred)
print("LinearSVC Score: %0.3f" % tfidf_svc_score)

# Calculate the confusion matrix for the tfidf_svc model
svc_cm = metrics.confusion_matrix(y_test, tfidf_svc_pred, labels=['Donald J. Trump', 'Justin Trudeau'])

# Plot the confusion matrix using the plot_confusion_matrix function
plot_confusion_matrix(svc_cm, classes=['Donald J. Trump', 'Justin Trudeau'], title="TF-IDF LinearSVC Confusion Matrix")
LinearSVC Score: 0.841
The classification is even more accurate.
Step 7: List the important words
from datasets.helper_functions import plot_and_return_top_features

# Import pprint from pprint
from pprint import pprint

# Get the top features using the plot_and_return_top_features function, your top model and the tfidf vectorizer
top_features = plot_and_return_top_features(tfidf_svc, tfidf_vectorizer)

# pprint the top features
pprint(top_features)
Listing some of the heavily weighted words in the model, we can see plenty of French, which presumably comes from Trudeau's tweets.
Note: some of the custom functions come from https://www.kaggle.com/mmmarchetti/identifying-who-is-tweeting/code
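Since that helper isn't shown here, the following is a rough sketch of what it roughly does, pairing the LinearSVC coefficients with the TF-IDF vocabulary (my reconstruction, assuming a recent scikit-learn; the actual Kaggle helper may differ):

import numpy as np

# For a binary LinearSVC, coef_[0] holds one weight per vocabulary term.
# Negative weights push predictions toward the first class in tfidf_svc.classes_
# ('Donald J. Trump'), positive weights toward the second ('Justin Trudeau').
feature_names = np.array(tfidf_vectorizer.get_feature_names_out())
weights = tfidf_svc.coef_[0]
order = np.argsort(weights)

print(list(zip(weights[order[:10]], feature_names[order[:10]])))    # most Trump-leaning terms
print(list(zip(weights[order[-10:]], feature_names[order[-10:]])))  # most Trudeau-leaning terms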
Step 8: Write tweets that imitate Trump and Trudeau and see whether the model can classify them correctly!
# Write two tweets as strings, one which you want to classify as Trump and one as Trudeau
trump_tweet = 'fake news'
trudeau_tweet = 'canada'

# Vectorize each tweet using the TF-IDF vectorizer's transform method
trump_tweet_vectorized = tfidf_vectorizer.transform([trump_tweet])
trudeau_tweet_vectorized = tfidf_vectorizer.transform([trudeau_tweet])

# Call the predict method on your vectorized tweets
trump_tweet_pred = tfidf_svc.predict(trump_tweet_vectorized)
trudeau_tweet_pred = tfidf_svc.predict(trudeau_tweet_vectorized)

print("Predicted Trump tweet", trump_tweet_pred)
print("Predicted Trudeau tweet", trudeau_tweet_pred)
Predicted Trump tweet ['Donald J. Trump']
Predicted Trudeau tweet ['Justin Trudeau']
It works!
Conclusion:
Although the model reached an accuracy of 84% this time, that figure is likely sensitive to how stop words were handled during text preprocessing; even if changing the preprocessing lowers the accuracy at first, the model can be made to fit the data better by tuning the parameters.
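As one example of that kind of tuning, a loop like the following could compare a few vectorizer settings (this is purely illustrative on my part; the parameter values are assumptions, not from the original project):

# Hypothetical tuning loop: try a few document-frequency cutoffs for the TF-IDF features
for max_df, min_df in [(0.9, 0.05), (0.9, 0.01), (1.0, 2)]:
    vec = TfidfVectorizer(stop_words='english', max_df=max_df, min_df=min_df)
    svc = LinearSVC()
    svc.fit(vec.fit_transform(X_train), y_train)
    score = metrics.accuracy_score(y_test, svc.predict(vec.transform(X_test)))
    print("max_df=%s, min_df=%s: accuracy=%.3f" % (max_df, min_df, score))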
References:
DataCamp Project: Who's Tweeting? Trump or Trudeau? (https://projects.datacamp.com/projects/467), created by Katharine Jarmul, founder of kjamistan.