Build a machine learning classifier to tell who's tweeting: Trump or Trudeau?
Using machine learning to determine whether a tweet was posted by Trump or Trudeau
Goal:
Today we walk through the steps of building a machine learning model and using it to tell which tweets were posted by Trump and which by Trudeau.
Step 1: Import the required modules
The models used this time are MultinomialNB and LinearSVC.
# Set seed for reproducibility
import random
random.seed(53)

# Import all we need from sklearn
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn import metrics
Step 2: Load the data
The data consists of tweets from November 2017. Split it into training data (67%) and testing data (33%).
import pandas as pd

# Load data
tweet_df = pd.read_csv('datasets/tweets.csv')

# Create target
y = tweet_df['author']

# Split training and testing data
X_train, X_test, y_train, y_test = train_test_split(
    tweet_df['status'], y, random_state=53, test_size=.33)
Step 3: Vectorize the text
Next, convert the text into numbers.
# Initialize count vectorizer
count_vectorizer = CountVectorizer(stop_words='english', max_df=0.9, min_df=0.05)

# Create count train and test variables
count_train = count_vectorizer.fit_transform(X_train)
count_test = count_vectorizer.transform(X_test)

# Initialize tfidf vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.9, min_df=0.05)

# Create tfidf train and test variables
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)
Two vectorizers are used here, CountVectorizer and TfidfVectorizer, both of which were introduced in detail in an earlier article.
CountVectorizer judges how important each word is simply by counting how many times it appears. The problem is that very common words such as "the" appear many times without carrying much meaning. TfidfVectorizer is the improved tool: for example, in one movie about technology the word "robot" appears many times, yet it shows up in only a small fraction of all 100 movies, so we know "robot" is an important word for that particular movie.
We then transform the text with both methods, as in the small sketch below.
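To make the difference concrete, here is a small toy sketch (my own example, not part of the original project; it assumes a recent scikit-learn where get_feature_names_out is available) showing how TF-IDF downweights a word that appears in every document while raw counts do not:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Tiny corpus: "movie" appears in every document, "robot" only in the first one
docs = ["robot movie with a robot hero",
        "romantic movie",
        "historical movie"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(docs)
print(count_vec.get_feature_names_out())
print(counts.toarray())          # raw counts: "robot" = 2 in doc 0, "movie" = 1 everywhere

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(docs)
print(tfidf.toarray().round(2))  # "robot" now gets a much larger weight in doc 0 than "movie"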
Step 4: Fit a Naive Bayes model
Feed the two sets of transformed features into a Multinomial Naive Bayes model and see how each one performs.
# Create a MultinomialNB model on the TF-IDF features
tfidf_nb = MultinomialNB()
tfidf_nb.fit(tfidf_train, y_train)

# Run predict on your TF-IDF test data to get your predictions
tfidf_nb_pred = tfidf_nb.predict(tfidf_test)

# Calculate the accuracy of your predictions
tfidf_nb_score = metrics.accuracy_score(y_test, tfidf_nb_pred)

# Create a MultinomialNB model on the count features
count_nb = MultinomialNB()
count_nb.fit(count_train, y_train)

# Run predict on your count test data to get your predictions
count_nb_pred = count_nb.predict(count_test)

# Calculate the accuracy of your predictions
count_nb_score = metrics.accuracy_score(y_test, count_nb_pred)

print('NaiveBayes Tfidf Score: ', tfidf_nb_score)
print('NaiveBayes Count Score: ', count_nb_score)
NaiveBayes Tfidf Score: 0.803030303030303
NaiveBayes Count Score: 0.7954545454545454
Feeding the TF-IDF-vectorized text into the model gives slightly better results.
Step 5: Confusion Matrix
A confusion matrix tells us how many tweets each model classified incorrectly.
%matplotlib inline
from datasets.helper_functions import plot_confusion_matrix

# Calculate the confusion matrices for the tfidf_nb and count_nb models
tfidf_nb_cm = metrics.confusion_matrix(y_test, tfidf_nb_pred, labels=['Donald J. Trump', 'Justin Trudeau'])
count_nb_cm = metrics.confusion_matrix(y_test, count_nb_pred, labels=['Donald J. Trump', 'Justin Trudeau'])

# Plot the tfidf_nb_cm confusion matrix
plot_confusion_matrix(tfidf_nb_cm, classes=['Donald J. Trump', 'Justin Trudeau'], title="TF-IDF NB Confusion Matrix")

# Plot the count_nb_cm confusion matrix without overwriting the first plot
plot_confusion_matrix(count_nb_cm, classes=['Donald J. Trump', 'Justin Trudeau'], title="Count NB Confusion Matrix", figure=1)
As shown in the figures:
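The plot_confusion_matrix helper comes from the project's datasets folder rather than from scikit-learn. As a rough alternative sketch (my addition, not the project's code), the same matrix can also be inspected as a labeled table:

import pandas as pd

labels = ['Donald J. Trump', 'Justin Trudeau']

# Rows are the true author, columns the predicted author;
# the off-diagonal cells count the misclassified tweets
print(pd.DataFrame(tfidf_nb_cm,
                   index=['true: ' + l for l in labels],
                   columns=['pred: ' + l for l in labels]))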
Step 6: Try a linear SVM instead
# Create a LinearSVC model
tfidf_svc = LinearSVC()
tfidf_svc.fit(tfidf_train, y_train)

# Run predict on your tfidf test data to get your predictions
tfidf_svc_pred = tfidf_svc.predict(tfidf_test)

# Calculate your accuracy using the metrics module
tfidf_svc_score = metrics.accuracy_score(y_test, tfidf_svc_pred)
print("LinearSVC Score: %0.3f" % tfidf_svc_score)

# Calculate the confusion matrix for the tfidf_svc model
svc_cm = metrics.confusion_matrix(y_test, tfidf_svc_pred, labels=['Donald J. Trump', 'Justin Trudeau'])

# Plot the confusion matrix using the plot_confusion_matrix function
plot_confusion_matrix(svc_cm, classes=['Donald J. Trump', 'Justin Trudeau'], title="TF-IDF LinearSVC Confusion Matrix")
LinearSVC Score: 0.841
The classification is even more accurate.
Step 7: List the important words
from datasets.helper_functions import plot_and_return_top_features

# Import pprint from pprint
from pprint import pprint

# Get the top features using the plot_and_return_top_features function, your top model and the tfidf vectorizer
top_features = plot_and_return_top_features(tfidf_svc, tfidf_vectorizer)

# pprint the top features
pprint(top_features)
Listing some of the heavily weighted words in the model, we can see plenty of French, which presumably comes from Trudeau's tweets.
Note: some of the custom functions come from https://www.kaggle.com/mmmarchetti/identifying-who-is-tweeting/code
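Since that helper isn't shown here, the following is a rough sketch of what it roughly does, pairing the LinearSVC coefficients with the TF-IDF vocabulary (my reconstruction, assuming a recent scikit-learn; the actual Kaggle helper may differ):

import numpy as np

# For a binary LinearSVC, coef_[0] holds one weight per vocabulary term.
# Negative weights push predictions toward the first class in tfidf_svc.classes_
# ('Donald J. Trump'), positive weights toward the second ('Justin Trudeau').
feature_names = np.array(tfidf_vectorizer.get_feature_names_out())
weights = tfidf_svc.coef_[0]
order = np.argsort(weights)

print(list(zip(weights[order[:10]], feature_names[order[:10]])))    # most Trump-leaning terms
print(list(zip(weights[order[-10:]], feature_names[order[-10:]])))  # most Trudeau-leaning terms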
Step 8: Write tweets that imitate Trump and Trudeau and see whether the model can classify them correctly!
# Write two tweets as strings, one which you want to classify as Trump and one as Trudeau
trump_tweet = 'fake news'
trudeau_tweet = 'canada'

# Vectorize each tweet using the TF-IDF vectorizer's transform method
trump_tweet_vectorized = tfidf_vectorizer.transform([trump_tweet])
trudeau_tweet_vectorized = tfidf_vectorizer.transform([trudeau_tweet])

# Call the predict method on your vectorized tweets
trump_tweet_pred = tfidf_svc.predict(trump_tweet_vectorized)
trudeau_tweet_pred = tfidf_svc.predict(trudeau_tweet_vectorized)

print("Predicted Trump tweet", trump_tweet_pred)
print("Predicted Trudeau tweet", trudeau_tweet_pred)
Predicted Trump tweet ['Donald J. Trump']
Predicted Trudeau tweet ['Justin Trudeau']
It works!
Conclusion:
Although the model reached an accuracy of 84% this time, that figure is likely sensitive to how stop words were handled during text preprocessing; even if changing the preprocessing lowers the accuracy at first, the model can be made to fit the data better by tuning the parameters.
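As one example of that kind of tuning, a loop like the following could compare a few vectorizer settings (this is purely illustrative on my part; the parameter values are assumptions, not from the original project):

# Hypothetical tuning loop: try a few document-frequency cutoffs for the TF-IDF features
for max_df, min_df in [(0.9, 0.05), (0.9, 0.01), (1.0, 2)]:
    vec = TfidfVectorizer(stop_words='english', max_df=max_df, min_df=min_df)
    svc = LinearSVC()
    svc.fit(vec.fit_transform(X_train), y_train)
    score = metrics.accuracy_score(y_test, svc.predict(vec.transform(X_test)))
    print("max_df=%s, min_df=%s: accuracy=%.3f" % (max_df, min_df, score))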
References:
DataCamp Project: Who's Tweeting? Trump or Trudeau? (https://projects.datacamp.com/projects/467), created by Katharine Jarmul, founder of kjamistan.