Just for Fun: Filtering Spam SMS with Naive Bayes vs. a Neural Network (Can ML Really Not Compete with NN?)

Published in mmp-li · 3 min read · Feb 21, 2019

How could ML possibly compete with Deep Learning... really?

I was just sitting around today when I suddenly remembered my statistics-for-computer-engineering class, where we studied Naive Bayes, sorted out TN, TP, FP and FN, and classified spam email with NB (in theory only). Out of nowhere I wanted to play with Naive Bayes as an actual machine learning model (apparently I had time to kill). After fiddling for a while, a funny thought hit me: none of my previous articles has ever tried RandomizedSearchCV...

It's similar to grid search but faster, although the values it finds aren't always that great.

Its whole point is "fast and easy" hyperparameter tuning for ML models. But lately I've been hooked on Keras (running away from CNTK and TensorFlow to try something simpler), and it occurred to me that Keras is easy to use too, and Deep Learning should beat ML anyway. But that's not always the case...

Today we'll do text classification, which is one part of NLP work. If you haven't read my NLP article yet, go read it first so everything here is easier to follow.

NLP (Natural Language Processing), the (not-so-)new science, the science of the Jedi: classifying email with the power of the Force

May the Force be with you... oops, wrong story

medium.com

And since this is a "just for fun" article, we'll dive straight in: no intro, no theory whatsoever...

Let's go!

First, download the dataset from Kaggle (you have to log in before you can download it).

SMS Spam Collection Dataset

Collection of SMS messages tagged as spam or legitimate

www.kaggle.com

Now we really start!

Import all the libraries we'll use.

import numpy
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import confusion_matrix, f1_score, accuracy_score

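Now open the file and tidy the data up a little.

# Kaggle's spam.csv is not UTF-8 and carries mostly-empty extra columns, so keep only v1 (label) and v2 (text)
file = pd.read_csv('spam.csv', encoding='latin-1')[['v1', 'v2']]
file.drop_duplicates(inplace=True)
file.dropna(inplace=True)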

Let's see how much data we have.

print('Ham :', file[file['v1'] == 'ham'].shape)
print('Spam :', file[file['v1'] == 'spam'].shape)

This prints the shape (rows, columns) of each class.

Split the data first.

arr_text = file['v2'].values
arr_class = file['v1'].values

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(arr_text, arr_class, test_size=0.33, random_state=42)

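Now bring in "frequency counting" for every word that appears.

vect = CountVectorizer()
# Note: this builds the vocabulary from all messages, test set included; fit on X_train only for a stricter evaluation
vect.fit(arr_text)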
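
If "frequency counting" sounds abstract, here is a rough standard-library sketch of what a bag-of-words vectorizer does. It only imitates CountVectorizer's default behavior (lowercasing and keeping tokens of two or more word characters); the toy messages and the helper names `tokenize` and `to_counts` are made up for illustration.

```python
import re
from collections import Counter

# Token rule modeled on CountVectorizer's default: tokens of 2+ word characters
TOKEN = re.compile(r"(?u)\b\w\w+\b")

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN.findall(text.lower())

# Hypothetical toy messages, not taken from the dataset
messages = ["Free prize! Call now", "Call me when you are free"]

# "fit": collect the sorted vocabulary across all messages
vocabulary = sorted({tok for msg in messages for tok in tokenize(msg)})

def to_counts(text):
    """ "transform": count how often each vocabulary word occurs in the text."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

matrix = [to_counts(msg) for msg in messages]
print(vocabulary)
print(matrix)
```

Each row is one message and each column is one vocabulary word, which is the same layout the model receives, only there it is stored as a sparse matrix.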

Now train the model.

train_transformer_x = vect.transform(X_train)
detect_model = MultinomialNB().fit(train_transformer_x, y_train)
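
For the curious, this is roughly what a multinomial Naive Bayes does under the hood, written as a tiny pure-Python sketch. The functions `train_nb` and `predict_nb` and the toy corpus are hypothetical; the `alpha` and `fit_prior` arguments mirror the two hyperparameters of the same name that get tuned later on.

```python
import math
from collections import Counter

def train_nb(docs, labels, alpha=1.0, fit_prior=True):
    """Fit a tiny multinomial Naive Bayes on tokenized documents."""
    classes = sorted(set(labels))
    vocab = sorted({tok for doc in docs for tok in doc})
    word_counts = {c: Counter() for c in classes}
    doc_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc)
    model = {}
    for c in classes:
        # fit_prior=False gives every class the same prior
        prior = doc_counts[c] / len(docs) if fit_prior else 1 / len(classes)
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        # alpha is added to every count so unseen words never get probability 0
        log_lik = {w: math.log((word_counts[c][w] + alpha) / total) for w in vocab}
        model[c] = (math.log(prior), log_lik)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior; unknown words are ignored."""
    scores = {
        c: log_prior + sum(log_lik[w] for w in doc if w in log_lik)
        for c, (log_prior, log_lik) in model.items()
    }
    return max(scores, key=scores.get)

# Hypothetical toy corpus, already tokenized
docs = [["free", "prize", "call"], ["win", "free", "cash"],
        ["see", "you", "tonight"], ["call", "me", "tonight"]]
labels = ["spam", "spam", "ham", "ham"]

model = train_nb(docs, labels, alpha=1.0)
print(predict_nb(model, ["free", "cash", "prize"]))
print(predict_nb(model, ["see", "you", "tonight"]))
```

Training is just counting words per class, which is why this kind of model fits in a blink compared with a neural network.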

Now test it on the held-out data.

test_transformer_x = vect.transform(X_test)
y_pred = detect_model.predict(test_transformer_x)
print('F1 score =', f1_score(y_test, y_pred, average='macro'))
print('Accuracy =', accuracy_score(y_test, y_pred))

This prints the macro F1 score and the accuracy on the test set.
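
Since TN, TP, FP and FN came up at the start, here is a small pure-Python sketch of how accuracy and macro F1 fall out of those four counts. The labels are invented and the helpers `counts` and `f1` are hypothetical names; for real work the scikit-learn functions imported above do the same job.

```python
# Hypothetical true labels and predictions
y_true = ["ham", "ham", "ham", "ham", "spam", "spam", "spam", "ham"]
y_pred = ["ham", "ham", "spam", "ham", "spam", "ham", "spam", "ham"]

def counts(y_true, y_pred, positive):
    """Return (TP, FP, FN, TN) when `positive` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

def f1(y_true, y_pred, positive):
    """F1 for one class (assumes the class appears at least once)."""
    tp, fp, fn, _ = counts(y_true, y_pred, positive)
    return 2 * tp / (2 * tp + fp + fn)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
# Macro F1 is the plain average of the per-class F1 scores
macro_f1 = (f1(y_true, y_pred, "ham") + f1(y_true, y_pred, "spam")) / 2

print("TP, FP, FN, TN for spam:", counts(y_true, y_pred, "spam"))
print("Accuracy =", accuracy)
print("Macro F1 =", macro_f1)
```

Macro F1 weighs ham and spam equally, so it tends to drop faster than accuracy when the rarer class is predicted badly.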

But what we really want to find out is how different GridSearchCV and RandomizedSearchCV are.

Let's try RandomizedSearchCV first.

from sklearn.model_selection import RandomizedSearchCV

parameters = {
    'alpha': numpy.arange(0, 8, 0.01),
    'fit_prior': [True, False]
}
rdsCV = RandomizedSearchCV(MultinomialNB(), param_distributions=parameters, cv=10)
rdsCV.fit(train_transformer_x, y_train)
print('Best Parameter = ', rdsCV.best_params_)

This prints the best parameters it found (the matching cross-validated score is in rdsCV.best_score_).

Now try GridSearchCV.

from sklearn.model_selection import GridSearchCV

parameters = {
    'alpha': numpy.arange(0, 8, 0.01),
    'fit_prior': [True, False]
}
# GridSearchCV tries every combination in the grid instead of a random sample
gsCV = GridSearchCV(MultinomialNB(), param_grid=parameters, cv=10)
gsCV.fit(train_transformer_x, y_train)
print('Best Parameter = ', gsCV.best_params_)

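Random search only samples a handful of combinations (10 by default), so it can skip right over the important values and may not land on the best option. If you want the best setting the grid has to offer, use GridSearchCV, which tries every combination.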
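
To put numbers on that trade-off, here is a quick standard-library sketch that counts how many candidates each strategy looks at for the search space above. It imitates the sampling with `random.sample` rather than calling scikit-learn, and the seed is arbitrary.

```python
import itertools
import random

# Same search space as above: alpha in [0, 8) with step 0.01, fit_prior True/False
alphas = [i / 100 for i in range(800)]
fit_priors = [True, False]

# Grid search: every (alpha, fit_prior) combination
grid = list(itertools.product(alphas, fit_priors))

random.seed(0)  # arbitrary seed so the sketch is repeatable
n_iter = 10     # RandomizedSearchCV's default number of sampled candidates
sampled = random.sample(grid, n_iter)

cv = 10  # 10-fold cross-validation, as in the article
print("grid candidates:", len(grid), "-> CV fits:", len(grid) * cv)
print("random candidates:", len(sampled), "-> CV fits:", len(sampled) * cv)
print("share of the grid that random search tries:", len(sampled) / len(grid))
```

With the defaults, random search looks at well under 1% of the grid, which explains both the speed and the missed optimum; raising `n_iter` trades some of that speed back for coverage.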

Now let's try Deep Learning as well, to show that for some tasks DL isn't the answer to everything, you know. #lookingcool

import keras
from keras.models import Sequential
from keras.layers import Dense
from keras.utils import to_categorical

Convert the labels "ham" and "spam" into numbers before feeding them to the model.

from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(arr_class)
y_train = le.transform(y_train)
y_test = le.transform(y_test)
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
print(y_train.shape)

Take note of the second element of the shape of y_XXX (either one will do), because we'll need it when building the model.
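
Here is a plain-Python sketch of what those two steps do to the labels, using a made-up label list. It only imitates the idea behind LabelEncoder (classes ordered by sorted value) and to_categorical (one column per class); it is not their actual implementation.

```python
# Hypothetical labels
labels = ["ham", "spam", "ham", "ham", "spam"]

# Step 1: map each class name to an integer, classes in sorted order
classes = sorted(set(labels))
index = {name: i for i, name in enumerate(classes)}
encoded = [index[name] for name in labels]

# Step 2: one-hot encode, one column per class
one_hot = [[1.0 if i == code else 0.0 for i in range(len(classes))]
           for code in encoded]

shape = (len(one_hot), len(one_hot[0]))
print(encoded)
print(one_hot)
print(shape)  # the second element is the number of classes
```

With two classes, the second element of the shape is 2, and that is the number the output layer needs.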

There are two things to watch. First, input_shape = (XXX,) is the number of features (columns) in our X data. Second, the last model.add must use the size of the y axis, in other words the number of classes we have!

def classification_model():
    # create model
    model = Sequential()
    model.add(Dense(1024, activation='relu', input_shape=(train_transformer_x.shape[1],)))
    model.add(Dense(1024, activation='relu'))
    model.add(Dense(1024, activation='relu'))
    model.add(Dense(1024, activation='relu'))
    model.add(Dense(y_train.shape[1], activation='softmax'))

    # compile model
    model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
    return model
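
For a rough sense of how big this network is, here is the parameter arithmetic for the Dense layers. The vocabulary size of 8,000 is a made-up stand-in (the real value is train_transformer_x.shape[1]), and `dense_params` is a hypothetical helper; each Dense layer holds inputs × units weights plus one bias per unit.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

vocab_size = 8000   # hypothetical; use train_transformer_x.shape[1] in practice
n_classes = 2       # ham and spam

# Layer widths matching the model above: input -> 1024 x 4 -> softmax output
widths = [vocab_size, 1024, 1024, 1024, 1024, n_classes]
nn_total = sum(dense_params(a, b) for a, b in zip(widths, widths[1:]))

# Multinomial Naive Bayes keeps one log-probability per word per class, plus one prior per class
nb_total = n_classes * vocab_size + n_classes

print("neural network parameters:", nn_total)
print("naive bayes parameters:", nb_total)
```

Under that assumption the network carries several hundred times more parameters than Naive Bayes, which is where the extra training time comes from.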

Now train it.

# build the model
model = classification_model()
# fit the model
model.fit(train_transformer_x, y_train, validation_data=(test_transformer_x, y_test), epochs=10, verbose=2)

Then evaluate the model.

value = model.predict(test_transformer_x)
y_pred = numpy.argmax(value, axis=1)
y_true = numpy.argmax(y_test, axis=1)
print('F1 score =', f1_score(y_true, y_pred, average='macro'))
print('Accuracy =', accuracy_score(y_true, y_pred))
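
The argmax step just picks the column with the highest probability in each row. A tiny pure-Python sketch with invented softmax outputs (the helper `argmax` and the numbers are hypothetical):

```python
# Hypothetical softmax outputs: one row per message, columns are [ham, spam]
probs = [[0.90, 0.10],
         [0.20, 0.80],
         [0.55, 0.45]]
classes = ["ham", "spam"]  # same order the label encoding produces

def argmax(row):
    """Index of the largest value in the row."""
    return max(range(len(row)), key=row.__getitem__)

pred_idx = [argmax(row) for row in probs]
pred_labels = [classes[i] for i in pred_idx]

print(pred_idx)
print(pred_labels)
```

In the article's code, le.inverse_transform(y_pred) would turn the integer predictions back into "ham" and "spam" the same way.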

Conclusion

NN is more accurate. The end.

Just kidding, lol

If you know your way around Kaggle and look at the solutions other people have posted, you'll see that accuracy can go as high as 0.99! They get there by tuning the settings and using an LSTM model (another kind of Deep Learning). Still, there are three things we can clearly see:

Plain Naive Bayes with no tuning at all already gives a usable result (0.98).
Tuning with random search didn't pay off because it skips some important values, so it misses the highest score, but it is much faster than grid search.
Deep Learning gave the highest accuracy but takes the most time and knowledge. If you don't really know what you're doing, or can't tune the model, you won't get truly accurate results.

So ML can still hold its own against NN. It may be older, but it's still perfectly usable, as long as you know how to tune it and pick the right model.

Medium : https://medium.com/@pingloaf

Linkedin : https://www.linkedin.com/in/peerat-limkonchotiwat/

This article is part of the series

Getting Started with Machine/Deep Learning 0–100 (Introduction)

Getting Started with Machine Learning 0–100 zero to Mr.incredible (Introduction)

Machine Learning isn't as hard as you think