Automated Driving Crashes Model Part 2

9 min read · Dec 14, 2022

National Highway Traffic Safety Administration

A crash-risk model for Automated Driving (self-driving vehicles)

This post was prepared as part of the Final Project for the course DS511 Data Science, academic year 2565 B.E. (2022).

This continues from our Part 1, which you can read at: Analyzing ADS and ADAS Vehicle Crashes, Part 1

We apologize in advance for any mistakes. Everyone is welcome to share additional knowledge or suggest a more suitable approach (we worry that our feature selection may not be very good 🥲).

To recap where the dataset comes from: we use data on crashes involving ADS- and ADAS-equipped vehicles from the National Highway Traffic Safety Administration (NHTSA), https://www.nhtsa.gov/laws-regulations/standing-general-order-crash-reporting. NHTSA issued a General Order requiring vehicle manufacturers and operators to report crashes that involve advanced driver assistance systems (ADAS) or automated driving systems (ADS). The order lets NHTSA inform users about crashes of ADS and ADAS vehicles in a timely and transparent way. If NHTSA finds a safety defect, it will act to ensure that unsafe vehicles are taken off public roads or remedied as appropriate.

Details on each SAE automation level are available at https://motortrivia.com/2017/01/autonomous-car-sae-classification/, which explains them very thoroughly.

Benefits of a crash-risk model for Automated Driving

If the model performs well, it will be very useful to anyone considering a self-driving vehicle, because we can better understand how these systems fail and keep checking them for the driver's safety.
It can also help us estimate where crashes are likely to happen often, so that the problem can be addressed in time. Automated Driving is still fairly new: it is not available in every country, and even in the United States many companies are still testing it.

Let's get started. This part covers everything from preparing the data for machine learning (ML) to trying out models to see whether they suit our data!

More detailed code is available in the Colab notebook Automated Driving Crashes Model.

What does preparing data for Machine Learning involve?

1. Problem formulation

2. Data collection and discovery

3. Data exploration

4. Data cleansing and validation

5. Data structuring

6. Feature engineering and selection

Because the dataset as we received it is not ready for modeling, we have to prepare the data first. The code for this is in the Colab notebook.

Select Feature

Once the data is prepared, we choose the features for our model. The dataset has many columns, so we need to pick the features that suit the model (this is the part where we apologize again for any mistakes).

We plot a heatmap to visualize the correlation and see which features are correlated.

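import matplotlib.pyplot as plt
import seaborn as sns
# Correlation is only defined for numeric columns
corr = Auto_Drive.select_dtypes('number').corr()
sns.heatmap(corr.round(2), annot=True)
plt.show()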

With the data we have, no strong relationships stand out yet, so we will try other plots as well.

After looking at the table, we choose the column Property Damage? as the model's label, that is, the value to be predicted. The label is assigned to y.

We use a countplot to compare the values of a chosen feature against the label, for example how much the values in Roadway Type differ and whether the feature is suitable for the model.

We compare several features this way to see whether they differ meaningfully and are consistent with the label. More plots are in the Colab notebook.
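The comparison that a countplot draws can also be read as a table of counts. The following is a minimal sketch, assuming a small made-up DataFrame: the rows are hypothetical and not taken from the NHTSA data, only the column names match.

```python
import pandas as pd

# Hypothetical rows for illustration only; the real data comes from the NHTSA file.
toy = pd.DataFrame({
    "Roadway Type": ["Highway / Freeway", "Street", "Street", "Intersection", "Street"],
    "Property Damage?": ["Yes", "Yes", "No", "Yes", "Yes"],
})

# Count how often each Roadway Type occurs with each label value,
# the same numbers a countplot with hue="Property Damage?" would draw as bars.
counts = pd.crosstab(toy["Roadway Type"], toy["Property Damage?"])
print(counts)
```

A feature whose rows all show nearly the same Yes/No split is unlikely to help the model much.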

2. Groupby to see the values in each column

# See which values the column contains by grouping and counting
Property = data1.groupby(['Property Damage?'])['Property Damage?'].count()
Property


Output:

Property Damage?
No           74
Unknown     489
Yes        1038
Name: Property Damage?, dtype: int64

There are a lot of Unknown values, so we choose to drop those rows, since this column is meant to report damage as Yes or No only.

# Drop the rows whose value is Unknown
df_Property = data1[data1['Property Damage?'].str.contains('Unknown')==False]

Roadway_s = df_Property.groupby(['Roadway Surface'])['Roadway Surface'].count()
Roadway_s

Output:

Roadway Surface
Dry                     647
Other, see Narrative      2
Snow / Slush / Ice        5
Unknown                 385
Wet                      73
Name: Roadway Surface, dtype: int64

Looking at the countplots and the groupby results, most other columns also contain many Unknown values. We check every remaining column and drop those rows as well.

Rows containing Unknown in a column are dropped with the same kind of code as above.

df_Property1 = df_Property[df_Property['Roadway Surface'].str.contains('Unknown')==False]

Next, notice that the Weather columns in our dataframe are already split into one column per weather condition. Whatever the weather was on the day of the crash is marked with the value Y in the matching column.

So, to make modeling easier, we set Y in the Weather columns to 1 and the blank value (a single space in the code) to 0.

df_Property0.replace(to_replace={
    'Weather - Clear':{' ': 0, 'Y': 1}, 
    'Weather - Snow':{' ': 0, 'Y': 1},
    'Weather - Cloudy':{' ': 0, 'Y': 1},
    'Weather - Rain':{' ': 0, 'Y': 1}
                 }, inplace=True)

In addition, to reduce problems in the model, we convert the values of our label column Property Damage? to Yes = 1 and No = 0.
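The label conversion can be sketched as follows. This is an illustration on a hypothetical three-row Series, not the notebook's exact code, which is in the Colab.

```python
import pandas as pd

# Hypothetical label values for illustration.
labels = pd.Series(["Yes", "No", "Yes"], name="Property Damage?")

# Map the two remaining categories to integers: Yes -> 1, No -> 0.
encoded = labels.map({"Yes": 1, "No": 0})
print(encoded.tolist())
```

Any value missing from the mapping would become NaN, which is one reason the Unknown rows have to be dropped first.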

That is roughly how the data was prepared.

OneHotEncoder on multiple categorical columns

Now for the OneHotEncoder step. We use sklearn's preprocessing module to convert the categorical data into numbers.

from sklearn.preprocessing import LabelEncoder
# instantiate labelencoder object
le = LabelEncoder()

# apply le on categorical feature columns (le is refit for every column,
# so afterwards it only remembers the classes of the last one)
df_formodel[['Roadway Type', 'Roadway Surface', 'Lighting']] = df_formodel[['Roadway Type', 'Roadway Surface', 'Lighting']].apply(lambda col: le.fit_transform(col))

from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()

# One-hot-encode the categorical columns.
# fit_transform returns a SciPy sparse matrix by default, so convert it to a dense array.
array_hot_encoded = ohe.fit_transform(df_formodel[['Roadway Type', 'Roadway Surface', 'Lighting', 'Weather - Clear', 'Weather - Snow', 'Weather - Cloudy',
       'Weather - Rain']]).toarray()

# Convert it to a DataFrame that shares the original index
data_hot_encoded = pd.DataFrame(array_hot_encoded, index=df_formodel.index)

# Keep every original column except 'Model Year'
data_other_cols = df_formodel.drop(columns=['Model Year'])

# Concatenate the two dataframes side by side
data_out = pd.concat([data_hot_encoded, data_other_cols], axis=1)
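For comparison, pandas can one-hot encode string columns directly and keep readable column names. This is only a sketch on a small hypothetical frame, not the pipeline used in this post.

```python
import pandas as pd

# Hypothetical rows for illustration.
toy = pd.DataFrame({
    "Roadway Surface": ["Dry", "Wet", "Dry"],
    "Lighting": ["Daylight", "Dark - Lighted", "Daylight"],
})

# One indicator column per category; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(toy, columns=["Roadway Surface", "Lighting"], dtype=int)
print(encoded.columns.tolist())
```

Each resulting column is named after the original column and the category, which makes later feature selection easier to read than integer column names.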

We show Corr once more. This is the DataFrame we will use to train the model.

Now it is time for the modeling part.

The first model we use is KNeighborsClassifier!

KNeighborsClassifier

The K-Nearest Neighbour algorithm, also known as kNN, is a classification method. It decides which class a new case belongs to by finding the existing cases most similar to it (its k nearest neighbours), counting how many of them fall into each class, and assigning the new case to the class that is most common among them.
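The idea can be sketched in a few lines of NumPy. This is a simplified illustration with Euclidean distance and a plain majority vote on hypothetical 2-D points; the function name knn_predict is made up here, and KNeighborsClassifier itself is more elaborate.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest points.
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbours and return the most common one.
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical 2-D points: class 0 near the origin, class 1 near (5, 5).
X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.5, 0.5])))
print(knn_predict(X_toy, y_toy, np.array([5.5, 5.5])))
```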

Don't forget to import the libraries!

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

When building a model, an important first step is to train_test_split the data.

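# An integer test_size is an absolute number of test rows, so only 2 rows are held out here
# (a float such as 0.2 would hold out 20% of the rows instead)
test_size= 2
X_train, X_test, y_train, y_test = train_test_split(
    data_out[['Mileage', 'Roadway Type', 'Roadway Surface', 'Lighting',
              'Weather - Clear', 'Weather - Snow', 'Weather - Cloudy', 'Weather - Rain']],
    data_out['Property Damage?'],
    test_size=test_size, random_state=7) # any value can be used for random_state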

model = KNeighborsClassifier()
model

# fit hands the training data to the machine learning model to learn from
model.fit(X_train, y_train)

model.score(X_train, y_train)

Output:

0.9511111111111111

Let's try predict: 1 means there was property damage, 0 means there was none.

model.predict([
    [4, 0, 1, 1, 0, 0, 0, 1],
    [4, 0, 1, 1, 0, 0, 0, 1]
              ])

Output:

array([1, 1])
# Compared against the df in Colab, these predictions are correct

Show the wrong predictions next to the actual values as a DataFrame

# Predictions on the training set
predicted = model.predict(X_train)
dx=pd.DataFrame({'y_true': y_train, 'y_pred': predicted})
# Show the rows that were predicted wrongly
dx[dx.y_true != dx.y_pred]

What the DataFrame looks like

Confusion matrix for KNeighborsClassifier

We evaluate the KNN model with scores such as accuracy, precision, recall and F1 score.

from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_train, predicted))

Output:

[[ 21  22]
 [ 11 621]]

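The main diagonal (top-left to bottom-right) holds the correct predictions; the image linked below explains the confusion matrix further. scikit-learn arranges the matrix as [[TN, FP], [FN, TP]], so here TN = 21, FP = 22, FN = 11 and TP = 621. In total 642 predictions are correct and 33 are wrong.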

https://subscription.packtpub.com/book/big-data-and-business-intelligence/9781838555078/6/ch06lvl1sec34/confusion-matrix
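The usual scores follow directly from the four cells of the matrix. As a worked check using only the KNN counts printed above:

```python
# Counts read from the KNN confusion matrix printed above:
# scikit-learn lays it out as [[TN, FP], [FN, TP]].
tn, fp, fn, tp = 21, 22, 11, 621

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share of all predictions that are right
precision = tp / (tp + fp)                   # of the rows predicted 1, how many are really 1
recall = tp / (tp + fn)                      # of the rows that are really 1, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

The accuracy computed this way, 642 / 675, is the same value that model.score(X_train, y_train) returned earlier.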

Confusion matrix with LogisticRegression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

# X and y are the full feature matrix and label (677 rows), as defined in the Colab notebook
lr = LogisticRegression(solver ='liblinear').fit(X, y)
lr_predicted = lr.predict(X)
confusion = confusion_matrix(y, lr_predicted)

print('Logistic regression classifier (default settings)\n', confusion)
tn, fp, fn, tp = confusion.ravel() # ravel flattens the matrix to 1 dimension
print(f'TP: {tp}, FP: {fp}, TN:{tn}, FN:{fn}')

Output:

Logistic regression classifier (default settings)
 [[  0  43]
 [  0 634]]
TP: 634, FP: 43, TN:0, FN:0

print(lr_predicted)

To get probabilities instead, use lr.predict_proba, which returns one column per class (class 0 and class 1). For now we use predict because we have not set a threshold yet.

In short, Logistic Regression predicts class 1 for every row, so it never identifies a class-0 (No) case (heavy underfitting).
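A model that always answers 1 still looks accurate on this data because the label is so imbalanced. A quick check using only the class counts from the confusion matrix above (43 rows of class 0, 634 rows of class 1):

```python
# Class counts taken from the Logistic Regression confusion matrix.
n_no, n_yes = 43, 634

# A model that always answers 1 gets every Yes row right and every No row wrong.
baseline_accuracy = n_yes / (n_no + n_yes)
print(round(baseline_accuracy, 2))  # 0.94
```

This is why accuracy alone is not enough here, and why the per-class precision and recall matter.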

DecisionTreeClassifier

from sklearn.tree import DecisionTreeClassifier # another model

dt = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
tree_predicted = dt.predict(X_train)
confusion = confusion_matrix(y_train, tree_predicted)

print('Decision tree classifier (max_depth = 2)\n', confusion)
tn, fp, fn, tp =confusion.ravel()
print(f'TP: {tp}, FP: {fp}, TN:{tn}, FN:{fn}')

Output:

Decision tree classifier (max_depth = 2)
 [[  3  40]
 [  0 632]]
TP: 632, FP: 40, TN:3, FN:0

DecisionTreeClassifier gets only 3 of the 43 class-0 rows right and predicts the other 40 as class 1.

Let's check the other metrics too

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print('Accuracy: {:.2f}'.format(accuracy_score(y, lr_predicted)))
print('Precision: {:.2f}'.format(precision_score(y, lr_predicted)))
print('Recall: {:.2f}'.format(recall_score(y, lr_predicted)))
print('F1: {:.2f}'.format(f1_score(y, lr_predicted)))

Output:

Accuracy: 0.94
Precision: 0.94
Recall: 1.00
F1: 0.97


# Combined report with all above metrics
# Shown in the form of a classification_report
from sklearn.metrics import classification_report
print('Logistic regression\n', 
      classification_report(y, lr_predicted, target_names = ['not 1', '1']))
print('Decision tree\n', 
      classification_report(y, tree_predicted, target_names = ['not 1', '1']))

Output:

Logistic regression
               precision    recall  f1-score   support

       not 1       0.00      0.00      0.00        43
           1       0.94      1.00      0.97       634

    accuracy                           0.94       677
   macro avg       0.47      0.50      0.48       677
weighted avg       0.88      0.94      0.91       677

Decision tree
               precision    recall  f1-score   support

       not 1       0.00      0.00      0.00        43
           1       0.94      1.00      0.97       634

    accuracy                           0.94       677
   macro avg       0.47      0.50      0.48       677
weighted avg       0.88      0.94      0.91       677

Decision functions and Changing thresholds

This part is based on the material linked as Here, which you can read for more detail.

We start by showing the decision_function scores for the first 20 instances.

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
y_scores_lr = lr.fit(X_train, y_train).decision_function(X_test)
y_score_list = list(zip(y_test[0:20], y_scores_lr[0:20]))

# show the decision_function scores for first 20 instances
y_score_list

Next we show the probability of the positive class for the first 20 instances.

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
y_proba_lr = lr.fit(X_train, y_train).predict_proba(X_test)
y_proba_list = list(zip(y_test[0:20], y_proba_lr[0:20,1]))

# show the probability of positive class for first 20 instances
y_proba_list

lr_predicted = lr.predict(X_test)
print('Logistic regression\n', 
      classification_report(y_test, lr_predicted, target_names = ['not 1', '1']))

# Column [:, 1] holds the probability of class 1
y_predicted = y_proba_lr[:, 1] > .5

print('Logistic regression (threshold=50%)\n', 
      classification_report(y_test, y_predicted, target_names = ['not 1', '1']))

By default, Logistic Regression uses a threshold of 0.5.
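Changing the threshold only changes where the class-1 probability is cut. A small sketch with hypothetical probabilities (not values from our model) shows the effect:

```python
import numpy as np

# Hypothetical class-1 probabilities, shaped like predict_proba(...)[:, 1].
proba_class1 = np.array([0.95, 0.80, 0.55, 0.40, 0.10])

for threshold in (0.5, 0.7):
    # Predict 1 only when the class-1 probability exceeds the threshold.
    y_pred = (proba_class1 > threshold).astype(int)
    print(threshold, y_pred.tolist())
```

Raising the threshold makes the model more reluctant to predict class 1, which trades recall for precision on that class.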

Reading the reports of other models

from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
# Create a report for each model
from sklearn.metrics import classification_report, f1_score, confusion_matrix
from sklearn import metrics

algo = [
        [GaussianNB(),'GaussianNB'],
        [GradientBoostingClassifier(),'GradientBoostingClassifier'],
        [AdaBoostClassifier(),'AdaBoostClassifier'],
        [RandomForestClassifier(),'RandomForestClassifier']
  ]

for a in algo:
    model=a[0]
    model.fit(X_train,y_train)
    y_pred = model.predict(X_test)
    print(f'{a[0]} score: {model.score(X_test,y_test)}')
    print("f1_score:",f1_score(y_test, y_pred))
    print(metrics.confusion_matrix(y_test,y_pred))
    print(metrics.classification_report(y_test,y_pred))

Output:

GaussianNB() score: 0.9176470588235294
f1_score: 0.9570552147239264
[[  0  13]
 [  1 156]]
              precision    recall  f1-score   support

           0       0.00      0.00      0.00        13
           1       0.92      0.99      0.96       157

    accuracy                           0.92       170
   macro avg       0.46      0.50      0.48       170
weighted avg       0.85      0.92      0.88       170

GradientBoostingClassifier() score: 0.9411764705882353
f1_score: 0.9681528662420382
[[  8   5]
 [  5 152]]
              precision    recall  f1-score   support

           0       0.62      0.62      0.62        13
           1       0.97      0.97      0.97       157

    accuracy                           0.94       170
   macro avg       0.79      0.79      0.79       170
weighted avg       0.94      0.94      0.94       170

AdaBoostClassifier() score: 0.9352941176470588
f1_score: 0.9661538461538461
[[  2  11]
 [  0 157]]
              precision    recall  f1-score   support

           0       1.00      0.15      0.27        13
           1       0.93      1.00      0.97       157

    accuracy                           0.94       170
   macro avg       0.97      0.58      0.62       170
weighted avg       0.94      0.94      0.91       170

RandomForestClassifier() score: 0.9470588235294117
f1_score: 0.9712460063897764
[[  9   4]
 [  5 152]]
              precision    recall  f1-score   support

           0       0.64      0.69      0.67        13
           1       0.97      0.97      0.97       157

    accuracy                           0.95       170
   macro avg       0.81      0.83      0.82       170
weighted avg       0.95      0.95      0.95       170

Logistic Regression

model_b = LogisticRegression()


model_b.fit(X_train, y_train)

predicted_b = model_b.predict(X_train) # the output is an array

Print the confusion_matrix again

print(confusion_matrix(y_train, predicted_b))

Output: 

[[  0  30]
 [  0 477]]

print(accuracy_score(y_train, predicted_b))

Output: 

0.9408284023668639

To sum up, this time 30 rows are predicted wrongly and 477 correctly, giving accuracy_score = 0.94.

Show the report

print(classification_report(y_train, predicted_b))

Output:

              precision    recall  f1-score   support

           0       0.00      0.00      0.00        30
           1       0.94      1.00      0.97       477

    accuracy                           0.94       507
   macro avg       0.47      0.50      0.48       507
weighted avg       0.89      0.94      0.91       507

Grid-Search with Cross-Validation

Cross-validation works by splitting the dataset into random groups (folds), holding one group out as the test set and training the model on the remaining groups. This is repeated so that each group serves as the test set once, and the average score over all runs is reported as the model's result.
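The procedure can be written out by hand with KFold. This is a sketch on synthetic, hypothetical data (the names X_demo and y_demo are made up here), using LogisticRegression only as an example estimator.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# Synthetic data: 100 rows, 2 features, label depends on the sign of the first feature.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_demo):
    # Train on four folds, score on the held-out fold.
    clf = LogisticRegression().fit(X_demo[train_idx], y_demo[train_idx])
    scores.append(clf.score(X_demo[test_idx], y_demo[test_idx]))

# The cross-validation score is the mean over the folds.
print(len(scores), np.mean(scores))
```

cross_val_score in sklearn.model_selection wraps this same loop in a single call.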

sklearn's GridSearchCV helps find the model parameters that work best for our data.

estimator: the specific model we want to run, in our case Random Forest Classification
param_grid: the grid of parameters to search over. It must be a dictionary whose keys match the parameter names of the chosen estimator and whose values are lists of the values to try for each parameter
cv: the number of folds used for cross-validation
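Put together, the three arguments look like this. The sketch below runs on synthetic data from make_classification, and the grid values are hypothetical choices for illustration rather than tuned settings for the crash data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the crash features.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical grid: the keys must match RandomForestClassifier parameter names.
rf_param_grid = {"n_estimators": [10, 50], "max_depth": [2, 4]}

rf_grid = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=rf_param_grid,
    cv=5,  # number of cross-validation folds
)
rf_grid.fit(X_demo, y_demo)

print(rf_grid.best_params_)
print(rf_grid.best_score_)
```

Every combination in the grid is cross-validated, so the cost grows with the product of the list lengths times the number of folds.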

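import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

np.set_printoptions(suppress=True, precision=3)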

scaler = StandardScaler()
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
ridge = Ridge().fit(X_train_scaled, y_train)

X_test_scaled = scaler.transform(X_test)
ridge.score(X_test_scaled, y_test)

Output:

-0.058316647869001814

from sklearn.model_selection import GridSearchCV
param_grid = {'alpha': np.logspace(-3, 3, 13)}
print(param_grid)

grid = GridSearchCV(Ridge(), param_grid, cv=10, return_train_score=True)
grid.fit(X_train_scaled, y_train)

Output:

GridSearchCV(cv=10, estimator=Ridge(),
             param_grid={'alpha': array([   0.001,    0.003,    0.01 ,    0.032,    0.1  ,    0.316,
          1.   ,    3.162,   10.   ,   31.623,  100.   ,  316.228,
       1000.   ])},
             return_train_score=True)

import pandas as pd
results = pd.DataFrame(grid.cv_results_)
results.plot('param_alpha', 'mean_train_score')
results.plot('param_alpha', 'mean_test_score', ax=plt.gca())
plt.fill_between(results.param_alpha.astype(float),
                 results['mean_train_score'] + results['std_train_score'],
                 results['mean_train_score'] - results['std_train_score'], alpha=0.2)
plt.fill_between(results.param_alpha.astype(float),
                 results['mean_test_score'] + results['std_test_score'],
                 results['mean_test_score'] - results['std_test_score'], alpha=0.2)
plt.legend()
plt.xscale("log")

print(grid.best_params_)
print(grid.best_score_)

# y versus y_predicted from cross-validation
from sklearn import linear_model
from sklearn.model_selection import cross_val_predict
lr = linear_model.LinearRegression()

y_predicted = cross_val_predict(lr, X, y, cv=10)

fig, ax = plt.subplots()
ax.scatter(y, y_predicted, edgecolors=(0, 0, 0))

ax.set_xlabel('Measured')
ax.set_ylabel('Predicted')
plt.show()

from sklearn.linear_model import LinearRegression
linreg = LinearRegression().fit(X_train_scaled, y_train)
plt.scatter(range(X.shape[1]), linreg.coef_, c=np.sign(linreg.coef_), cmap="bwr_r")

We plot Ridge against Lasso, comparing their cross-validation scores on X_train_scaled, y_train across different values of alpha.

from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

alphas = np.logspace(-3, 3, 30)

plt.figure(figsize=(5, 3))

for Model in [Ridge, Lasso]:
    scores = [cross_val_score(Model(alpha), X_train_scaled, y_train, cv=10, scoring='r2').mean() 
              for alpha in alphas]
    plt.plot(alphas, scores, 'o-.', label=Model.__name__)

plt.legend(loc='lower left')
plt.xlabel('alpha')
plt.ylabel('cross validation score')
plt.tight_layout()
plt.xscale("log")
plt.show()

PolynomialFeatures/ElasticNet

from sklearn.preprocessing import PolynomialFeatures, scale


param_grid = {'alpha': np.logspace(-4, -1, 10), 'l1_ratio': [0.01, .1, .5, .9, .98, 1]}
print(param_grid)

Output:

{'alpha': array([0.   , 0.   , 0.   , 0.001, 0.002, 0.005, 0.01 , 0.022, 0.046,
       0.1  ]), 'l1_ratio': [0.01, 0.1, 0.5, 0.9, 0.98, 1]}

from sklearn.linear_model import ElasticNet
# max_iter is given as an integer; recent scikit-learn versions reject a float such as 1e6
grid = GridSearchCV(ElasticNet(max_iter=1_000_000), param_grid, cv=10, return_train_score=True)
grid.fit(X_train, y_train)

Output:

GridSearchCV(cv=10, estimator=ElasticNet(max_iter=1000000.0),
             param_grid={'alpha': array([0.   , 0.   , 0.   , 0.001, 0.002, 0.005, 0.01 , 0.022, 0.046,
       0.1  ]),
                         'l1_ratio': [0.01, 0.1, 0.5, 0.9, 0.98, 1]},
             return_train_score=True)

print(grid.best_params_)
print(grid.best_score_)

Output:

{'alpha': 0.00046415888336127773, 'l1_ratio': 1}
0.03133010539198968

import pandas as pd
res = pd.pivot_table(pd.DataFrame(grid.cv_results_), values='mean_test_score', index='param_alpha', columns='param_l1_ratio')
pd.set_option("display.precision",3)
res = res.set_index(res.index.values.round(4))
res

Let's plot it with a colorbar

import seaborn as sns
plt.figure(dpi=100)
plt.imshow(res) #, vmin=.70, vmax=.825)
plt.colorbar()
alphas = param_grid['alpha']
l1_ratio = np.array(param_grid['l1_ratio'])
plt.xlabel("l1_ratio")
plt.ylabel("alpha")
plt.yticks(range(len(alphas)), ["{:.4f}".format(a) for a in alphas])
plt.xticks(range(len(l1_ratio)), l1_ratio);

Summary:

After training several models, we found that some look promising and some do not work. The author expects classification models to suit this data better than regression models. That said, with better data preparation, both kinds of model should be usable for assessing damage. The article Road Crash Prediction Models: Different Statistical Modeling Approaches also recommends some interesting models for anyone who wants to study further. The author still has a lot more to learn as well. Thank you very much to everyone who took the time to read this post.

One day I will come back and write a rematch 🤣