Intro to Machine Learning: An Order-Lateness Prediction Model, v2

I finally found time to improve the order-prediction v1 model from last time, and this time I'll include some code along the way.

The workflow is exactly the same as last time:

  1. Get the data
  2. Filter and clean the data
  3. Build the model
  4. Train the model
  5. Results

First, import the packages we'll need:

import numpy as np
import pandas as pd
from sklearn import preprocessing

1. Get the data (the same data as last time)

Pull 50,000 historical order records.

This time I work on the features, adding the courier's ID to try to improve the model's accuracy.

all_df = pd.read_csv("order_training_data.csv", index_col=False)
all_df.shape
# (50000, 7): 50,000 rows, 7 columns

Our data looks like this:
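
You can take a quick peek at it yourself (only expectedDayOfWeek and courierId are column names confirmed by the code below; the first column holds the late/on-time label):

all_df.head()    # first five rows: the label column plus six feature columns
all_df.dtypes    # check which columns are numeric vs. categorical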

2. Extract the features

Build a function to preprocess the data:

2-1. Courier IDs are categorical data, so casting them straight to numbers carries little meaning. The pd.get_dummies function converts a categorical variable into a "dummy matrix" (also called an "indicator matrix").

Roughly like this:
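
Here's a toy example (the Mon/Tue values are made up purely for illustration):

demo = pd.DataFrame({"expectedDayOfWeek": ["Mon", "Tue", "Mon"]})
pd.get_dummies(demo, columns=["expectedDayOfWeek"])
#    expectedDayOfWeek_Mon  expectedDayOfWeek_Tue
# 0                      1                      0
# 1                      0                      1
# 2                      1                      0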

2-2. Normalize the data, rescaling every feature to a value between 0 and 1.

2-3. Return the features and the labels.

def PreprocessData(raw_df):
    # One-hot encode the categorical columns (day of week, courier ID)
    x_OneHot_df = pd.get_dummies(data=raw_df, columns=["expectedDayOfWeek", "courierId"])
    ndarray = x_OneHot_df.values
    Features = ndarray[:, 1:]  # everything after the first column
    Label = ndarray[:, 0]      # the first column is the late/on-time label
    # Rescale every feature into the 0~1 range
    minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1))
    scaledFeatures = minmax_scale.fit_transform(Features)
    return scaledFeatures, Label
One trap here: the raw courier IDs look like oooxxx1234abcd, a string of meaningless characters. In my case, feeding them straight into pd.get_dummies didn't work; I first had to convert them into ordered codes such as a, b, c, d…. or 0, 1, 2, 3….
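
One simple way to do that mapping in pandas (a sketch; run it on the raw frame before calling PreprocessData):

# map arbitrary ID strings onto ordered integer codes 0, 1, 2, ...
all_df["courierId"] = all_df["courierId"].astype("category").cat.codes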

Split the data 8:2, giving roughly 40,000 training records and 10,000 test records. (Note: since the scaler above is fitted on all the data before splitting, a little test-set information leaks into training; fitting it on the training split only would be stricter.)

msk = np.random.rand(len(all_df)) < 0.8  # boolean mask, ~80% True
scaledFeatures, Label = PreprocessData(all_df)
train_Features, train_Label = scaledFeatures[msk], Label[msk]
test_Features, test_Label = scaledFeatures[~msk], Label[~msk]
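
(Aside: if you want an exact 8:2 split rather than a random-mask approximation, scikit-learn's train_test_split does the same job. A sketch, not what I used above:)

from sklearn.model_selection import train_test_split

train_Features, test_Features, train_Label, test_Label = train_test_split(
    scaledFeatures, Label, test_size=0.2, random_state=42)  # exact 80/20 split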

3. Build the model

As before, we train an MLP (Multi-Layer Perceptron).

The architecture:

  1. One input layer: the data has 76 feature values, which map directly onto 76 input neurons
  2. One hidden layer: following common practice, input size ÷ 2 = 38 neurons
  3. One output layer: we only need to know late or not (0, 1), so a single neuron

Randomly drop 25% of the neurons during training (Dropout) to help prevent overfitting.

from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(units=76, input_dim=76,
                kernel_initializer='uniform',
                activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(units=38,
                kernel_initializer='uniform',
                activation='relu'))
model.add(Dropout(0.25))
model.add(Dense(units=1,
                kernel_initializer='uniform',
                activation='sigmoid'))
model.summary()
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_10 (Dense)             (None, 76)                5852
_________________________________________________________________
dropout_8 (Dropout)          (None, 76)                0
_________________________________________________________________
dense_11 (Dense)             (None, 38)                2926
_________________________________________________________________
dropout_9 (Dropout)          (None, 38)                0
_________________________________________________________________
dense_12 (Dense)             (None, 1)                 39
=================================================================
Total params: 8,817
Trainable params: 8,817
Non-trainable params: 0
_________________________________________________________________

4. Train the model

Train for 50 epochs; the log looks like this:

model.compile(loss='binary_crossentropy',
              optimizer='adam', metrics=['accuracy'])
train_history = model.fit(x=train_Features,
                          y=train_Label,
                          validation_split=0.1,
                          epochs=50,
                          batch_size=200, verbose=2)
Train on 36029 samples, validate on 4004 samples
Epoch 1/50
1s - loss: 0.4134 - acc: 0.8235 - val_loss: 0.6000 - val_acc: 0.6938
Epoch 2/50
0s - loss: 0.4140 - acc: 0.8230 - val_loss: 0.6037 - val_acc: 0.6931
Epoch 3/50
0s - loss: 0.4110 - acc: 0.8231 - val_loss: 0.6051 - val_acc: 0.6926
Epoch 4/50
0s - loss: 0.4131 - acc: 0.8227 - val_loss: 0.6047 - val_acc: 0.6923
Epoch 5/50
0s - loss: 0.4117 - acc: 0.8230 - val_loss: 0.6016 - val_acc: 0.6966
Epoch 6/50
0s - loss: 0.4120 - acc: 0.8239 - val_loss: 0.6013 - val_acc: 0.6953
Epoch 7/50
0s - loss: 0.4110 - acc: 0.8240 - val_loss: 0.6018 - val_acc: 0.6963
Epoch 8/50
0s - loss: 0.4108 - acc: 0.8239 - val_loss: 0.6066 - val_acc: 0.6888
Epoch 9/50
0s - loss: 0.4111 - acc: 0.8241 - val_loss: 0.6054 - val_acc: 0.6931
Epoch 10/50
0s - loss: 0.4110 - acc: 0.8246 - val_loss: 0.6026 - val_acc: 0.6943
Epoch 11/50
0s - loss: 0.4116 - acc: 0.8233 - val_loss: 0.6070 - val_acc: 0.6871
Epoch 12/50
0s - loss: 0.4106 - acc: 0.8234 - val_loss: 0.6065 - val_acc: 0.6903
Epoch 13/50
0s - loss: 0.4093 - acc: 0.8251 - val_loss: 0.6052 - val_acc: 0.6911
Epoch 14/50
0s - loss: 0.4107 - acc: 0.8234 - val_loss: 0.6056 - val_acc: 0.6911
Epoch 15/50
0s - loss: 0.4094 - acc: 0.8234 - val_loss: 0.6033 - val_acc: 0.6943
Epoch 16/50
0s - loss: 0.4098 - acc: 0.8252 - val_loss: 0.6116 - val_acc: 0.6863
Epoch 17/50
0s - loss: 0.4098 - acc: 0.8241 - val_loss: 0.6075 - val_acc: 0.6938
Epoch 18/50
0s - loss: 0.4093 - acc: 0.8252 - val_loss: 0.6102 - val_acc: 0.6903
Epoch 19/50
0s - loss: 0.4101 - acc: 0.8239 - val_loss: 0.6086 - val_acc: 0.6893
Epoch 20/50
0s - loss: 0.4088 - acc: 0.8251 - val_loss: 0.6101 - val_acc: 0.6871
Epoch 21/50
0s - loss: 0.4088 - acc: 0.8247 - val_loss: 0.6069 - val_acc: 0.6906
Epoch 22/50
0s - loss: 0.4085 - acc: 0.8261 - val_loss: 0.6058 - val_acc: 0.6921
Epoch 23/50
0s - loss: 0.4082 - acc: 0.8240 - val_loss: 0.6073 - val_acc: 0.6918
Epoch 24/50
0s - loss: 0.4071 - acc: 0.8254 - val_loss: 0.6079 - val_acc: 0.6891
Epoch 25/50
0s - loss: 0.4080 - acc: 0.8237 - val_loss: 0.6058 - val_acc: 0.6911
Epoch 26/50
0s - loss: 0.4089 - acc: 0.8238 - val_loss: 0.6032 - val_acc: 0.6916
Epoch 27/50
0s - loss: 0.4067 - acc: 0.8246 - val_loss: 0.6099 - val_acc: 0.6873
Epoch 28/50
0s - loss: 0.4075 - acc: 0.8243 - val_loss: 0.6053 - val_acc: 0.6928
Epoch 29/50
0s - loss: 0.4076 - acc: 0.8246 - val_loss: 0.6073 - val_acc: 0.6911
Epoch 30/50
0s - loss: 0.4081 - acc: 0.8251 - val_loss: 0.6089 - val_acc: 0.6883
Epoch 31/50
0s - loss: 0.4065 - acc: 0.8256 - val_loss: 0.6057 - val_acc: 0.6921
Epoch 32/50
0s - loss: 0.4072 - acc: 0.8251 - val_loss: 0.6055 - val_acc: 0.6956
Epoch 33/50
0s - loss: 0.4069 - acc: 0.8247 - val_loss: 0.6099 - val_acc: 0.6888
Epoch 34/50
0s - loss: 0.4071 - acc: 0.8252 - val_loss: 0.6031 - val_acc: 0.6913
Epoch 35/50
0s - loss: 0.4076 - acc: 0.8240 - val_loss: 0.6094 - val_acc: 0.6881
Epoch 36/50
0s - loss: 0.4063 - acc: 0.8263 - val_loss: 0.6026 - val_acc: 0.6958
Epoch 37/50
0s - loss: 0.4054 - acc: 0.8252 - val_loss: 0.6074 - val_acc: 0.6898
Epoch 38/50
0s - loss: 0.4058 - acc: 0.8253 - val_loss: 0.6117 - val_acc: 0.6841
Epoch 39/50
0s - loss: 0.4057 - acc: 0.8241 - val_loss: 0.6040 - val_acc: 0.6921
Epoch 40/50
0s - loss: 0.4066 - acc: 0.8253 - val_loss: 0.6148 - val_acc: 0.6866
Epoch 41/50
0s - loss: 0.4064 - acc: 0.8251 - val_loss: 0.6119 - val_acc: 0.6908
Epoch 42/50
0s - loss: 0.4059 - acc: 0.8246 - val_loss: 0.6139 - val_acc: 0.6836
Epoch 43/50
0s - loss: 0.4058 - acc: 0.8256 - val_loss: 0.6049 - val_acc: 0.6933
Epoch 44/50
0s - loss: 0.4068 - acc: 0.8255 - val_loss: 0.6094 - val_acc: 0.6926
Epoch 45/50
0s - loss: 0.4056 - acc: 0.8247 - val_loss: 0.6085 - val_acc: 0.6928
Epoch 46/50
0s - loss: 0.4052 - acc: 0.8249 - val_loss: 0.6100 - val_acc: 0.6893
Epoch 47/50
0s - loss: 0.4061 - acc: 0.8245 - val_loss: 0.6084 - val_acc: 0.6896
Epoch 48/50
0s - loss: 0.4049 - acc: 0.8259 - val_loss: 0.6095 - val_acc: 0.6893
Epoch 49/50
0s - loss: 0.4052 - acc: 0.8253 - val_loss: 0.6094 - val_acc: 0.6901
Epoch 50/50
0s - loss: 0.4050 - acc: 0.8248 - val_loss: 0.6108 - val_acc: 0.6906

Staring at raw numbers is painful, so let's plot the training history instead:

import matplotlib.pyplot as plt

def show_train_history(train_history, train, validation):
    plt.plot(train_history.history[train])
    plt.plot(train_history.history[validation])
    plt.title('Train History')
    plt.ylabel(train)
    plt.xlabel('Epoch')
    plt.legend(['train', 'validation'], loc='upper left')
    plt.show()

show_train_history(train_history, 'acc', 'val_acc')
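
The same helper also plots the loss curves:

show_train_history(train_history, 'loss', 'val_loss')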

5. Results

Now take the roughly 10,000 records we set aside at the start and evaluate the freshly trained model on them:

scores = model.evaluate(x=test_Features, y=test_Label)
scores[1]  # evaluate returns [loss, accuracy], so index 1 is the accuracy
# 0.80646132236978074
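
To actually use the model, predict returns the probability that an order will be late (shown here on the test features):

probs = model.predict(test_Features)        # probability each order is late
late_flags = (probs > 0.5).astype('int')    # 1 = predicted late, 0 = on time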

Up from last time's 73% to 80%!!

Personally, I'm pretty happy with 80% XD

To push the accuracy higher, the next step is probably to go back to the data: drop some outliers and add more features. Next time I'll look into heatmaps, to visualize the correlations in the data.
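
As a tiny preview of that heatmap idea, plain matplotlib can already draw one (a sketch, assuming the frame is fully numeric after the courier-ID mapping above):

corr = all_df.corr()                                     # pairwise feature correlations
plt.imshow(corr, cmap='coolwarm')                        # correlation matrix as a heatmap
plt.colorbar()
plt.xticks(range(len(corr)), corr.columns, rotation=90)
plt.yticks(range(len(corr)), corr.columns)
plt.show()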