Getting Started with Machine Learning: An Order-Lateness Prediction Model, v1

Jack Sung
11 min read · Jun 27, 2017


I've recently been getting into machine learning. I tried starting from the fundamentals, but progress was slow, so I decided to come at it from another angle and get a quicker sense of accomplishment.

I found a Keras + TensorFlow tutorial and followed along, playing with classification on a few datasets: handwritten digit recognition on MNIST, image classification on CIFAR-10, Titanic survival prediction, and so on.

Once I had some footing, I wanted to practice on real data. My company runs a food-delivery service, so there is order data to experiment with. I came up with the simplest application I could think of:

Can we predict, at the moment an order is placed, whether it will be late?

Definition of "late": the actual delivery time exceeds the expected delivery time.

Goal: use historical order data to predict the probability that a future order will be late.

  1. Get the data
  2. Extract features
  3. Build the model
  4. Train the model
  5. Results

1. Get the data

  1. Pull 50,000 historical orders
  2. Split the data 8:2, giving about 40,000 training samples and 10,000 validation samples
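The post doesn't show the data-loading code, so here is a sketch of the 8:2 split, under the assumption that the orders sit in NumPy arrays (the array names, the fixed seed, and the random stand-in data are mine):

```python
import numpy as np

def train_validation_split(features, labels, train_ratio=0.8, seed=42):
    """Shuffle the orders, then split them 8:2 into training and validation sets."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(features))
    cut = int(len(features) * train_ratio)
    train_idx, val_idx = indices[:cut], indices[cut:]
    return (features[train_idx], labels[train_idx],
            features[val_idx], labels[val_idx])

# 50,000 orders with 5 features each (random stand-ins for the real order data)
X = np.random.rand(50_000, 5)
y = np.random.randint(0, 2, size=50_000)
X_train, y_train, X_val, y_val = train_validation_split(X, y)
```

Shuffling before splitting matters here: orders are usually stored in time order, and a straight head/tail cut would put different weeks (and traffic patterns) in the two sets.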

2. Extract features

Five features are extracted:

  1. Order-creation time (seconds elapsed since 00:00, e.g. 08:00 => 28,800)
  2. Expected delivery time
  3. Day of week of the expected delivery time
  4. Order amount
  5. Straight-line GPS distance from the restaurant to the recipient
Late  Order time  Expected delivery  Weekday  Amount  Distance (km)
0     65909       69506              4        170.0   0.284
0     66973       70508              4        220.0   0.892
0     69379       72978              4        165.0   0.999
1     35522       40223              4        715.0   2.010
1     47290       50819              3        120.0   1.928
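Two of these features take a little computation. A sketch of how they might be derived (the helper names are mine, not from the post; the distance is the standard haversine great-circle formula, which matches "straight-line GPS distance" for practical purposes):

```python
import math
from datetime import datetime

def seconds_since_midnight(dt):
    """Feature 1: 08:00 => 28,800 seconds elapsed since 00:00."""
    return dt.hour * 3600 + dt.minute * 60 + dt.second

def haversine_km(lat1, lon1, lat2, lon2):
    """Feature 5: great-circle distance in km between restaurant and recipient."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

order_time = datetime(2017, 6, 27, 8, 0, 0)
print(seconds_since_midnight(order_time))  # 28800
weekday = order_time.isoweekday()          # feature 3, as Monday=1 .. Sunday=7
```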

3. Build the model

The model is an MLP (multi-layer perceptron).

The architecture has two hidden layers of 50 neurons each (see the summary below).

Dropout randomly zeroes 25% of the hidden units' outputs during training, to help avoid overfitting.
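Rebuilt in Keras, a model matching that description, and the parameter counts in the summary that follows, might look like this. The post doesn't show the code, so the activations and optimizer ('relu', 'adam') are assumptions; the layer sizes and dropout rate come from the summary:

```python
from tensorflow import keras
from tensorflow.keras import layers

# 5 input features -> Dense(50) -> Dropout(0.25) -> Dense(50) -> Dropout(0.25)
# -> a single sigmoid unit: the predicted probability that the order is late
model = keras.Sequential([
    keras.Input(shape=(5,)),
    layers.Dense(50, activation='relu'),
    layers.Dropout(0.25),
    layers.Dense(50, activation='relu'),
    layers.Dropout(0.25),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
```

The parameter counts check out against the summary: (5+1)×50 = 300, (50+1)×50 = 2,550, and (50+1)×1 = 51, for 2,901 in total.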

Layer (type)                 Output Shape              Param #   
=================================================================
dense_25 (Dense)             (None, 50)                300
_________________________________________________________________
dropout_16 (Dropout)         (None, 50)                0
_________________________________________________________________
dense_26 (Dense)             (None, 50)                2550
_________________________________________________________________
dropout_17 (Dropout)         (None, 50)                0
_________________________________________________________________
dense_27 (Dense)             (None, 1)                 51
=================================================================
Total params: 2,901
Trainable params: 2,901
Non-trainable params: 0
_________________________________________________________________

4. Train the model

The model is trained for 50 epochs. (Note that Keras holds out a further validation split here, so the log shows 36,000 training and 4,000 validation samples out of the 40,000.) The training log:

Train on 36000 samples, validate on 4000 samples
Epoch 1/50
1s - loss: 0.6080 - acc: 0.7043 - val_loss: 0.8420 - val_acc: 0.4160
Epoch 2/50
0s - loss: 0.5719 - acc: 0.7044 - val_loss: 0.8220 - val_acc: 0.4160
Epoch 3/50
0s - loss: 0.5630 - acc: 0.7044 - val_loss: 0.8623 - val_acc: 0.4160
Epoch 4/50
0s - loss: 0.5626 - acc: 0.7135 - val_loss: 0.8599 - val_acc: 0.4205
Epoch 5/50
0s - loss: 0.5614 - acc: 0.7201 - val_loss: 0.8360 - val_acc: 0.4235
Epoch 6/50
0s - loss: 0.5599 - acc: 0.7228 - val_loss: 0.8640 - val_acc: 0.4225
Epoch 7/50
0s - loss: 0.5593 - acc: 0.7289 - val_loss: 0.8484 - val_acc: 0.4243
Epoch 8/50
0s - loss: 0.5580 - acc: 0.7331 - val_loss: 0.8200 - val_acc: 0.4325
Epoch 9/50
0s - loss: 0.5558 - acc: 0.7369 - val_loss: 0.8183 - val_acc: 0.4455
Epoch 10/50
0s - loss: 0.5549 - acc: 0.7416 - val_loss: 0.8893 - val_acc: 0.4285
Epoch 11/50
0s - loss: 0.5531 - acc: 0.7453 - val_loss: 0.8699 - val_acc: 0.4328
Epoch 12/50
0s - loss: 0.5513 - acc: 0.7487 - val_loss: 0.8541 - val_acc: 0.4612
Epoch 13/50
0s - loss: 0.5498 - acc: 0.7509 - val_loss: 0.8374 - val_acc: 0.4720
Epoch 14/50
0s - loss: 0.5485 - acc: 0.7532 - val_loss: 0.8222 - val_acc: 0.4755
Epoch 15/50
0s - loss: 0.5472 - acc: 0.7552 - val_loss: 0.8516 - val_acc: 0.4737
Epoch 16/50
0s - loss: 0.5452 - acc: 0.7581 - val_loss: 0.8349 - val_acc: 0.4725
Epoch 17/50
0s - loss: 0.5444 - acc: 0.7566 - val_loss: 0.8619 - val_acc: 0.4740
Epoch 18/50
0s - loss: 0.5434 - acc: 0.7576 - val_loss: 0.8681 - val_acc: 0.4817
Epoch 19/50
0s - loss: 0.5426 - acc: 0.7595 - val_loss: 0.8281 - val_acc: 0.4975
Epoch 20/50
0s - loss: 0.5408 - acc: 0.7596 - val_loss: 0.8386 - val_acc: 0.4890
Epoch 21/50
0s - loss: 0.5409 - acc: 0.7604 - val_loss: 0.8589 - val_acc: 0.4780
Epoch 22/50
0s - loss: 0.5406 - acc: 0.7608 - val_loss: 0.8116 - val_acc: 0.5085
Epoch 23/50
0s - loss: 0.5379 - acc: 0.7613 - val_loss: 0.8493 - val_acc: 0.4885
Epoch 24/50
0s - loss: 0.5370 - acc: 0.7632 - val_loss: 0.8339 - val_acc: 0.4945
Epoch 25/50
0s - loss: 0.5368 - acc: 0.7634 - val_loss: 0.8293 - val_acc: 0.4972
Epoch 26/50
0s - loss: 0.5368 - acc: 0.7637 - val_loss: 0.8559 - val_acc: 0.4865
Epoch 27/50
0s - loss: 0.5360 - acc: 0.7629 - val_loss: 0.8265 - val_acc: 0.5127
Epoch 28/50
0s - loss: 0.5366 - acc: 0.7630 - val_loss: 0.8437 - val_acc: 0.5000
Epoch 29/50
0s - loss: 0.5358 - acc: 0.7617 - val_loss: 0.8552 - val_acc: 0.4892
Epoch 30/50
0s - loss: 0.5339 - acc: 0.7646 - val_loss: 0.8532 - val_acc: 0.4985
Epoch 31/50
0s - loss: 0.5345 - acc: 0.7633 - val_loss: 0.8315 - val_acc: 0.5030
Epoch 32/50
0s - loss: 0.5335 - acc: 0.7635 - val_loss: 0.8540 - val_acc: 0.4935
Epoch 33/50
0s - loss: 0.5334 - acc: 0.7642 - val_loss: 0.8278 - val_acc: 0.5077
Epoch 34/50
0s - loss: 0.5315 - acc: 0.7627 - val_loss: 0.8398 - val_acc: 0.5030
Epoch 35/50
0s - loss: 0.5309 - acc: 0.7634 - val_loss: 0.8189 - val_acc: 0.5155
Epoch 36/50
0s - loss: 0.5304 - acc: 0.7633 - val_loss: 0.8561 - val_acc: 0.4965
Epoch 37/50
0s - loss: 0.5298 - acc: 0.7634 - val_loss: 0.8361 - val_acc: 0.5010
Epoch 38/50
0s - loss: 0.5302 - acc: 0.7636 - val_loss: 0.8322 - val_acc: 0.5015
Epoch 39/50
0s - loss: 0.5280 - acc: 0.7649 - val_loss: 0.7999 - val_acc: 0.5233
Epoch 40/50
0s - loss: 0.5291 - acc: 0.7641 - val_loss: 0.8377 - val_acc: 0.4997
Epoch 41/50
0s - loss: 0.5272 - acc: 0.7647 - val_loss: 0.8138 - val_acc: 0.5040
Epoch 42/50
0s - loss: 0.5285 - acc: 0.7640 - val_loss: 0.8447 - val_acc: 0.4950
Epoch 43/50
0s - loss: 0.5274 - acc: 0.7647 - val_loss: 0.8716 - val_acc: 0.4987
Epoch 44/50
0s - loss: 0.5272 - acc: 0.7646 - val_loss: 0.8332 - val_acc: 0.5070
Epoch 45/50
0s - loss: 0.5268 - acc: 0.7662 - val_loss: 0.8189 - val_acc: 0.5138
Epoch 46/50
0s - loss: 0.5268 - acc: 0.7676 - val_loss: 0.8281 - val_acc: 0.5110
Epoch 47/50
0s - loss: 0.5260 - acc: 0.7692 - val_loss: 0.8099 - val_acc: 0.5150
Epoch 48/50
0s - loss: 0.5244 - acc: 0.7710 - val_loss: 0.8181 - val_acc: 0.5138
Epoch 49/50
0s - loss: 0.5240 - acc: 0.7704 - val_loss: 0.8620 - val_acc: 0.4950
Epoch 50/50
0s - loss: 0.5237 - acc: 0.7702 - val_loss: 0.8241 - val_acc: 0.5090
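A training call that produces a log like the one above could look like this. The model is rebuilt from step 3 so the sketch is self-contained, and random stand-in data replaces the real features; `validation_split=0.1` is what makes Keras report 36,000 training and 4,000 validation samples. The batch size is an assumption, and the epoch count is reduced here to keep the sketch quick (the post uses 50):

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Stand-in data: 40,000 training and 10,000 test orders, 5 features each
rng = np.random.default_rng(0)
X_train = rng.random((40_000, 5), dtype=np.float32)
y_train = rng.integers(0, 2, size=40_000)
X_test = rng.random((10_000, 5), dtype=np.float32)
y_test = rng.integers(0, 2, size=10_000)

# Same architecture as in step 3
model = keras.Sequential([
    keras.Input(shape=(5,)),
    layers.Dense(50, activation='relu'),
    layers.Dropout(0.25),
    layers.Dense(50, activation='relu'),
    layers.Dropout(0.25),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# validation_split=0.1 -> "Train on 36000 samples, validate on 4000 samples"
history = model.fit(X_train, y_train, epochs=2, batch_size=200,
                    validation_split=0.1, verbose=2)

loss, score = model.evaluate(X_test, y_test, verbose=0)
```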

During training there is a large gap between training accuracy (around 0.77) and validation accuracy (around 0.51), which suggests overfitting.

5. Prediction results

The model's prediction accuracy is 73%:

score = 0.73050000000000004
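For a binary model like this, Keras's accuracy metric is simply the fraction of predicted lateness probabilities that, thresholded at 0.5, match the labels. Computed by hand it looks like this (the array values are illustrative, not from the post's data):

```python
import numpy as np

def accuracy(probabilities, labels, threshold=0.5):
    """Fraction of orders whose thresholded lateness probability matches the label."""
    predictions = (probabilities >= threshold).astype(int)
    return float(np.mean(predictions == labels))

probs = np.array([0.9, 0.2, 0.7, 0.1])
labels = np.array([1, 0, 0, 0])
print(accuracy(probs, labels))  # 0.75
```

Seeing the score this way also hints at an improvement the post doesn't need yet: the threshold itself can be tuned if late and on-time orders have different costs.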

Our data contains a lot of noise. For example, couriers sometimes fail to update an order's status in time, so an on-time order gets recorded as late; or the expected delivery time is set incorrectly, so the data doesn't match what actually happened.

Directions for improvement:

Perhaps accuracy could be improved by adding more features, increasing dropout, or making the network wider and deeper.
