Tim Wong
Published in 深思心思 · 6 min read · Jan 4, 2020


[ML] Wine Quality Data Analysis with PyTorch 之一

Date: 2020-Jan-4, Author: Tim Wong

Data Source:

https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv

The code below uses PyTorch, which is obviously overkill for this task; the real show begins later, when we bring in a neural network.

import torch
import csv
import numpy as np

dataFilePath = './data/winequality-white.csv'

# Load the CSV (semicolon-delimited, skip the header row) and grab the headings
wine_data = np.loadtxt(dataFilePath, dtype=np.float32, delimiter=";", skiprows=1)
heading = next(csv.reader(open(dataFilePath), delimiter=';'))

wine_data = torch.from_numpy(wine_data)
wine_feature = wine_data[:, :-1]        # first 11 columns: features
wine_target = wine_data[:, -1].long()   # last column: quality score (0-9)

# One-hot encode the quality score into an (N, 10) tensor
wine_target_oneHot = torch.zeros(wine_target.shape[0], 10) \
    .scatter_(1, wine_target.unsqueeze(1), 1.0)

# z-score normalization: subtract per-column mean, divide by per-column std
wine_data_mean = torch.mean(wine_data, dim=0)
wine_data_std = torch.std(wine_data, dim=0)
wine_data = (wine_data - wine_data_mean) / wine_data_std
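As an aside, the `scatter_` call above is the idiomatic PyTorch way to build one-hot vectors. A minimal toy sketch (the scores here are made up for illustration):

```python
import torch

# Three toy "quality" scores in the range 0-9
target = torch.tensor([3, 7, 5])

# scatter_(dim=1, index, value) writes 1.0 into column `index` of each row
one_hot = torch.zeros(target.shape[0], 10).scatter_(1, target.unsqueeze(1), 1.0)

print(one_hot)  # each row has a single 1.0 at the column given by its score
```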

Min-max normalization scales every feature into the 0–1 range.

Z-score normalization gives each feature a mean of 0 and a standard deviation of 1, but it does not guarantee that the values fall within any particular range [x, y].
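For comparison, a min-max version of the same per-column normalization could look like this (a sketch on a toy tensor, not part of the original pipeline):

```python
import torch

# Toy data: two features over three samples
data = torch.tensor([[7.0, 0.27],
                     [6.3, 0.30],
                     [8.1, 0.28]])

# min-max: (x - min) / (max - min), computed per feature (dim=0)
d_min = data.min(dim=0).values
d_max = data.max(dim=0).values
data_mm = (data - d_min) / (d_max - d_min)

print(data_mm)  # every entry now lies in [0, 1]
```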

Now let's try an extremely simple classification method and see how well it predicts.

bad_wine = wine_feature[torch.le(wine_target, 3)]
mid_wine = wine_feature[torch.gt(wine_target, 3) & torch.lt(wine_target, 7)]
good_wine = wine_feature[torch.ge(wine_target, 7)]

bad_mean = torch.mean(bad_wine, dim=0)
mid_mean = torch.mean(mid_wine, dim=0)
good_mean = torch.mean(good_wine, dim=0)

# Print the per-feature mean of each group side by side
for i, args in enumerate(zip(heading, bad_mean, mid_mean, good_mean)):
    print('{:2} {:20} {:6.2f} {:6.2f} {:6.2f}'.format(i, *args))
------------
output:
0 fixed acidity 7.60 6.89 6.73
1 volatile acidity 0.33 0.28 0.27
2 citric acid 0.34 0.34 0.33
3 residual sugar 6.39 6.71 5.26
4 chlorides 0.05 0.05 0.04
5 free sulfur dioxide 53.33 35.42 34.55
6 total sulfur dioxide 170.60 141.83 125.25
7 density 0.99 0.99 0.99
8 pH 3.19 3.18 3.22
9 sulphates 0.47 0.49 0.50
10 alcohol 10.34 10.26 11.42

Using scores of 3 and 7 as cut-offs, the wines are split into bad, mid and good. Notice that "total sulfur dioxide" falls as quality rises: good wines sit below the mid-wine average of 141.83.

If we use this threshold to decide good wine vs. bad wine, is that a good method? (Ground truth: a score above 5 is good, 5 or below is bad.)

total_sulfur_threshold = 141.83
total_sulfur_data = wine_feature[:, 6]

# Predict "good" whenever total sulfur dioxide is below the threshold
myGoodWinePrediction = torch.lt(total_sulfur_data, total_sulfur_threshold)
actual_good_wine = torch.gt(wine_target, 5)

Let's evaluate this method.

condition = actual_good_wine & myGoodWinePrediction
truePositive = torch.sum(condition).item()
predPositive = torch.sum(myGoodWinePrediction).item()
actualPositive = torch.sum(actual_good_wine).item()

precision = truePositive / predPositive
recall = truePositive / actualPositive
# Note: this is not the textbook specificity (true negatives / actual
# negatives, which never exceeds 1); it compares everything-but-true-positives
# against actual negatives, which is why it can go above 100%.
specificity = (wine_data.shape[0] - truePositive) / (wine_data.shape[0] - actualPositive)

print("Precision :", round(precision * 100, 0), "% (the prediction is xx% accurate)")
print("Recall :", round(recall * 100), "% (only xx% of positives were identified by this method)")
print("Specificity :", round(specificity * 100), "% (xx% more negative predictions than ground truth)")

The result:

Precision : 74.0 % (the prediction is xx% accurate)
Recall : 62 % (only xx% of positives were identified by this method)
Specificity : 176 % (xx% more negative predictions than ground truth)
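A specificity above 100% is a hint that the formula used here is not the standard one, which is bounded by 1. A minimal sketch of the textbook confusion-matrix quantities, using small hypothetical boolean tensors in place of the real `actual_good_wine` / `myGoodWinePrediction`:

```python
import torch

# Hypothetical stand-ins for the real prediction and ground-truth tensors
actual = torch.tensor([True, True, False, False, False])
pred   = torch.tensor([True, False, True, False, False])

tp = (actual & pred).sum().item()     # predicted good, actually good
tn = (~actual & ~pred).sum().item()   # predicted bad, actually bad
fp = (~actual & pred).sum().item()    # predicted good, actually bad
fn = (actual & ~pred).sum().item()    # predicted bad, actually good

precision   = tp / (tp + fp)
recall      = tp / (tp + fn)
specificity = tn / (tn + fp)          # always between 0 and 1

print(precision, recall, specificity)
```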

Not a good prediction. Let's see whether a neural network can help.

A ball of fire sprinting at full speed

I'm Tim | timwong.ai@gmail.com
