Tim Wong
Published in 深思心思 · 6 min read · Jan 4, 2020


[ML] Wine Quality Data Analysis with PyTorch 之一

Date: 2020-Jan-4, Author: Tim Wong

Data Source:

https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv

The code below uses PyTorch, which is obviously overkill for this task; the real show begins later, when we bring in a neural network.

import torch
import csv
import numpy as np

dataFilePath = './data/winequality-white.csv'

# Load the CSV (semicolon-delimited, skip the header row) and grab the headings
wine_data = np.loadtxt(dataFilePath, dtype=np.float32, delimiter=";", skiprows=1)
heading = next(csv.reader(open(dataFilePath), delimiter=';'))

wine_data = torch.from_numpy(wine_data)
wine_feature = wine_data[:, :-1]        # first 11 columns: features
wine_target = wine_data[:, -1].long()   # last column: quality score (0-9)

# One-hot encode the quality score into an (N, 10) tensor
wine_target_oneHot = torch.zeros(wine_target.shape[0], 10) \
    .scatter_(1, wine_target.unsqueeze(1), 1.0)

# z-score normalization: subtract per-column mean, divide by per-column std
wine_data_mean = torch.mean(wine_data, dim=0)
wine_data_std = torch.std(wine_data, dim=0)
wine_data = (wine_data - wine_data_mean) / wine_data_std
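As an aside, the `scatter_` call above is the idiomatic PyTorch way to build one-hot vectors. A minimal toy sketch (the scores here are made up for illustration):

```python
import torch

# Three toy "quality" scores in the range 0-9
target = torch.tensor([3, 7, 5])

# scatter_(dim=1, index, value) writes 1.0 into column `index` of each row
one_hot = torch.zeros(target.shape[0], 10).scatter_(1, target.unsqueeze(1), 1.0)

print(one_hot)  # each row has a single 1.0 at the column given by its score
```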

Min-max normalization scales every feature into the 0–1 range.

Z-score normalization gives each feature a mean of 0 and a standard deviation of 1, but it does not guarantee that the values fall within any particular range [x, y].
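For comparison, a min-max version of the same per-column normalization could look like this (a sketch on a toy tensor, not part of the original pipeline):

```python
import torch

# Toy data: two features over three samples
data = torch.tensor([[7.0, 0.27],
                     [6.3, 0.30],
                     [8.1, 0.28]])

# min-max: (x - min) / (max - min), computed per feature (dim=0)
d_min = data.min(dim=0).values
d_max = data.max(dim=0).values
data_mm = (data - d_min) / (d_max - d_min)

print(data_mm)  # every entry now lies in [0, 1]
```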

Now let's try an extremely simple classification method and see how well it predicts.

bad_wine = wine_feature[torch.le(wine_target, 3)]
mid_wine = wine_feature[torch.gt(wine_target, 3) & torch.lt(wine_target, 7)]
good_wine = wine_feature[torch.ge(wine_target, 7)]

bad_mean = torch.mean(bad_wine, dim=0)
mid_mean = torch.mean(mid_wine, dim=0)
good_mean = torch.mean(good_wine, dim=0)

# Print the per-feature mean of each group side by side
for i, args in enumerate(zip(heading, bad_mean, mid_mean, good_mean)):
    print('{:2} {:20} {:6.2f} {:6.2f} {:6.2f}'.format(i, *args))
------------
output:
0 fixed acidity 7.60 6.89 6.73
1 volatile acidity 0.33 0.28 0.27
2 citric acid 0.34 0.34 0.33
3 residual sugar 6.39 6.71 5.26
4 chlorides 0.05 0.05 0.04
5 free sulfur dioxide 53.33 35.42 34.55
6 total sulfur dioxide 170.60 141.83 125.25
7 density 0.99 0.99 0.99
8 pH 3.19 3.18 3.22
9 sulphates 0.47 0.49 0.50
10 alcohol 10.34 10.26 11.42

Using scores of 3 and 7 as cut-offs, the wines are split into bad, mid and good. Notice that "total sulfur dioxide" falls as quality rises: good wines sit below the mid-wine average of 141.83.

If we use this threshold to decide good wine vs. bad wine, is that a good method? (Ground truth: a score above 5 is good, 5 or below is bad.)

total_sulfur_threshold = 141.83
total_sulfur_data = wine_feature[:, 6]

# Predict "good" whenever total sulfur dioxide is below the threshold
myGoodWinePrediction = torch.lt(total_sulfur_data, total_sulfur_threshold)
actual_good_wine = torch.gt(wine_target, 5)

Let's evaluate this method.

condition = actual_good_wine & myGoodWinePrediction
truePositive = torch.sum(condition).item()
predPositive = torch.sum(myGoodWinePrediction).item()
actualPositive = torch.sum(actual_good_wine).item()

precision = truePositive / predPositive
recall = truePositive / actualPositive
# Note: this is not the textbook specificity (true negatives / actual
# negatives, which never exceeds 1); it compares everything-but-true-positives
# against actual negatives, which is why it can go above 100%.
specificity = (wine_data.shape[0] - truePositive) / (wine_data.shape[0] - actualPositive)

print("Precision :", round(precision * 100, 0), "% (the prediction is xx% accurate)")
print("Recall :", round(recall * 100), "% (only xx% of positives were identified by this method)")
print("Specificity :", round(specificity * 100), "% (xx% more negative predictions than ground truth)")

The result:

Precision : 74.0 % (the prediction is xx% accurate)
Recall : 62 % (only xx% of positives were identified by this method)
Specificity : 176 % (xx% more negative predictions than ground truth)
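A specificity above 100% is a hint that the formula used here is not the standard one, which is bounded by 1. A minimal sketch of the textbook confusion-matrix quantities, using small hypothetical boolean tensors in place of the real `actual_good_wine` / `myGoodWinePrediction`:

```python
import torch

# Hypothetical stand-ins for the real prediction and ground-truth tensors
actual = torch.tensor([True, True, False, False, False])
pred   = torch.tensor([True, False, True, False, False])

tp = (actual & pred).sum().item()     # predicted good, actually good
tn = (~actual & ~pred).sum().item()   # predicted bad, actually bad
fp = (~actual & pred).sum().item()    # predicted good, actually bad
fn = (actual & ~pred).sum().item()    # predicted bad, actually good

precision   = tp / (tp + fp)
recall      = tp / (tp + fn)
specificity = tn / (tn + fp)          # always between 0 and 1

print(precision, recall, specificity)
```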

Not a good prediction. Let's see whether a neural network can help.

A ball of fire sprinting at full speed

I'm Tim | timwong.ai@gmail.com
