Step-by-step guide for predicting Wine Preferences using Scikit-Learn

Nataliia Rastoropova
Analytics Vidhya
Published in
11 min readMay 17, 2019

In case you are new at Machine Learning and it’s hair-raising to write Machine Learning project just dive into the data science pipeline with this blog-post. This blog-post covers the basic Machine Learning process in Python step-by-step.

Basic steps of a data science pipeline:

  1. Frame the problem and look at the big picture.
  2. Get the data.
  3. Explore the data to gain insights.
  4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms.
  5. Explore many different models and short-list the best ones.
  6. Fine-tune your models and combine them into a great solution.
  7. Present your solution.
  8. Launch, monitor, and maintain your system.

Feel free to adapt this pipeline to your needs!

1. Frame the problem and look at the big picture.

Firstly, we need to define the objective in business terms. For our work is to predict human wine taste preferences that are based on easily available analytical tests at the certification step. We expect to get an accuracy score of more than 90%. Predicted value could be used for designing new types of wine, defining pricing policy or supporting decision making in advisory systems.

During problem framing, we defined that our system would be using batch learning techniques. It would be an expensive and unnecessary task if we had gathered more data and provided controlled research of wine ranking based on physicochemical properties.

Therefore the system will be trained and then launched into production and run without learning anymore. It just will apply what it has learned. This will generally take time and computing resources.

Also, we could try two kinds of approaches:

  • build an instance-based learning system by comparing new data points to known data points.
  • build a model-based learning system by detecting patterns in the training data and build a predictive model, much like scientists do.

Let’s make assumptions.

The null hypothesis (H0) is that none of the variance in the quality ranking is explained by physicochemical properties. The alternate hypothesis (H1) is that physicochemical properties contribution to the variance in the quality ranking and make a wine ‘good’ or vice verse ‘bad’.

2. Get the data

We will use a real data set related to red Vinho Verde wine samples, from the north of Portugal. This dataset is available from the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality. This dataset can be viewed as classification or regression tasks.

Input variables (based on physicochemical tests):
1 fixed acidity;
2 volatile acidity;
3 citric acid;
4 residual sugar;
5 chlorides;
6 free sulfur dioxide;
7 total sulfur dioxide;
8 density;
9 pH;
10 sulphates;
11 alcohol.

Output variable (based on sensory data):
12 quality (score between 0 and 10).

Let’s get the data and convert them to a format of a data frame for making manipulation with their easier. Data investigation is an interesting and addictive task. Take a look at your data, check the dimensionality and type of them.

In [1]:

# Load in the red wine data from the UCI ML website.
df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv',sep=';')
# Take a look
print(df.head(10))
# Data dimensionality (rows, colums)
print(df.shape)
# Data distributing
df.info()

Out [1]:

Before data investigation you should have been done with one of the most important steps is data splitting. The data has been split into two groups: training set 80%, test set 20%. The training set should be used to build your machine learning models. The test set should be used to see how well your model performs on unseen data.

In [2]:

# Now seperate the dataset as response variable and feature variabes
X = df.drop('quality', axis=1)
y = df['quality']
# Train and Test splitting of data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=50)
# Applying Standard scaling to get optimized result
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.fit_transform(X_test)

No data snooping!

3. Explore the data to gain insights

So far we have only taken a quick glance at the data to get a general understanding of the kind of data we are manipulating. Now we would be exploring data using a test set.

It would be interesting to take a look at the basic statistical characteristics of each numerical feature. The count, mean, min, and max rows are self-explanatory. The std row shows the standard deviation (which measures how dispersed the values are). The 25%, 50%, and 75% of rows show the corresponding percentiles.

In [3]:

# Statistical characteristics of each numerical feature
print(df.describe())

Out [3]:

Visualizing data is crucial for recognizing underlying patterns to exploit in the model. When the data visualizing properly it’s clear to see trends and patterns, the correlation between variables, because our brains are very good at spotting patterns on pictures. Let’s play around with different types of data visualization, parameters to make the patterns stand out.

Using the following plots we may understand data distribution for separate attributes; for example, data distribution for attribute “alcohol” is positively skewed, for attribute “density” data quite normally distributed. Take attention to the wine quality data distribution. It’s a bimodal distribution and there are more wines with average quality than wines with ‘good’ or ‘bad’ quality.

In [4]:

# Histograms
df.hist(bins=10,figsize=(6, 5))
plt.show()
# Density
df.plot(kind='density', subplots=True, layout=(4,3), sharex=False)
plt.show()

Out [4]:

Histogram
Density

Another way exploring data is an incredibly handy tool pivot table. A pivot table is a summary of your data, packaged in a chart that lets you report on and explore trends based on your information. Pivot tables are particularly useful if you have long rows or columns that hold values you need to track the sums of and easily compare to one another. So, our pivot table describes the median value each feature for each score of quality. Now we can follow trends, for example, the highest value for ‘sulphates’ tend the highest ‘quality’ score. But we can’t draw our conclusion based on correlation.

Correlation does not cause causation.

In [5]:

# Create pivot_table
colum_names = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']
df_pivot_table = df.pivot_table(colum_names,
['quality'], aggfunc='median')
print(df_pivot_table)

Out [5]:

For understanding how much each attribute correlates with the quality score of wine compute the standard correlation coefficient (also called Pearson’s r) between every pair of attributes.

In [6]:

corr_matrix = df.corr()
print(corr_matrix["quality"].sort_values(ascending=False))

Out [5]:

The correlation coefficient ranges from –1 to 1. When it is close to 1, it means that there is a strong positive correlation; for example, the ‘quality’ value tends to go up when the ‘alcohol’ goes up. When the coefficient is close to –1, it means that there is a strong negative correlation; you can see a small negative correlation between the ‘volatile acidity’ and the ‘quality’ value. Finally, coefficients close to zero mean that there is no linear correlation.

You could see more detailed information about data correlation using the correlation matrix. The correlation matrix gives us information about how the two variables interact, both the direction and magnitude.

In [7]:

colum_names = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality']
# Correlation matrix
correlations = df.corr()
# Plot figsize
fig, ax = plt.subplots(figsize=(10, 10))
# Generate Color Map
colormap = sns.diverging_palette(220, 10, as_cmap=True)
# Generate Heat Map, allow annotations and place floats in map
sns.heatmap(correlations, cmap=colormap, annot=True, fmt=".2f")
ax.set_xticklabels(
colum_names,
rotation=45,
horizontalalignment='right'
);
ax.set_yticklabels(colum_names);
plt.show()

Out [7]:

You could visualize the scatterplot matrix for the better understanding relationship between a pair of variables. It plots every numerical attribute against every other.

In[8]:

# Scatterplot Matrix
sm = scatter_matrix(df, figsize=(6, 6), diagonal='kde')
#Change label rotation
[s.xaxis.label.set_rotation(40) for s in sm.reshape(-1)]
[s.yaxis.label.set_rotation(0) for s in sm.reshape(-1)]
#May need to offset label when rotating to prevent overlap of figure
[s.get_yaxis().set_label_coords(-0.6,0.5) for s in sm.reshape(-1)]
#Hide all ticks
[s.set_xticks(()) for s in sm.reshape(-1)]
[s.set_yticks(()) for s in sm.reshape(-1)]
plt.show()

Out[8]:

Take attention to the correlation between attributes ‘fixed acidity’ and ‘density’. The correlation coefficient for them is 0.67 (you could find value at the correlation matrix) and looking at corresponding for them scatterplot we can see the positive linear correlation between attributes. You can also clearly see the upward trend and the points are not too dispersed between those attributes. In that way, you could analyze other attributes.

One last thing you may want to do before actually preparing the data for Machine Learning algorithms is to try out various attribute combinations.

4. Prepare the data to better expose the underlying data patterns to Machine Learning algorithms

It’s time to prepare the data for our Machine Learning algorithms. In our dataset there aren’t any missing values, outliers, attributes that provide no useful information for the task. So, we could conclude that our data set is quite clean. Therefore, we won’t do any grueling data preparation, but some stuff will be needed to do. Human wine preferences scores varied from 3 to 8, so it’s straightforward to categorize answers into ‘bad’ or ‘good’ quality of wines. This allows us to practice with hyperparameter tuning on e.g. decision tree algorithms. Visualizing the graph of the number of values for each category, we could see that there are far many bad answers than good ones. Of course, machine learning algorithms operate digital values, so we assign for categorizes corresponding discrete values 0 or 1.

In[9]:

# Dividing wine as good and bad by giving the limit for the quality
bins = (2, 6, 8)
group_names = ['bad', 'good']
df['quality'] = pd.cut(df['quality'], bins = bins, labels = group_names)
# Now lets assign a labels to our quality variable
label_quality = LabelEncoder()
# Bad becomes 0 and good becomes 1
df['quality'] = label_quality.fit_transform(df['quality'])
print(df['quality'].value_counts())
sns.countplot(df['quality'])
plt.show()

Out[9]:

5. Explore many different models and short-list the best ones

Evaluating a Machine Learning model can be quite tricky. Usually model evaluated performance based on an error metric. Nevertheless, this method is not very reliable as the accuracy obtained for one test set can be very different to the accuracy obtained for a different test set.

In the example below 8 different algorithms are compared:

  1. Support Vector Classifier
  2. Stochastic Gradient Decent Classifier
  3. Random Forest Classifier
  4. Decision Tree Classifier
  5. Gaussian Naive Bayes
  6. K-Neighbors Classifier
  7. Ada Boost Classifier
  8. Logistic Regression

The key to a fair comparison of machine learning algorithms is ensuring that each algorithm is evaluated in the same way on the same data. K-fold Cross Validation(CV) provides a solution to this problem by dividing the data into folds and ensuring that each fold is used as a testing set at some point.

In[10]:

# prepare configuration for cross validation test harness
seed = 7
# prepare models
models = []
models.append(('SupportVectorClassifier', SVC()))
models.append(('StochasticGradientDecentC', SGDClassifier()))
models.append(('RandomForestClassifier', RandomForestClassifier()))
models.append(('DecisionTreeClassifier', DecisionTreeClassifier()))
models.append(('GaussianNB', GaussianNB()))
models.append(('KNeighborsClassifier', KNeighborsClassifier()))
models.append(('AdaBoostClassifier', AdaBoostClassifier()))
models.append(('LogisticRegression', LogisticRegression()))
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
kfold = model_selection.KFold(n_splits=10, random_state=seed)
cv_results = model_selection.cross_val_score(model, X_train, y_train, cv=kfold, scoring=scoring)
results.append(cv_results)
names.append(name)
msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
print(msg)
# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()

Out[10]:

6.Fine-tune your models and combine them into a great solution

There are several factors that can help you determine which algorithm performs best. One such factor is the performance on cross-validation set and another factor is the choice of parameters for an algorithm.

Let’s fine-tune some Machine Learning algorithms. The first algorithm that we trained and then evaluated was Support Vector Classifier and the mean value for model prediction was equal 0.873364. What is the most harmonious way to choose hyper-parameters to your model? It is trying to guess or looping over parameters and then run all the combinations of parameters? There is one more advantageous way to use Grid Search CV.

In[11]:

def svc_param_selection(X, y, nfolds):
param = {
'C': [0.1, 0.8, 0.9, 1, 1.1, 1.2, 1.3, 1.4],
'kernel': ['linear', 'rbf'],
'gamma': [0.1, 0.8, 0.9, 1, 1.1, 1.2, 1.3, 1.4]
}
grid_search = GridSearchCV(svc, param_grid=param, scoring='accuracy', cv=nfolds)
grid_search.fit(X, y)
return grid_search.best_params_
print(svc_param_selection(X_train, y_train,10))

Out[11]:

{'C': 1.3, 'gamma': 1.3, 'kernel': 'rbf'}

Let’s run our SVC again with the best parameters.

In[12]:

svc = SVC(C = 1.3, gamma =  1.3, kernel= 'rbf')
svc.fit(X_train, y_train)
pred_svc = svc.predict(X_test)
print('Confusion matrix')
print(confusion_matrix(y_test, pred_svc))print('Classification report')
print(classification_report(y_test, pred_svc2))
print('Accuracy score',accuracy_score(y_test, pred_svc2))

Out[12]:

So, after choosing parameters for an algorithm accuracy score for predicting wine preference is 93%.

Now let’s try to fine-tune another algorithm AdaBoost, with an accuracy score of 86%, using cross-validation.

In[13]:

ada_classifier = AdaBoostClassifier(n_estimators=100)
ada_classifier.fit(X_train, y_train)
pred_ada = ada_classifier.predict(X_test)

# Cross-validation
scores = cross_val_score(ada_classifier,X_test,y_test, cv=5)
print('Accuracy score',scores.mean())

Out[13]:

Accuracy score 0.903

Another absorbing thing that we could analyze is feature importance for Machine Learning Algorithms.

importance=ada_classifier.feature_importances_

std = np.std([tree.feature_importances_ for tree in ada_classifier.estimators_],
axis=0)
indices = np.argsort(importance)

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.barh(range(X.shape[1]), importance[indices],
color="b", align="center")

plt.yticks(range(X.shape[1]), colum_names)
plt.ylim([0, X.shape[1]])
plt.show()

7. Present your solution

Let’s make some conclusions based on the pipeline that we have done. Our data set is quite clean and representative. We categorize target value into discrete values that correspond ‘bad or ‘good’ quality of wines.

Getting inside the data set we found an interesting correlation between features ‘alcohol’ and ‘quality of wine’, ‘fixed acidity’ and ‘density’ and another one. After analyzing feature importance for Machine Learning Algorithms we could conclude that tuning features like ‘alcohol’, ‘sulphates’ and ‘pH’ may make wine scores higher or lower. Based on this information would be profitable to change your tune and play around with the physicochemical properties of wine. Because it could make impact on human preferences.

Through comparison analyze that we provided we could highlight Machine Learning algorithms like Support Vector Classifier and Random Forest Classifier. Box plots of those algorithm’s accuracy distribution quite symmetrical, without outliers. Adjacent box plot values are close together that corresponds to the high density of accuracy scores.

By the way, without tuning Machine Learning algorithms we can’t achieve average accuracy scores more than 90 % for prediction wine preference. We fine-tuned Support Vector Classifier using Grid Search CV and achieve accuracy score 93%. The goal had achieved.

8. Launch, monitor, and maintain your system

Now your solution is ready for production! This system uses batch learning technique and won't train on fresh data. Therefore, it doesn't tend to “rot” as data evolves over time. But if your model regularly trained on fresh data you need to write monitoring code to check your system’s live performance at regular intervals and trigger alerts when it drops. In that case, you could catch not only sudden breakage but also performance degradation.

A full Jupyter Notebook of code for the project can be located here.

Thank you for reading! I hope this article gave you a basic understanding of data science pipeline. If you have any questions or find a mistake feel free to comment or write to me shkarupa.nataliia@gmail.com.

--

--

Nataliia Rastoropova
Analytics Vidhya

Master’s Degree in Computer Science. Software Engineer. I’m not afraid to get my hands dirty with code. Addicted to ML and AI. Ukraine❤