Learn Machine Learning 01 — Classification

Editor: Ishmael Njie

DataRegressed Team
Aug 10, 2018

Just someone pouring wine… Image credit: theresonance

In supervised learning, classification algorithms are used to predict the class of an instance based on its features. In this post, we will look at classifying wine samples as good or bad!

We will use Scikit-Learn in Python for the Machine Learning modules. If you would like to see the full implementation, then check out my Github repo for this series.

The data used is the Wine Quality data set, which can be found on the UCI Machine Learning Repository website, a great source of datasets geared towards Machine Learning tasks. The dataset provides a number of features along with the target variable (quality). Features include pH level and alcohol %.

Let’s get stuck in!

First we need to import our dependencies:

import pandas as pd 
import seaborn as sb
import matplotlib.pyplot as plt

#Algos
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier

#Tools for modelling
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.model_selection import train_test_split

We import pandas for data manipulation and preprocessing, as well as seaborn and matplotlib for data visualisation purposes.

Secondly, our classifiers; we will go into the algorithms later on in the post. We bring in the Logistic Regression, Multi-Layer Perceptron and Random Forest modules from Scikit-Learn.

To aid in modelling, we needed to bring in some other dependencies:

  • StandardScaler is used to standardise the feature values, also referred to as normalisation.
  • confusion_matrix allows us to visualise the accuracy/performance of the models. The classification_report produces the precision, recall and f-score of each model. Finally, the accuracy_score allows us to print the accuracy score of each model.
  • train_test_split is a fantastic module for splitting our dataset into training and testing data.

We will perform classification on the red wine dataset. After bringing in the data, we can use seaborn to visualise the correlations between the features via a heat map.

Correlation heatmap
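
Below is a minimal sketch of how such a heat map can be produced, assuming the red wine file has been downloaded from the UCI repository as winequality-red.csv (the UCI file is semicolon-separated); the notebook in the repo may load and plot it differently.

df1 = pd.read_csv('winequality-red.csv', sep=';')

# Pairwise correlations between all columns, plotted as a heat map
plt.figure(figsize=(10, 8))
sb.heatmap(df1.corr(), annot=True, cmap='coolwarm')
plt.show()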

The feature we are interested in is the ‘quality’ feature. As you can see, ‘alcohol’ has a fairly strong positive correlation with ‘quality’ at 0.5, with ‘sulphates’ behind at 0.3.

Now, let us look at the number of samples in each quality class. The quality of a wine sample is a score between 0 and 10.

Number of samples in each quality class: multi-class

We can see that the samples roughly follow a Gaussian shape, with the majority of the samples in the ‘middle’. The information that came with the dataset notes that the classes are not balanced and that there are many more ‘normal’ quality samples than great or poor ones. The classes present are [3, 4, 5, 6, 7, 8]; we will relabel them, treating 3, 4 and 5 as bad wine samples and 6, 7 and 8 as good wine samples. This turns the task from a multi-class classification problem into a binary classification problem (0/1).
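
One way to produce the binary target from the original quality scores is shown below (the notebook in the repo may do this slightly differently):

# Scores 3-5 become 0 (bad), scores 6-8 become 1 (good)
df1['quality'] = (df1['quality'] > 5).astype(int)

# Check the new class counts
print(df1['quality'].value_counts())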

Number of samples in each quality class: Binary classification

After producing these new classes, we can see that there are more good wine samples than bad ones. This is important to highlight in classification: an imbalance in the classes may affect the performance of the classifiers.

Time to set up our features and response for training.

Below, we define our feature vector X and our response y. We will include all of the features in the dataset for this model (we will look at feature selection later).

X = df1[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
         'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
         'pH', 'sulphates', 'alcohol']]
y = df1['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

After we define our input and response, we split the dataset into training and testing sets. We train each model on the training data, then use the testing data to assess how the model performs on unseen data. We use an 80:20 split.

We also normalise our dataset, since the features vary in magnitude and features with large magnitudes are likely to dominate the models during training.
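
A minimal sketch of that scaling step with StandardScaler; one common approach, shown here, is to fit the scaler on the training set only and then apply the same transformation to the test set:

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)   # learn the mean and standard deviation from the training data
X_test = scaler.transform(X_test)         # apply the same transformation to the test data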

Now, let’s look at our models:

  • Logistic Regression:

Logistic Regression is a classification model that is often used in binary classification problems. It uses the sigmoid function to estimate the probability of the response.

Given the sigmoid function in the image below, t is a linear function of the input features x, which defines the proposed linear separation of the instances.

Sigmoid plot
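
For concreteness, here is a tiny sketch of the sigmoid itself (numpy is assumed here; it is not part of the imports above):

import numpy as np

# The sigmoid squashes any real-valued t into (0, 1), interpreted as the
# probability of the positive class; t is a linear function of the features.
def sigmoid(t):
    return 1 / (1 + np.exp(-t))

print(sigmoid(0))   # 0.5, i.e. right on the decision boundary
print(sigmoid(4))   # ~0.98, confidently in the positive class
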
  • Multi-Layer Perceptron (MLP):
Example neural net

MLP is a form of Neural Network consisting of three or more layers: an input layer, an output layer and one or more hidden layers. Neural Nets act as a tool in Machine Learning to find patterns in the input data and their relationship with the target variable. The fully connected network takes the input values and maps them to each neuron in the next layer. Each neuron in the following layer holds a value that is a linear combination of the neurons in the previous layer; this value is then passed through an activation function that determines the output of the node. The sigmoid function seen in the Logistic Regression portion of this post is an example of an activation function.
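
As a toy illustration of one fully connected layer (the numbers below are arbitrary and not from the wine data):

import numpy as np

# Each neuron in the next layer holds a linear combination of the previous
# layer's values, passed through an activation function (sigmoid here).
x = np.array([0.2, -1.3, 0.7])   # values held by the previous layer
W = np.random.randn(4, 3)        # weights: 4 neurons, each connected to the 3 inputs
b = np.zeros(4)                  # biases

layer_output = 1 / (1 + np.exp(-(W @ x + b)))
print(layer_output)              # four activations, each between 0 and 1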

  • Random Forest:
Example random forest.

Random Forests are known as ensemble methods: they build many small classifiers and take a majority vote of their predicted classes to make the final prediction. Specifically, Random Forests build many smaller Decision Tree classifiers; an advantage of this is that the Random Forest mitigates the overfitting that a single Decision Tree is prone to.
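
Here is a minimal sketch of how the three models can be trained and evaluated with their default settings; the notebook in the repo may structure this differently:

models = {
    'Logistic Regression': LogisticRegression(),
    'MLP': MLPClassifier(),
    'Random Forest': RandomForestClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train)                      # train on the training split
    y_pred = model.predict(X_test)                   # predict on unseen data
    print(name, accuracy_score(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))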

After training, we can see the accuracy results of each model:

Accuracy scores

As one can see, the Random Forest achieved the highest accuracy at 81.9%. The Logistic Regression showed a good accuracy at 74%, while the MLP did not perform as well at 54%; in any case, the Random Forest proved to be the best. These classifiers were implemented with their default settings; we will look at tuning the parameters in a later post.

Staying with the Random Forest, we can look at the importance of the features in the model:

Feature importance plot

Here we can see that the feature importance scores reinforce the findings of the correlation heat map with regard to ‘alcohol’: the alcohol % had the highest correlation with the quality of the wine sample, and the bar plot above shows that it was also the most important feature in the classifier.
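
A sketch of how such an importance plot might be produced, assuming a Random Forest fitted as above (the variable names and plotting details are illustrative):

clf_rf = RandomForestClassifier()
clf_rf.fit(X_train, y_train)

# feature_importances_ gives one score per input column
importances = pd.Series(clf_rf.feature_importances_, index=X.columns)
importances.sort_values().plot(kind='barh')
plt.title('Feature importance')
plt.show()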

Make sure you check out the notebook for a full look at the code!

Thank you for reading and follow us here at DataRegressed!
