How to train your #NeuralNetwork for Wine tasting?

Wine tasting is a fine art that enables the classification of wine. The practice varies considerably by region of origin and over time. It is one of the most tasteful traditions, and is even protected by laws of its own in certain regions. Classification can be based on vintage, sweetness, appellation, vinification style, and varietal or blend.

Is it possible to teach Neural Networks something about classifying different varieties of wine? Well, I intend to do exactly that, and to get hands-on with code as well.

In order to train a Neural Network to classify wine, we need to provide it a dataset that describes a variety of wine attributes with real values. Since the training shall be supervised, we need to ensure the wine “classes” are also part of the training data, so that we “teach” the Neural Net which attribute patterns belong to which class.

The Wine Dataset

I shall use the UC Irvine Machine Learning Repository, which has a wine dataset with 13 attributes based on a chemical analysis of wines grown in the same region of Italy. The wines are from 3 different cultivars and hence have distinct chemical compositions that help identify which cultivar a wine is from.

The link to the wine data set is > here

Citation: Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science

The characteristics of the data set are as follows (reproduced from the UCI link):

  • Data Set Characteristics: Multivariate
  • Number of Instances: 178
  • Attribute Characteristics: Integer, Real
  • Number of Attributes: 13
  • Missing Values? None

The names of the attributes are as follows:

  1. Alcohol
  2. Malic acid
  3. Ash
  4. Alcalinity of ash
  5. Magnesium
  6. Total phenols
  7. Flavanoids
  8. Nonflavanoid phenols
  9. Proanthocyanins
  10. Color intensity
  11. Hue
  12. OD280/OD315 of diluted wines
  13. Proline

The data values for these attributes, in comma-separated format, are available in the UCI data folder > here

A sample of the data looks like this:

1,14.23,1.71,2.43,15.6,127,2.8,3.06,.28,2.29,5.64,1.04,3.92,1065
2,12.37,.94,1.36,10.6,88,1.98,.57,.28,.42,1.95,1.05,1.82,520
3,12.25,3.88,2.2,18.5,112,1.38,.78,.29,1.14,8.21,.65,2,855
...

Note that the first value describes the “class” of the data. In other words, it is the ID of the cultivar. For the sake of this exercise, we don’t need the exact name or locality of the cultivar. We are fine with the data as long as the values for the next 13 attributes are available.

Now, let’s design the Neural Network that shall taste the wine and classify it into the correct cultivar.

Neural Network Type and Size

We need to determine the type of network, the number of input, hidden, and output neurons, and the type of activation functions to use.

We shall use a standard Multilayer Feedforward Neural Network since this is good enough for classification tasks.

From the dataset, we know that there are 13 attributes, so we shall use 13 input neurons for the network.

Also, the total number of cultivars is 3, so we shall use 3 output neurons. The outputs shall be mutually exclusive (only one output neuron shall be activated, denoting the correct cultivar for the input values).

Why are we using 3 separate output neurons instead of 1 output neuron that learns the 3 different IDs? Well, because the IDs themselves don’t carry any real meaning. The values 1, 2, 3 could be replaced with 22, 23, 24 to represent the cultivars at any time and the dataset would still hold. The class is a nominal (categorical) label, not a quantity.
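Since the class IDs are just labels, supervised training typically encodes them as mutually exclusive one-hot vectors, one output neuron per class. A minimal Python sketch (the article’s actual code is Java/DL4J; this helper and its name are purely illustrative):

```python
def one_hot(class_id, num_classes=3):
    """Encode a 0-based class ID as a mutually exclusive one-hot vector."""
    vec = [0.0] * num_classes
    vec[class_id] = 1.0
    return vec

print(one_hot(0))  # class 0 -> [1.0, 0.0, 0.0]
```

Swapping the IDs 1, 2, 3 for 22, 23, 24 would change nothing here: only the position of the 1.0 in the vector matters.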

Now, how many hidden neurons are needed? Let’s use our learnings from the previous post “is Optimizing your Neural Network a Dark-Art?”, which gives us a rule of thumb as follows:

Rule of thumb to determine hidden neurons:

N_hidden = N_samples / (alpha × (N_inputs + N_outputs))

As per the stated rule of thumb we get:

N_hidden = 178 / (2 × (13 + 3)) ≈ 5.56

The equation is self-explanatory. I am using an ‘alpha’ of 2 since the number of input samples is low. Let’s round up the hidden neurons needed to 6.
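Plugging the dataset’s numbers into the rule of thumb, a quick Python check (variable names are my own):

```python
import math

# N_hidden = N_samples / (alpha * (N_inputs + N_outputs))
n_samples, n_inputs, n_outputs, alpha = 178, 13, 3, 2
n_hidden = n_samples / (alpha * (n_inputs + n_outputs))
print(n_hidden)             # 5.5625
print(math.ceil(n_hidden))  # rounded up to 6 hidden neurons
```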

Feature Scaling: Or, Normalizing the Inputs

Before we design the activities of the network, it’s prudent to learn the characteristics of the input data. Without really understanding the data, there is no way you can design the activities of the Neural Net.

Note: All the values for input and output in a Neural Network MUST always be numerical.

Some of the characteristics of the input data can be:

  • The features may be integers, or a combination of real values and integers. (discrete versus continuous)
  • Different features may vary in scale. Some features can be small fractions while others can be very large numbers. (across features)
  • Each feature by itself may have values ranging from low to high. (within the feature)
  • Features can be sparse (only some entries contain values)
  • Features can be dense (all entries contain values)

Given this, it’s important to pre-process the input data. In the wine dataset, notice that you have real values ranging from around 0.1 for some features to above 1,000 for others. While these raw magnitudes carry no meaning on their own, it is important in machine learning to scale the data down to comparable numbers.

Specifically for a Neural Network, the transfer potential is an inner product of the input data with the weights. Since every input neuron is connected to every single hidden neuron, the number of fan-ins, and hence the value reaching each hidden neuron, can get very large quickly. Even if you are not using a dot product for your transfer potential but decide to use some Radial Basis Function, the distance measure (Euclidean distance) fed to the activation functions shall be very large.

Also, if one of the features ends up having a very broad range of values (let’s say a particular attribute ranges from 10 to 10,000), then the distance will be dominated by that particular feature (either the Euclidean distance in the case of an RBF, or the vector length in the case of a dot product).

It’s important to scale down the range of all the features so that each feature contributes proportionally to the final distance (to the vector length or some other metric).

Also, note that in Neural Networks, since we use Gradient Descent to reduce the empirical error, scaling down the range helps the network converge faster.

Note that feature scaling applies to the range of values within each feature, not across features.

So, what are some of the techniques for Feature Scaling?

Rescaling the feature: Rescaling is a technique that sets the feature to a range between {-1, 1} based on the maximum value and the minimum value of the range within the feature. The rescaling function is as follows:

x' = (x - x_bar) / (x_max - x_min)

  • Here x' (x-prime) is the rescaled value
  • x_bar is the average, x_max is the maximum and x_min is the minimum in the range of possible values.

Zero-mean, unit-variance: You can also standardize your feature as a standard score by determining the mean of the distribution within the range of possible values and the standard deviation. This is typically the most popular method, also used in other machine learning techniques such as Logistic Regression or Support Vector Machines where margins for decision boundaries have to be calculated. The equation for zero-mean, unit variance is as follows:

x' = (x - μ) / σ

  • The subtraction of the mean μ from the original value (similar to rescaling) makes it zero-mean.
  • The division by the standard deviation σ makes it unit variance.

Unit Length: Another technique is to scale the feature vector to unit length (as against unit variance). Here, we normalize the feature vector so that its Euclidean length is 1. To do so, we divide each component value by the Euclidean norm of the vector. The equation is as follows:

x' = x / ||x||

  • The denominator ||x|| represents the Euclidean norm, which is the square root of the sum of squares of all component values in the vector.
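The three scaling techniques above can be sketched in a few lines of Python (illustrative only; the helper names are mine, and the sample values are the Proline column from the three rows shown earlier):

```python
import math

def rescale(xs):
    """Mean-centred rescaling into roughly {-1, 1}."""
    mean = sum(xs) / len(xs)
    return [(x - mean) / (max(xs) - min(xs)) for x in xs]

def standardize(xs):
    """Zero-mean, unit-variance (standard score)."""
    mean = sum(xs) / len(xs)
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))
    return [(x - mean) / std for x in xs]

def unit_length(xs):
    """Scale the vector so its Euclidean length is 1."""
    norm = math.sqrt(sum(x * x for x in xs))
    return [x / norm for x in xs]

proline = [1065.0, 520.0, 855.0]  # Proline values from the sample rows
print(standardize(proline))       # large raw values become small standard scores
```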

How should we normalize the wine.data?

Here is a snapshot of the maximum and minimum values of the range in our features:

In the raw data, we can notice that the Proline feature has large values above 1,000, while other features are in fractions (Hue, Malic acid, Total phenols). That is quite a spread across features. Also, the range within Proline itself runs from 278 to 1,680; the same is true for Magnesium and Color intensity.

Given this, it’s prudent that we choose zero-mean, unit-variance to normalize the raw data, giving us a standard score to work with.

Design of Network Activities

Now that we have got the type and size of the network, we need to determine the activation functions for the network. We looked at activation functions in the previous post titled “Mathematical Foundation for Activation Functions”.

From the data, we understand that this is a multiclass classification problem with 3 classes. Also, there are only 178 samples (not huge) and 13 input features (not big either).

Design of the Output Activities: For multiclass classification, we can choose softmax activation as the output layer activation function.

Why Softmax?: The output classes are mutually exclusive in our example. In other words, the wine shall be classified into one and ONLY one class, without overlap.

We need a mechanism to choose the class with the highest probability. That is, given a set of 13 input values, we need to determine the probability of the inputs belonging to class 1, class 2, or class 3.

In other words, we are stating that the outputs are mutually exclusive and collectively exhaustive. Hence:

P(A) + P(B) + P(C) = 1

Here A, B, and C represent the cultivars.

The softmax function is a squashing function: it takes a K-dimensional vector of arbitrary real values and converts it to a K-dimensional vector of real values in the range (0, 1) that sum to 1.

The softmax function is given as follows:

σ(z)_j = e^(z_j) / Σ_k e^(z_k),  for j = 1, …, K

The softmax equation can be used in any probabilistic multiclass classification including multinomial regression.
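As a sanity check, here is a small Python version of softmax (illustrative only; shifting by the maximum score before exponentiating is a standard trick to avoid overflow):

```python
import math

def softmax(zs):
    """Map arbitrary real scores to probabilities that sum to 1."""
    m = max(zs)                           # shift for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)       # the largest score gets the highest probability
print(sum(probs))  # sums to 1 (up to float rounding)
```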

Design of the Hidden Activities

The hidden activities have a choice of being a sigmoid, a hyperbolic tangent, or a ReLU.

The sigmoid and the hyperbolic tangent squash the output of the hidden activity to the ranges (0, 1) and (-1, 1) respectively. While this is useful, given that we are using zero-mean, unit-variance normalization for the inputs and a softmax (as explained) for the output activity, it does not make sense to restrict the hidden activity to a bounded range. Instead, we may want to preserve the feature differences in the input vector when they are above the zero threshold, so that the softmax gets to work with a linear scale for positive values.

A non-linear function which becomes linear past the zero threshold is the Rectified Linear Unit. The ReLU is quite easy to differentiate, and the activity itself is quite light:

f(x) = max(0, x)

Hence, we shall choose ReLU for the hidden activities.
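For completeness, ReLU and its derivative in Python (trivial, but this is what makes backpropagation through the hidden layer so cheap):

```python
def relu(x):
    """Rectified Linear Unit: f(x) = max(0, x)."""
    return max(0.0, x)

def relu_derivative(x):
    """Gradient used in backprop: 1 for x > 0, else 0."""
    return 1.0 if x > 0 else 0.0

print([relu(x) for x in [-2.0, -0.5, 0.0, 0.5, 2.0]])  # [0.0, 0.0, 0.0, 0.5, 2.0]
```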

Network Optimization Consideration

We shall keep the standard Stochastic Gradient Descent for the empirical error reduction, as explained > here

We shall also use the Negative Log Likelihood as our Cost function as explained > here

We shall also use momentum (specifically, Nesterov momentum) for dampening oscillations in the learning updates, as explained (momentum value: “alpha”) > here

So the backpropagation update, with learning rate ε and momentum value α, shall look like this:

Δw(t) = α·Δw(t-1) - ε·∂E/∂w
w(t+1) = w(t) + Δw(t)
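A minimal Python sketch of the classical momentum update for a single weight (the Nesterov variant additionally evaluates the gradient at a look-ahead point; the epsilon and alpha values mirror the ones used later in the article, and the gradients are made up purely for illustration):

```python
def momentum_step(w, velocity, grad, epsilon=0.05, alpha=0.1):
    """One SGD-with-momentum update: blend the previous delta into the new one."""
    velocity = alpha * velocity - epsilon * grad
    return w + velocity, velocity

w, v = 1.0, 0.0
for grad in [0.4, 0.3, 0.2]:  # made-up gradients
    w, v = momentum_step(w, v, grad)
print(round(w, 6), round(v, 6))  # 0.9513 -0.0117
```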

Code Showdown

Given that we have considered the pre-processing, network type, size, and optimization procedures, it’s time to get to syntax.

Note: I am using DL4J as my Neural Network framework (I have explained why I am doing so > here)

The full code for the classifier can be found here > WineClassfier.java

The code sets up a network as follows:

There are 13 input neurons {I1…I13}, 6 hidden neurons {H1…H6}, and 3 output neurons {O1…O3}. While this is not the best picture to show the entirety of the architecture, you get the idea.

This section of code sets up the optimization constants

This section of code loads the training data

I have changed the training data from the UCI site to replace the class IDs in the first column, tagged 1, 2, and 3 for the wine cultivars, with 0, 1, and 2. The meaning of the classes does not change by changing the IDs; DL4J seems to need the classes to start from 0 rather than 1. The file I am using is made available here > wine.data

Also, one important thing you must note in the code is that I am splitting the wine.data file into 2 sections (as shown in line 55).

I am using 65% of the file to train the Neural Network, and the remaining 35% as test data for validation.

Also, you can notice that in lines 60 to 63, I have normalized the raw data to zero-mean, unit variance.
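The split itself is easy to mimic outside DL4J. A Python sketch of a 65/35 shuffle-and-split over 178 rows (the seed and function name are my own; only the proportions come from the article):

```python
import random

def split(rows, train_fraction=0.65, seed=42):
    """Shuffle rows, then cut into a training and a test portion."""
    rows = rows[:]                       # don't mutate the caller's list
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

rows = list(range(178))                  # stand-in for the 178 wine.data rows
train, test = split(rows)
print(len(train), len(test))             # 115 63
```

One caveat worth noting: the standardization statistics (mean and standard deviation) should be computed on the training portion and then applied to the test portion, so the test data does not leak into the model.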

The following section of code sets up the Neural Network architecture

I have used Stochastic Gradient Descent for optimization, a learning rate Epsilon of 5% (0.05) on the error derivative, and a momentum value alpha of 10% (0.1) on the previous error delta, to be used by the backpropagation algorithm.

The hidden neuron gets a ReLU activation and the output neurons get a softmax activation. The cost function is a negative log likelihood.

The final section of the code sets up the training to run for 30 epochs with 20 iterations each (found through trial and error), the minimum number of iterations needed to teach the Neural Network how to classify the wine data.

When your code finally runs, you should read an output as follows:

Classified Wine Data on the Test Set

The final output actually uses the 35% of the wine.data file that was set aside to test the Neural Network. Notice that 13 class-0, 26 class-1, and 24 class-2 examples were used.

The good news is that every single test example used was classified correctly.

It states, “examples labeled as 1 classified by model as 1: 26 times”, which means that inputs that should be classified as class 1 were indeed classified as class 1.

Hoorah! We have trained the Neural Network on a tasteful tradition in just 600 iterations!! You guys did well so far.

(I shall update this section with definitions for Accuracy, Precision, Recall, and F1-score later. Do check back in a day.)

You can change the optimization parameters and activities to play around with the Network Model to see differing results.

Do shoot comments on any of my errata or questions you may have. Keep hacking…