Wine Quality Prediction Using Machine Learning
I love everything that's old: old friends, old times, old manners, old books, old wine. — Oliver Goldsmith
Having read that, let us start with our short Machine Learning project on wine quality prediction using scikit-learn’s Decision Tree Classifier.
First of all, we need to install a few packages that will come in handy while building and running our code. Run the following commands in a terminal (or Command Prompt, if you are using Windows).
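The commands below are the standard pip installs for these packages (this assumes Python and pip are already on your machine; on some systems the command is pip3):

```shell
pip install numpy
pip install pandas
pip install scikit-learn
```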
numpy will be used for numerical computation, pandas for working with file formats like CSV and XLS, and sklearn (scikit-learn) to import the classifier we will use for prediction.
We are now done with our requirements; let's start writing some awesome, magical code for the predictor we are going to build.
Let’s start with importing the required modules.
You may already be familiar with numpy and pandas (described above). The third import,
from sklearn.model_selection import train_test_split, is used to split our dataset into training and testing data, more of which will be covered later. The next import,
from sklearn import preprocessing, is used to preprocess the data before fitting it into the predictor, scaling the features to a small, comparable range that machine learning algorithms handle well. The last import,
from sklearn import tree, brings in the decision tree classifier we will be using for prediction.
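Put together, the import section looks like this (a minimal sketch using the modules described above):

```python
import numpy as np
import pandas as pd

# Split the dataset into training and testing portions
from sklearn.model_selection import train_test_split

# Scale features before fitting the predictor
from sklearn import preprocessing

# The decision tree classifier we will train
from sklearn import tree
```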
The very next step is importing the data we will be using. For this project, we will be using the Wine Dataset from UC Irvine Machine Learning Repository.
We use the pd.read_csv() function from pandas to import the data, passing the dataset URL from the repository. Notice that ';' (semicolon) is used as the separator, since that is the delimiter this CSV file uses.
Now we have to analyse the dataset. First, let's look at what is inside by printing the first five rows with the head() method.
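A sketch of the loading step. The URL below is the red-wine variant hosted on the UCI repository (the repository also hosts a white-wine file); since fetching it needs network access, the runnable part below parses a two-row inline sample in the same ';'-separated layout:

```python
import io
import pandas as pd

# Red-wine variant of the dataset on the UCI repository:
url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "wine-quality/winequality-red.csv")
# df = pd.read_csv(url, sep=';')  # requires network access

# Offline stand-in: the first two rows of the file, same ';' separator
sample = """fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality
7.4;0.7;0.0;1.9;0.076;11.0;34.0;0.9978;3.51;0.56;9.4;5
7.8;0.88;0.0;2.6;0.098;25.0;67.0;0.9968;3.2;0.68;9.8;5
"""
df = pd.read_csv(io.StringIO(sample), sep=';')
print(df.head())
```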
We see a bunch of columns with some values in them. In every machine learning program there are two things: features and labels. Features are the columns of a dataset used to predict the label, and the label is the value we want the model to output. After the model has been trained, we give it features so that it can predict the labels.
So, if we analyse this dataset: since we have to predict the wine quality, the attribute
quality becomes our label and the rest of the attributes become the features.
Our next step is to separate the features and labels into two different dataframes.
We stored the quality column in
y, the common symbol used to represent labels in machine learning, then dropped
quality and stored the remaining features in
X, the equally common symbol for features in ML.
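The separation step can be sketched like this (using an inline two-row stand-in for the downloaded data):

```python
import io
import pandas as pd

# Two ';'-separated rows in the shape of the UCI red-wine data
sample = """fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality
7.4;0.7;0.0;1.9;0.076;11.0;34.0;0.9978;3.51;0.56;9.4;5
7.8;0.88;0.0;2.6;0.098;25.0;67.0;0.9968;3.2;0.68;9.8;5
"""
df = pd.read_csv(io.StringIO(sample), sep=';')

y = df['quality']                 # label
X = df.drop('quality', axis=1)    # features: everything except quality
```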
Next, we have to split our dataset into training and test data. We will use the training data to train our model to predict the quality; the test data will then be used to verify the values the model predicts.
We used the
train_test_split() function that we imported from sklearn to split the data. Notice that
test_size=0.2 makes the test data 20% of the original data; the remaining 80% is used for training.
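A sketch of the split, using a synthetic stand-in for the wine data (100 rows, invented column names) so the 80/20 proportions are easy to see:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows of random features and integer quality labels
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.random((100, 3)), columns=['acidity', 'sulphates', 'alcohol'])
y = pd.Series(rng.integers(3, 9, size=100), name='quality')

# test_size=0.2 -> 20 rows for testing, 80 for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print(len(X_train), len(X_test))  # 80 20
```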
Now let's print the first five rows of the data we have split, again using head().
After obtaining the data we will be using, the next step is data normalization. It is a part of preprocessing in which the features are scaled so that they lie in a small, comparable range, roughly -1 to 1. These are simply values that a machine learning algorithm handles more easily than raw features on wildly different scales.
You can observe that the values of all the training attributes now lie roughly between -1 and 1, which is exactly what we were aiming for.
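One way to do this with the preprocessing module is preprocessing.scale, which standardizes each column to zero mean and unit variance (so most values land near the -1 to 1 range the text describes). A sketch on three rows of toy features:

```python
import numpy as np
from sklearn import preprocessing

# Three rows of raw features on very different scales
X_train = np.array([[7.4, 0.70, 9.4],
                    [7.8, 0.88, 9.8],
                    [11.2, 0.28, 9.8]])

# Standardize each column: subtract the mean, divide by the std deviation
X_train_scaled = preprocessing.scale(X_train)
print(X_train_scaled.mean(axis=0))  # ~[0, 0, 0]
print(X_train_scaled.std(axis=0))   # ~[1, 1, 1]
```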
Time has now come for the most exciting step: training our algorithm so that it can predict the wine quality. We do so by creating a
DecisionTreeClassifier() and using
fit() to train it.
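The training step itself is two lines; here it is on a few toy rows standing in for the scaled wine features and their quality labels:

```python
from sklearn import tree

# Toy training data standing in for the wine features and quality labels
X_train = [[7.4, 0.70], [7.8, 0.88], [11.2, 0.28], [7.9, 0.60]]
y_train = [5, 5, 6, 5]

clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)  # learn a tree mapping features to quality
```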
The next step is to check how well your algorithm predicts the label (in this case, wine quality). This can be done using the score() method, which returns the mean accuracy on the test data.
The confidence score:
This score can change from run to run depending on the size of your dataset and on how the data is shuffled when we split it into training and test sets, but you can usually expect it to stay within about ±5 percentage points of your first result.
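A sketch of the scoring step end to end, on synthetic data with a deliberately learnable rule (quality is 6 when the first feature exceeds 0.5, otherwise 5) so the tree has something real to pick up:

```python
import numpy as np
from sklearn import tree
from sklearn.model_selection import train_test_split

# Synthetic data: 200 rows, 4 features, label determined by feature 0
rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = np.where(X[:, 0] > 0.5, 6, 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

clf = tree.DecisionTreeClassifier().fit(X_train, y_train)

# Mean accuracy on the held-out 20%
confidence = clf.score(X_test, y_test)
print(confidence)
```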
Now we are almost at the end of our program, with only two steps left. The first is prediction. Now that we have trained our classifier on the features, we obtain the predicted labels using
y_pred = clf.predict(X_test)
Our predicted values are stored in
y_pred, but there are far too many entries to compare one by one against the expected labels we stored in
y_test. So we will just take the first five entries of both, print them, and compare.
Don't be intimidated, we did nothing magical there. We just converted
y_pred from a numpy array to a list, so that we can compare values with ease. Then we printed the first five elements of that list using a
for loop. And finally, we printed the first five values we were expecting, which were stored in
y_test, using the head() function. The output looks something like this.
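The comparison steps above can be sketched like this, with small toy arrays standing in for the real y_pred (a numpy array from clf.predict()) and y_test (a pandas Series):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the model output and the expected labels
y_pred = np.array([5, 6, 6, 5, 7, 6, 5])
y_test = pd.Series([5, 6, 7, 5, 7, 6, 5])

predicted = list(y_pred)        # numpy array -> plain Python list
for value in predicted[:5]:     # print the first five predictions
    print(value)

print(y_test.head())            # first five expected labels
```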
Notice that almost all of the predicted values match the expectations. Our predictor was wrong just once, predicting 7 as 6, which gives an accuracy of 80% on these 5 examples. Of course, over the full test set the accuracy settles lower, at 0.621875 (62.1875%), but that is still a reasonable result for a simple decision tree on this multi-class problem, and comfortably better than random guessing.
Unfortunately, our rollercoaster ride of tasting wine has come to an end. But stay tuned to click-bait for more such rides in the world of Machine Learning, Neural Networks and Deep Learning.