Red Wine Quality Prediction using Classification Model

Raksha Srinivasan
Aug 7, 2021


Wine quality classification is a difficult task because taste is the least understood of the human senses. A good wine quality prediction can be very useful in the certification process. This project aims to determine which features are the best indicators of red wine quality and to generate insights into each of these attributes.

Description of Dataset

The dataset contains the following attributes:

  1. Fixed Acidity: These are non-volatile acids that do not evaporate readily.
  2. Volatile Acidity: The amount of acetic acid present in the wine.
  3. Citric Acid: It adds ‘Freshness’ and flavor to wines.
  4. Residual Sugar: Amount of sugar left after fermentation.
  5. Chlorides: The amount of salt in the wine.
  6. Free Sulfur Dioxide: SO2 prevents microbial growth and the oxidation of wine.
  7. Total Sulfur Dioxide: Total SO2 becomes evident in the nose and taste of wine.
  8. Density: The density of the wine, which depends on the percentage of alcohol and sugar content.
  9. pH: It describes the level of acidity on a scale from 0–14. Most wines fall between 3–4 on the pH scale.
  10. Sulphates: A wine additive that contributes to SO2 levels and acts as an antimicrobial and antioxidant.
  11. Alcohol: The percentage of alcohol content in the wine.
  12. Quality: The output variable, that is, the target to be predicted.

Importing the Modules

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb

Let's look at each library. Pandas is used for data manipulation and analysis. NumPy provides the n-dimensional array object and the numerical routines that operate on it. Matplotlib and seaborn are both used for visualization; seaborn is built on top of Matplotlib.

Loading the Data

data=pd.read_csv("C:/Users/raksh/OneDrive/Documents/Desktop/Datasets/winequality_red.csv")

data
data.head()
data.info()
data.describe()
Wine data
data.head(): prints the first n rows of the data (5 by default)
data.info(): prints a concise summary of the data
data.describe(): prints summary statistics of the numeric columns
data.isnull().sum()

isnull().sum() returns the number of missing values in each column of the data.
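As a minimal sketch of what this check reports, here is the same call on a tiny made-up DataFrame. The values and the name `toy` are invented for illustration and do not come from the wine data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two deliberately missing entries
toy = pd.DataFrame({
    "alcohol": [9.4, np.nan, 10.5],
    "pH": [3.51, 3.20, np.nan],
    "quality": [5, 6, 7],
})

# isnull() marks each missing cell as True; sum() counts them per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```

A column with a count of 0 has no missing values, so no imputation or row dropping is needed for it.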

Exploratory Data Analysis

Exploratory data analysis uses graphical representations of the data to uncover patterns and insights. For this wine data, several graphs are plotted to see how the features relate to the quality of the wine.

data.plot(x='alcohol',y='quality',style='.',color='r')
plt.title('alcohol vs quality')
plt.xlabel('alcohol')
plt.ylabel('quality')
plt.grid()
plt.show()
data.plot(x='free sulfur dioxide',y='total sulfur dioxide', style='.',color='g')
plt.title('free sulfur dioxide vs total sulfur dioxide')
plt.xlabel('free sulfur dioxide')
plt.ylabel('total sulfur dioxide')
plt.grid()
plt.show()
data.hist(figsize=(10,15))
plt.show()
These histograms show how the values of each feature are distributed.
sb.catplot(data=data, kind="bar",x="quality",y="citric acid",palette="pastel",alpha=.5, height=5)
This plot shows how the average citric acid level changes with quality.
sb.countplot(x="quality", data=data, palette="pastel")
data["quality"].value_counts()
sb.boxplot(x="quality",y="citric acid",data=data)
sb.boxplot(x="quality",y="fixed acidity",data=data)

Correlation

data.corr()
sb.heatmap(data.corr())
data['winequality']=[1 if x>=6 else 0 for x in data['quality']]
X=data.drop(['quality','winequality'],axis = 1)
y=data['winequality']
data
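The list comprehension above turns the quality score into a binary label: 1 (good) when quality is 6 or higher, 0 otherwise. As a hedged sketch on a small invented Series (the scores below are not from the dataset), the same labels can be produced with a vectorized comparison, and value_counts() then shows how balanced the two classes are.

```python
import pandas as pd

# Invented quality scores, for illustration only
quality = pd.Series([3, 5, 6, 7, 5, 8], name="quality")

# Same rule as in the article: 1 if quality >= 6, else 0
label_loop = [1 if x >= 6 else 0 for x in quality]

# Equivalent vectorized form
label_vec = (quality >= 6).astype(int)

print(label_vec.tolist())
print(label_vec.value_counts())
```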

Splitting the Data

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.2,random_state=0)

We use train_test_split() from sklearn.model_selection to split the data into two subsets: training data and testing data. For this wine data, 80% of the rows are used for training and 20% for testing. Setting random_state makes the split reproducible.
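To make the 80/20 split concrete, here is a small sketch on synthetic arrays. The names `X_demo` and `y_demo` and their contents are assumptions made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 2 features, binary labels
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0, 1] * 50)

# test_size=0.2 holds out 20% of the rows; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)
```

With 100 rows, 80 go to the training set and 20 to the test set, and no row appears in both.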

Normalizing the data

from sklearn.preprocessing import MinMaxScaler
mm=MinMaxScaler()
fit=mm.fit(X_train)
X_train=fit.transform(X_train)
X_test=fit.transform(X_test)

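We normalize because the features are measured on very different scales, so each one is rescaled to the range 0 to 1. MinMaxScaler() transforms the features by scaling each feature to a given range. The scaler is fit on the training data only and then applied to both the training and the test data, so no information from the test set leaks into the preprocessing.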
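The scaling rule behind MinMaxScaler with its default range is (x - min) / (max - min), computed per feature from the data it was fit on. The sketch below checks that on invented numbers; the arrays `train` and `test` are assumptions for illustration only.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Invented values for two features on different scales
train = np.array([[8.0, 0.2], [10.0, 0.6], [12.0, 1.0]])
test = np.array([[9.0, 0.4], [13.0, 0.2]])

# Fit on the training data only, then transform both sets
scaler = MinMaxScaler().fit(train)
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)

# Manual version of the same formula, using the training min and max
col_min = train.min(axis=0)
col_max = train.max(axis=0)
manual = (train - col_min) / (col_max - col_min)

print(np.allclose(train_scaled, manual))
print(test_scaled)
```

Because the minimum and maximum come from the training data, a test value outside the training range (13.0 here) maps outside the 0 to 1 interval.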

Evaluating the Model

A Random Forest Classifier is the model used here to predict wine quality. It trains relatively quickly and often gives good accuracy even on large datasets, which makes it a reasonable first choice for this problem.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

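# Fix the seed so the results are reproducible
random=RandomForestClassifier(random_state=0)
fit=random.fit(X_train,y_train)

# Mean accuracy on the held-out test set
score=random.score(X_test,y_test)
print('The score of the model is : ',score)
predict=random.predict(X_test)
# classification_report expects the true labels first, then the predictions
print(classification_report(y_test,predict))
# Use a new name so the wine DataFrame stored in `data` is not overwritten
results={'Actual values': y_test,'predicted values': predict}
pd.DataFrame(results)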
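Since the project also asks which features matter most, one possible follow-up is to inspect the fitted forest's feature_importances_ alongside a confusion matrix. The sketch below is not the author's code: it runs on synthetic data from make_classification, and every name in it (`X_syn`, `y_syn`, `model`) is an assumption made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data: 11 features, binary target
X_syn, y_syn = make_classification(
    n_samples=500, n_features=11, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)
pred = model.predict(X_te)

# Accuracy and the confusion matrix (rows: true class, columns: predicted)
acc = accuracy_score(y_te, pred)
cm = confusion_matrix(y_te, pred)
print("accuracy:", acc)
print(cm)

# Impurity-based importances: one value per feature, summing to 1
importances = model.feature_importances_
for i, value in enumerate(importances):
    print(f"feature {i}: {value:.3f}")
```

On the real data, pairing these importances with the column names of X would give a rough ranking of which attributes the forest relies on most.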

In conclusion, most of the predicted values match the actual values, and the model reaches an accuracy of about 80%. A good wine is identified by its smell, its taste, and the balance of its components. By predicting wine quality with an ML technique, we can support the industry's certification process by classifying a wine as good or bad quality.

Link to the repository: https://github.com/Raksha-Srinivasan/Red_Wine_Quality_Prediction
