Wine Quality Prediction with Machine Learning Model

Muhammad Arief Rachman

8 min read · Jul 14, 2023

Business Problem/Research Background

Predicting wine quality accurately is a challenge in the wine industry. Implementing a machine learning model can improve the efficiency and accuracy of assessing wine quality. It enables companies to make informed decisions about production, sales, and marketing strategies.
Furthermore, consumers and wine professionals can benefit from a predictive model that provides insights into wine quality before making purchasing decisions.
In summary, using machine learning for wine quality prediction enhances operational efficiency, provides objective information to consumers, and supports decision-making in the wine industry.

Business Objective

Predict whether a wine is of good or bad quality based on factors such as chemical composition and other relevant attributes. This objective aims to provide an objective measure of wine quality that helps stakeholders differentiate between wines that meet high-quality standards and those that fall below expectations.

Business Metrics

Measure the accuracy and consistency of the model’s predictions across different wines. This metric indicates the reliability of the model in consistently assessing wine quality and can help ensure that wines are classified correctly.

Dataset

The dataset consists of 1,599 rows and 12 columns. All feature columns are floats, and the quality column is an integer.

Looking at the different features of wine, there are 12 columns in total: 11 physicochemical features plus the final quality score.

Fixed Acidity: non-volatile acids that do not evaporate readily
Volatile Acidity: the amount of acetic acid in the wine; high levels lead to an unpleasant vinegar taste
Citric Acid: acts as a preservative and increases acidity. In small quantities, it adds freshness and flavor to wines
Residual Sugar: the amount of sugar remaining after fermentation stops. The key is a good balance between sweetness and sourness. Wines with more than 45 g/L are considered sweet
Chlorides: the amount of salt in the wine
Free Sulfur Dioxide: prevents microbial growth and the oxidation of wine
Total Sulfur Dioxide: the amount of free plus bound forms of SO2
Density: sweeter wines have a higher density
pH: describes the level of acidity on a scale of 0–14. Most wines fall between 3 and 4 on the pH scale
Alcohol: the alcohol content of the wine (percent by volume)
Sulphates: a wine additive that contributes to SO2 levels and acts as an antimicrobial and antioxidant
Quality: the output variable (the target to predict)

Exploratory Data Analysis

1. Checking for Null or Missing values

# Check for null or missing values
df.isnull().sum()

output :
fixed acidity           0
volatile acidity        0
citric acid             0
residual sugar          0
chlorides               0
free sulfur dioxide     0
total sulfur dioxide    0
density                 0
pH                      0
sulphates               0
alcohol                 0
quality                 0
dtype: int64

There are no missing values, so the dataset can be processed as is.

2. Data Visualization

Distribution of the target variable quality

Correlation between the features of the red wine dataset

Insights from the figure above:

Alcohol is positively correlated with the quality of the red wine.
Alcohol has a weak positive correlation with the pH value.
Citric acid and density have a strong positive correlation with fixed acidity.
pH has a negative correlation with density, fixed acidity, citric acid, and sulphates.
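As a rough sketch of how such insights can be read off numerically, the snippet below computes a Pearson correlation matrix and ranks the features by their correlation with the target. The data here is a small synthetic stand-in (not the wine dataset), and the `noise` column is a hypothetical feature added only for contrast; on the real data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (NOT the wine dataset): 'alcohol' is built to
# correlate positively with 'quality', while 'noise' is unrelated to it.
rng = np.random.default_rng(0)
quality = rng.normal(size=500)
demo = pd.DataFrame({
    "alcohol": quality + rng.normal(scale=1.0, size=500),
    "noise": rng.normal(size=500),
    "quality": quality,
})

# Pearson correlation matrix; this is the table a correlation heatmap displays.
corr = demo.corr()

# Rank the features by their correlation with the target column.
target_corr = corr["quality"].drop("quality").sort_values(ascending=False)
print(target_corr)
```

Sorting the target column of the matrix is a quick way to see which features are most related to quality before looking at the full heatmap.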

Preprocessing Data

1. Binarizing the Target

Convert the quality score into a binary label

According to the Red Wine Quality dataset description, a cutoff of 7 separates the two classes:

quality >= 7 is “good”
quality < 7 is “bad”

df['quality'] = df['quality'].apply(lambda x: 1 if x >= 7 else 0)

sns.countplot(data = df, x = 'quality')
plt.xticks([0,1], ['bad wine','good wine'])
plt.title("Types of Wine")
plt.show()

From the visualization above, we can see that the dataset is imbalanced: there are far fewer good wines than bad wines.

2. Resampling Dataset

For a skewed or imbalanced dataset, we can resample the data with the Synthetic Minority Oversampling Technique (SMOTE) to balance the classes.

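from imblearn.over_sampling import SMOTE

# Seed for the steps that need a random state
random_value = 1000

# Use all 11 feature columns; this matches the 11-column shapes reported after the split
X = df.drop(columns=['quality'])
y = df.quality

# Oversample the minority class with SMOTE
oversample = SMOTE()
X_ros, y_ros = oversample.fit_resample(X, y)

# Plot the class counts after resampling
sns.countplot(x=y_ros)
plt.xticks([0,1], ['bad wine','good wine'])
plt.title("Types of Wine")
plt.show()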
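To give an intuition for what SMOTE does, here is a simplified sketch of its core idea: each synthetic sample is placed at a random position on the line segment between a minority sample and one of its nearest minority neighbours. This is an illustration only, not the imblearn implementation; the function name `smote_like_samples` and the toy points are hypothetical.

```python
import numpy as np

def smote_like_samples(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority
    samples and their k nearest minority neighbours (simplified SMOTE)."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)

    # Pairwise Euclidean distances within the minority class.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    k = min(k, len(X_min) - 1)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a random base point and one of its neighbours for each new sample.
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]

    # Random position along the segment between base point and neighbour.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy minority class: the four corners of the unit square.
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_like_samples(minority, n_new=6, k=2)
print(synthetic.shape)  # (6, 2)
```

Because the new points are interpolated between existing minority samples, they stay inside the region already covered by the minority class instead of being exact duplicates.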

3. Split dataset into train and test

# split dataset to train and test variable 
# use test size of 20% of the data proportion
X_train, X_test, y_train, y_test = train_test_split(X_ros, y_ros, test_size=0.2, random_state=random_value)
X_train.shape, X_test.shape

output:
((2211, 11), (553, 11))

4. Scale dataset with StandardScaler

# scale with StandardScaler
scaler = StandardScaler()

# fit to data training
scaler.fit(X_train)

# transform
x_train = scaler.transform(X_train)
x_test = scaler.transform(X_test)
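As a small worked example of what this scaling step computes, the sketch below standardizes toy data with plain NumPy: the mean and standard deviation come from the training rows only and are then reused for the test rows. The arrays are made-up values for illustration, not wine data.

```python
import numpy as np

# Toy data for illustration only.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

# Statistics are estimated on the training data only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# The same statistics are applied to both sets: z = (x - mean) / std
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled.mean(axis=0))  # approximately zero per column
print(test_scaled)
```

Fitting on the training data and only transforming the test data keeps information about the test set out of the preprocessing step.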

Training Model

After completing the data preprocessing steps, we can now train our model. We will focus on using the random forest algorithm for this. Random forest is a popular and powerful machine learning technique that combines multiple decision trees to make predictions. It is capable of handling different types of data and can capture complex patterns in the wine attributes. During training, the model learns from the data and adjusts its internal parameters to make accurate predictions. Once trained, the model can be evaluated and deployed to predict the quality of wines based on their attributes.

Random Forest

# Random Forest classifier initialization
rfc = RandomForestClassifier(n_estimators=100, random_state=random_value)

# Cross-validation on the training set
rf_score = cross_val_score(estimator=rfc,
                           X=x_train, y=y_train,
                           scoring='recall', cv=10,
                           verbose=3, n_jobs=-1)

# Fit on the training data
rfc.fit(x_train, y_train)

# Predict on the test data
y_pred = rfc.predict(x_test)

print('Average Recall score', np.mean(rf_score))
print('Test Recall score', recall_score(y_test, y_pred))

output:
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done   3 out of  10 | elapsed:    0.8s remaining:    1.9s
[Parallel(n_jobs=-1)]: Done   7 out of  10 | elapsed:    0.8s remaining:    0.3s
[Parallel(n_jobs=-1)]: Done  10 out of  10 | elapsed:    1.3s finished
Average Recall score 0.9638738738738739
Test Recall score 0.9708029197080292

The average cross-validation recall (0.964) and the test recall (0.971) are close to each other, which suggests the model generalizes well to unseen data and predicts the target variable reliably.

# Confusion Matrix
conf_mat = confusion_matrix(y_test, y_pred)

# Heatmap of the confusion matrix (integer counts)
sns.heatmap(conf_mat, cmap = 'Reds', annot = True, fmt='d')
plt.title('Confusion Matrix of Random Forest Predictions')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()

Model prediction:

Of the 279 bad wines (class ‘0’) in the test set, the model classified 253 correctly and misclassified 26 as good.
Of the 274 good wines (class ‘1’), it classified 266 correctly and misclassified 8 as bad.
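As a quick check, the test recall can be recomputed by hand from these counts. The sketch below reads them as 253 and 26 for the actual bad wines and 8 and 266 for the actual good wines, which is the reading consistent with the reported test recall; precision and accuracy are added only for context and are not scores reported in this article.

```python
# Counts from the confusion matrix of the test set.
tn, fp = 253, 26   # actual bad wines: correctly rejected / wrongly called good
fn, tp = 8, 266    # actual good wines: wrongly called bad / correctly found

# Recall: share of the actual good wines that the model found.
recall = tp / (tp + fn)
# Precision: share of the predicted good wines that really are good.
precision = tp / (tp + fp)
# Accuracy: share of all test wines classified correctly.
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"recall={recall:.4f} precision={precision:.4f} accuracy={accuracy:.4f}")
```

The recall of 266 / 274 matches the test recall score of about 0.9708 printed earlier.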

Hyperparameter Tuning

# Grid parameters
rf_grid = {
                'n_estimators': [50, 100, 200],
                'max_depth': [3, 5, 10],
                'min_samples_split': [2, 5, 10]
                }
                
# Use RandomizedSearchCV
rf_cv = RandomizedSearchCV(estimator=rfc, param_distributions=rf_grid,
                            scoring='recall', cv=10)

# Fit the search on the scaled training data
rf_cv.fit(x_train, y_train)

# Best score
print(f'Best score: {rf_cv.best_score_}')
print(f'Best params: {rf_cv.best_params_}')

output:
Best score: 0.9701801801801802
Best params: {'n_estimators': 100, 'min_samples_split': 5, 'max_depth': 10}
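To make the size of this search concrete, here is a small sketch that enumerates the grid with the standard library. The grid has 3 × 3 × 3 = 27 combinations, and RandomizedSearchCV evaluates only a random sample of them (its default is n_iter=10), so with cv=10 it fits 100 models instead of 270. The sampling below only mimics that idea and is not how scikit-learn draws its candidates internally.

```python
import itertools
import random

rf_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 10],
    "min_samples_split": [2, 5, 10],
}

# Every combination an exhaustive grid search would evaluate.
all_combos = [dict(zip(rf_grid, values))
              for values in itertools.product(*rf_grid.values())]
print(len(all_combos))  # 27

# A randomized search evaluates only a sample of the combinations.
random.seed(0)
sampled = random.sample(all_combos, k=10)
print(len(sampled))  # 10
```

Since only part of the grid is tried, the reported best parameters are the best among the sampled candidates, not necessarily the best of all 27.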

Compare Score

# Random Forest classifier with the tuned hyperparameters
rf_tuned = RandomForestClassifier(**rf_cv.best_params_, random_state=random_value)

# Cross-validation on the training set
rf_tuned_score = cross_val_score(estimator=rf_tuned,
                                 X=x_train, y=y_train,
                                 scoring='recall', cv=10,
                                 verbose=0)

# Fit on the training data
rf_tuned.fit(x_train, y_train)

# Predict on the test data
y_pred_tuned = rf_tuned.predict(x_test)

# Check scores
print('Average Recall score', np.mean(rf_score))
print('Test Recall score', recall_score(y_test, y_pred))
print('Average Recall score Tuning', np.mean(rf_tuned_score))
print('Test Recall score Tuning', recall_score(y_test, y_pred_tuned))

output:
Average Recall score 0.9638738738738739
Test Recall score 0.9708029197080292
Average Recall score Tuning 0.9701801801801802
Test Recall score Tuning 0.9708029197080292

# Save Random Forest model
import pickle

model_name = 'rf_model.pkl'
model_path = '../src/Model/{}'.format(model_name)

# Save the trained model
with open(model_path, 'wb') as f:
    pickle.dump(rf_tuned, f)
print('Model saved as {}'.format(model_name))

From the obtained results, tuning slightly improved the average cross-validation recall (from 0.964 to 0.970), while the test recall stayed the same at 0.971.

The final step in the wine quality prediction process is to save the trained model as a file, such as a pickle file. This allows for easy deployment and future use without the need to retrain the model. Once saved, the model can be integrated into applications or systems where it can accept new input data about wines. The deployed model will then utilize the saved file to make accurate predictions on the quality of wines based on the provided input. This streamlined deployment process ensures efficient and reliable wine quality predictions for users and stakeholders.
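The save-and-reload round trip can be sketched as follows. This is a minimal illustration under assumptions: the model is a tiny random forest trained on synthetic data (not the wine data), and it is written to a temporary directory rather than the project path.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny synthetic stand-in for the trained model (NOT the wine data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Save the model, then load it back as a deployed service would at start-up.
model_path = os.path.join(tempfile.mkdtemp(), "rf_model.pkl")
with open(model_path, "wb") as f:
    pickle.dump(model, f)
with open(model_path, "rb") as f:
    loaded = pickle.load(f)

# The loaded model reproduces the predictions of the original model.
same = np.array_equal(model.predict(X_demo), loaded.predict(X_demo))
print(same)
```

Because the model was trained on standardized features, the fitted scaler would need to be saved and applied in the same way before new inputs are passed to the loaded model.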

Demo on Postman & Live App

Here’s a guide to deploying the wine quality prediction model using FastAPI, Docker, Streamlit, and EC2, and testing it with Postman:

Set up an EC2 instance: Create and configure an EC2 instance on AWS to host your application. Make sure you have the necessary permissions and security groups configured.
Install Docker on the EC2 instance: Follow the Docker installation instructions to set up Docker on the EC2 instance.
Containerize the model with FastAPI and Streamlit: Build a Docker image that includes both the FastAPI application and the Streamlit interface. Use a Dockerfile to specify the dependencies and configuration for your application.
Run the Docker container: Start the Docker container on the EC2 instance, exposing the appropriate ports for communication. Verify that the FastAPI application and Streamlit interface are running correctly within the container.
Test the application with Postman: Use Postman to send HTTP requests to the FastAPI endpoints on the EC2 instance. Input the desired wine parameter values and observe the response with the predicted wine quality output.
Access the Streamlit interface: Access the Streamlit interface by opening the appropriate port on the EC2 instance in your web browser. Ensure that you can interact with the interface and input wine parameters to view the predicted wine quality.
By following these steps, you can deploy the wine quality prediction model using FastAPI and Streamlit within a Docker container on an EC2 instance. You can then use Postman to test the FastAPI endpoints and interact with the Streamlit interface. This setup allows multiple users to access the API and interface, providing predictions and a user-friendly way to input wine parameters and view the predicted quality.

Remember to handle security measures, such as securing your endpoints, managing access control, and using HTTPS when exposing APIs.
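For orientation, a prediction endpoint along these lines might look like the sketch below. It is not the code of the deployed app: the `/predict` route, the helper names `to_row`, `predict_label`, and `create_app`, and the payload keys are assumptions for illustration. It also assumes that the fitted scaler is available next to the model, and that the feature order matches the column order used during training.

```python
from typing import Dict, List

# Assumed feature order; it must match the columns used to train the model.
FEATURES: List[str] = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide",
    "density", "pH", "sulphates", "alcohol",
]

def to_row(payload: Dict[str, float]) -> List[float]:
    """Validate a JSON-like payload and order its values like the training columns."""
    missing = [name for name in FEATURES if name not in payload]
    if missing:
        raise ValueError(f"missing features: {missing}")
    return [float(payload[name]) for name in FEATURES]

def predict_label(model, scaler, payload: Dict[str, float]) -> str:
    """Scale one wine sample and map the predicted class to a readable label."""
    row = scaler.transform([to_row(payload)])
    return "good wine" if int(model.predict(row)[0]) == 1 else "bad wine"

def create_app(model, scaler):
    """Build the FastAPI app; FastAPI is imported lazily so the helpers above
    can be used and tested without it."""
    from fastapi import FastAPI, HTTPException

    app = FastAPI()

    @app.post("/predict")
    def predict(payload: Dict[str, float]):
        try:
            return {"quality": predict_label(model, scaler, payload)}
        except ValueError as exc:
            # Report incomplete input as a client error.
            raise HTTPException(status_code=422, detail=str(exc))

    return app
```

With an app built this way and served by an ASGI server such as uvicorn, a Postman request would send a JSON body with the 11 feature values to the predict route and receive the label in the response.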

Here’s an example of testing the API using Postman with the “predict” method.

The live ML app is deployed on AWS EC2. You can access it at http://54.255.220.105:8501. Feel free to use it and share it with your friends.

Conclusion and Future Works

We were able to build a fairly good model, reaching a recall of about 0.97 on the test set.
Due to limited time, I only built the end-to-end machine learning pipeline with a random forest. A useful next step is to implement other candidate models in the same way and compare their performance.
Get more data so the model can learn more and make better predictions.
Apply PCA to reduce the number of features and make the model leaner.
If you have time, set up automation with GitHub Actions to make the workflow more convenient.

Dataset:

Red Wine Quality (www.kaggle.com): a simple and clean practice dataset for regression or classification modelling

Source Code on GitHub: https://github.com/ariprachmaan/winepredict.git