Wine Quality Exploration with R

Weng Seng
Sep 3, 2019

In this post, I highlight an exploratory data analysis (EDA) in R: exploring relationships from one variable up to multiple variables, and examining distributions, outliers, and anomalies.

The full details of the project can be found here.

Table of Contents

1. Introduction
2. Uni-variate Exploration
3. Bi-variate Exploration
4. Multivariate Exploration
5. Final Plots and Summary
6. Reflection

1. Introduction

This white wine data set is publicly available for research. The details are described in [Cortez et al., 2009].

For more information about the chemical characteristics behind the features, refer to this website. It gives insight into how the chemical composition works and how it may affect the flavor of the wine.

The data set consists of 12 variables:

1. fixed acidity (tartaric acid, g / dm³)
2. volatile acidity (acetic acid, g / dm³)
3. citric acid (g / dm³)
4. residual sugar (g / dm³)
5. chlorides (sodium chloride, g / dm³)
6. free sulfur dioxide (mg / dm³)
7. total sulfur dioxide (mg / dm³)
8. density (g / cm³)
9. pH
10. sulphates (potassium sulphate, g / dm³)
11. alcohol (% by volume)
12. quality (score between 0 and 10)

2. Uni-variate Exploration

I have selected a few of the 12 variables to show as histograms. Showing all 12 would be too much here, but you can find all the plots on my GitHub.

pH histogram

The pH values range from 2.72 to 3.82, which means the white wines in the data set are quite acidic. Most have a pH around 3.2.

alcohol histogram

Overall, the alcohol histogram shows counts decreasing as the alcohol level rises from about 9% by volume.

quality histogram

The histogram looks roughly normal. No wine in the quality plot is rated at either extreme, neither perfect nor undrinkable, and most of the white wines are of average quality.
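The univariate checks described in this section can be sketched in base R as follows. This is only a minimal sketch: the `wine` data frame below is a synthetic stand-in with invented values, not the real data set, and the choice of `breaks = 20` is an assumption rather than the setting used for the plots in this post.

```r
# Synthetic stand-in for the wine data; values are invented for illustration only.
set.seed(1)
wine <- data.frame(
  pH      = round(rnorm(200, mean = 3.2, sd = 0.15), 2),
  alcohol = round(runif(200, min = 8, max = 14), 1),
  quality = sample(3:9, 200, replace = TRUE)
)

# Numeric summary of one variable: min, quartiles, mean, max.
summary(wine$pH)

# Bin counts behind a histogram, computed without drawing it.
ph_hist <- hist(wine$pH, breaks = 20, plot = FALSE)
ph_hist$counts

# Number of wines per quality score.
quality_counts <- table(wine$quality)
quality_counts

# Draw the histogram itself.
hist(wine$pH, breaks = 20, main = "pH histogram", xlab = "pH")
```

Looking at `summary()` and the bin counts alongside the picture makes it easier to state the range and the most common values precisely.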

3. Bi-variate Exploration

ggpairs plot

From the ggpairs plot, density vs sugar and density vs alcohol show strong correlations of 0.839 and -0.78 respectively, so I show scatter plots of density vs sugar and density vs alcohol next.
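The correlation figures that ggpairs prints can also be computed directly with base R's `cor()`. The sketch below assumes Pearson correlation and uses a synthetic `wine` data frame whose relationships are invented to mimic the signs described above, so the numbers it prints are not the ones from the real data.

```r
# Synthetic stand-in; the coefficients below are invented for illustration only.
set.seed(42)
n <- 300
sugar   <- runif(n, 1, 20)
alcohol <- runif(n, 8, 14)
# Toy relationship: density rises with sugar and falls with alcohol.
density <- 0.995 + 0.0004 * sugar - 0.001 * alcohol + rnorm(n, sd = 0.0005)
wine <- data.frame(residual.sugar = sugar, alcohol = alcohol, density = density)

# Pairwise Pearson correlations between all columns.
cor_matrix <- cor(wine, method = "pearson")
round(cor_matrix, 3)

# A single pair, with a significance test and confidence interval.
cor.test(wine$density, wine$alcohol)
```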

Scatter Plot: Density vs Sugar

Density vs sugar shows a positive correlation: as the sugar level increases, density also increases.

Scatter Plot: Density vs Alcohol

Density vs alcohol shows a negative correlation: as the alcohol level increases, density decreases.

Quality with pH boxplot
Quality with alcohol boxplot

From the boxplots above, a higher alcohol level tends to go with better white wine quality. The pH boxplot also shows quality improving with higher pH within a narrow pH range, but the effect is not as obvious as with alcohol.
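One way to back up what the boxplots show is to compare the median of a feature within each quality group. The following is a rough sketch on synthetic data: the upward drift of alcohol with quality is built into the toy values, so it illustrates the technique rather than reproducing the real result.

```r
# Synthetic stand-in; the link between quality and alcohol is invented.
set.seed(7)
quality <- sample(4:8, 250, replace = TRUE)
alcohol <- 8 + 0.6 * quality + rnorm(250, sd = 1)
wine <- data.frame(quality = quality, alcohol = alcohol)

# Median alcohol level within each quality score.
median_alcohol <- tapply(wine$alcohol, wine$quality, median)
median_alcohol

# The same grouping drawn as one box per quality score.
boxplot(alcohol ~ quality, data = wine,
        xlab = "quality", ylab = "alcohol (% by volume)")
```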

4. Multivariate Exploration

The quality histogram with pH above shows a roughly normal distribution for qualities 5, 6 and 7.

The quality histogram with alcohol above shows the wines concentrated in qualities 5, 6 and 7, and it is not clear how the distribution varies with quality.

Higher quality wines (quality = 7) are concentrated in the bottom right corner of the scatter plot, which represents low density and high alcohol.

Higher quality wines (quality = 7) sit below the quality 5 wines in density across the range of sugar levels in the scatter plot, which suggests that low density and low sugar level are associated with better quality.

From what I observed, alcohol and sugar level reinforce each other in their relationship with quality.
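A scatter plot of two features coloured by quality, like the ones discussed here, could be built roughly as follows. This sketch assumes ggplot2 as the plotting library (the post itself only names ggpairs) and again uses a synthetic `wine` data frame with invented values.

```r
library(ggplot2)

# Synthetic stand-in for the wine data; values are invented for illustration only.
set.seed(3)
n <- 300
wine <- data.frame(
  alcohol = runif(n, 8, 14),
  quality = sample(5:7, n, replace = TRUE)
)
wine$density <- 1.005 - 0.001 * wine$alcohol + rnorm(n, sd = 0.0005)

# Treat quality as a category so each score gets its own discrete colour.
wine$quality <- factor(wine$quality)

p <- ggplot(wine, aes(x = alcohol, y = density, colour = quality)) +
  geom_point(alpha = 0.6) +
  labs(x = "alcohol (% by volume)", y = "density (g / cm³)")
print(p)
```

Restricting the plot to a few quality levels and using some transparency keeps overlapping points readable.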

5. Final Plots and Summary

In this plot, density shows a negative correlation with alcohol, and the high quality white wines are low in density and high in alcohol.

The plot shows a positive correlation between residual.sugar and density. Better quality wines are low in both sugar level and density.

6. Reflection

In the uni-variate plots, I managed to plot the histograms without any difficulty. However, I did face some problems when plotting the features against quality, because quality is originally stored as an integer. Once I realized this, I converted it to a factor.

I believe multivariate plots are better at showing how the relationships between different features affect quality.
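The integer-to-factor conversion mentioned above looks roughly like this in base R. The scores in the vector are hypothetical and chosen only to show the behaviour.

```r
# Hypothetical quality scores, stored as integers.
quality <- c(5L, 6L, 6L, 7L, 5L, 8L)
class(quality)        # "integer"

# Convert to a factor so plotting functions treat each score as a group.
quality_f <- factor(quality)
class(quality_f)      # "factor"
levels(quality_f)     # "5" "6" "7" "8"

# Caution: as.integer() on a factor returns the level codes, not the scores.
as.integer(quality_f) # 1 2 2 3 1 4

# Convert via character to recover the original scores.
scores <- as.integer(as.character(quality_f))
scores                # 5 6 6 7 5 8
```

With quality as a factor, grouped plots such as boxplots draw one group per score instead of treating quality as a continuous axis.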
