In this post, I walk through an exploratory data analysis (EDA) in R, moving from single variables to relationships among multiple variables, and looking at distributions, outliers, and anomalies along the way.
The full details of the project can be found here.
Table of Contents
1. Introduction
2. Uni-variate Exploration
3. Bi-variate Exploration
4. Multivariate Exploration
5. Final Plots and Summary
6. Reflection
1. Introduction
This white wine data set is publicly available for research. The details are described in [Cortez et al., 2009].
For more information about the chemical properties behind each feature, refer to this website. It gives some insight into how the chemical composition works and how it may affect the flavor of the wine.
The data set consists of 12 variables:
1. fixed acidity (tartaric acid, g / dm³)
2. volatile acidity (acetic acid, g / dm³)
3. citric acid (g / dm³)
4. residual sugar (g / dm³)
5. chlorides (sodium chloride, g / dm³)
6. free sulfur dioxide (mg / dm³)
7. total sulfur dioxide (mg / dm³)
8. density (g / cm³)
9. pH
10. sulphates (potassium sulphate, g / dm³)
11. alcohol (% by volume)
12. quality (score between 0 and 10)
2. Uni-variate Exploration
I will show histograms for only a few of the 12 variables, since showing all of them would be a lot. You can find the full set of plots on my GitHub.
The pH values lie between 2.72 and 3.82, which means the white wines in the data set are fairly acidic. Most have a pH around 3.2.
Overall, the alcohol histogram shows the count decreasing as the alcohol level rises from about 9% (alcohol is measured in % by volume).
The quality histogram looks roughly normal. There are no perfect wines and no undrinkable wines in the data set; most of the white wines are of average quality.
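As a rough sketch of how a histogram like the ones above can be produced, the snippet below uses base R on synthetic data. The data frame `wine` and every value in it are made up to stand in for the real data set, and the original plots may well have been drawn differently (for example with ggplot2).

```r
# Synthetic stand-in for the wine data; the values are made up for illustration.
set.seed(42)
wine <- data.frame(
  pH      = round(rnorm(500, mean = 3.2, sd = 0.15), 2),
  alcohol = round(runif(500, min = 8, max = 14), 1)
)

# Numeric summary first: min, quartiles, mean, and max.
ph_summary <- summary(wine$pH)
print(ph_summary)

# Histogram of pH; `breaks = 30` is only a suggestion for the number of bins.
ph_hist <- hist(wine$pH, breaks = 30,
                main = "pH of white wines (synthetic data)", xlab = "pH")
```

Printing the summary before plotting makes it easy to check that the range shown on the x-axis matches the minimum and maximum in the data.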
3. Bi-variate Exploration
The ggpairs plot shows strong correlations for density vs. sugar and density vs. alcohol: 0.839 and -0.78, respectively. I will therefore look at scatter plots of these two pairs next.
Density vs. sugar shows a positive correlation: as the sugar level increases, density also increases.
Density vs. alcohol shows a negative correlation: as the alcohol level increases, density decreases.
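The correlation coefficients that a pairs plot reports can also be computed directly. The sketch below does this with base R's `cor` on synthetic data. The data frame `wine` and the linear relationship used to generate `density` are assumptions chosen only so that the signs match the text; the coefficients it prints will not equal 0.839 or -0.78.

```r
# Synthetic stand-in for the wine data; the values are made up for illustration.
set.seed(1)
n <- 300
residual.sugar <- runif(n, min = 1, max = 20)
alcohol        <- runif(n, min = 8, max = 14)
# Made-up linear relationship plus noise: density rises with sugar, falls with alcohol.
density <- 0.995 + 0.0004 * residual.sugar - 0.001 * alcohol + rnorm(n, sd = 0.0005)
wine <- data.frame(residual.sugar, alcohol, density)

# Pearson correlation matrix for the three variables.
cor_mat <- round(cor(wine), 3)
print(cor_mat)

# Scatter plot of density vs. sugar with a least-squares line on top.
plot(wine$residual.sugar, wine$density,
     xlab = "Residual sugar (g / dm^3)", ylab = "Density (g / cm^3)")
abline(lm(density ~ residual.sugar, data = wine), col = "red")
```

The sign of each off-diagonal entry in `cor_mat` tells you the direction of the relationship, and the fitted line gives a quick visual check of the same thing.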
From the boxplots above, a higher alcohol level tends to go with better white wine quality. The pH boxplots show that a higher pH also goes with better quality, within a narrow range of pH values, but the trend is less obvious than for alcohol.
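Boxplots of a feature grouped by quality can be sketched in base R as follows. The data are synthetic: the upward drift of `alcohol` with `quality` is an assumption built into the made-up values so that the plot resembles the trend described above, and it is not a result from the real data set.

```r
# Synthetic stand-in for the wine data; the values are made up for illustration.
set.seed(7)
quality <- sample(3:9, 400, replace = TRUE, prob = c(1, 4, 30, 45, 15, 4, 1))
# Made-up trend: alcohol drifts upward with quality, plus noise.
alcohol <- 9 + 0.4 * (quality - 3) + rnorm(400, sd = 0.8)
wine <- data.frame(quality = factor(quality), alcohol = alcohol)

# One box per quality level.
boxplot(alcohol ~ quality, data = wine,
        xlab = "Quality", ylab = "Alcohol (% by volume)")

# Median alcohol per quality level, to read the trend as numbers.
alcohol_medians <- tapply(wine$alcohol, wine$quality, median)
print(round(alcohol_medians, 2))
```

Comparing the medians side by side is a simple way to judge whether a trend seen in the boxes is clear or only marginal.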
4. Multivariate Exploration
The pH histogram split by quality (above) shows a roughly normal distribution for qualities 5, 6, and 7.
The alcohol histogram split by quality (above) shows that the wines are concentrated at qualities 5, 6, and 7, but it is not clear how the distribution varies with quality.
Higher-quality wines (quality = 7) are concentrated in the bottom-right corner of the scatter plot, which represents low density and high alcohol.
In the density vs. sugar scatter plot, higher-quality wines (quality = 7) sit below the quality-5 wines in density across the sugar range, which suggests that low density and a low sugar level go with better quality.
From what I observed, the effects of alcohol and sugar level reinforce each other in giving better quality.
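A scatter plot colored by quality, like the ones discussed above, can be sketched in base R as shown below. The data are synthetic and `quality` is assigned at random, so the synthetic plot should not be expected to show the pattern described in the text. The data frame `wine` and the color choices are assumptions for illustration.

```r
# Synthetic stand-in for the wine data; the values are made up for illustration.
set.seed(3)
n <- 300
quality <- factor(sample(5:7, n, replace = TRUE))
alcohol <- runif(n, min = 8, max = 14)
density <- 1.005 - 0.001 * alcohol + rnorm(n, sd = 0.001)
wine <- data.frame(quality, alcohol, density)

# One color per quality level, looked up by the level's label.
quality_cols <- c("5" = "firebrick", "6" = "grey50", "7" = "steelblue")
plot(wine$alcohol, wine$density,
     col = quality_cols[as.character(wine$quality)], pch = 19,
     xlab = "Alcohol (% by volume)", ylab = "Density (g / cm^3)")
legend("topright", legend = names(quality_cols),
       col = quality_cols, pch = 19, title = "Quality")

# Mean alcohol and density per quality level, as a numeric companion to the plot.
group_means <- aggregate(cbind(alcohol, density) ~ quality, data = wine, FUN = mean)
print(group_means)
```

The table of group means helps when the colored points overlap too much to see where each quality level is centered.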
5. Final Plots and Summary
The plot shows that density is negatively correlated with alcohol; high-quality white wines are low in density and high in alcohol.
The plot shows a positive correlation between residual.sugar and density. Better-quality wines tend to be low in both sugar level and density.
6. Reflection
In the uni-variate plots, I managed to plot the histograms without any difficulty. However, I did run into problems when plotting the features against quality, because quality was originally stored as an integer. Once I realized this, I converted it to a factor.
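The integer-to-factor conversion mentioned above can be sketched as follows. The tiny data frame is made up for illustration, and the exact code used in the project may differ.

```r
# Tiny made-up example; quality starts out as an integer column.
wine <- data.frame(
  quality = c(5L, 6L, 6L, 7L, 5L, 8L),
  alcohol = c(9.4, 10.1, 11.0, 12.3, 9.8, 12.8)
)
was_integer <- is.integer(wine$quality)

# Convert to a factor so that each score is treated as a discrete group.
wine$quality <- factor(wine$quality)
print(levels(wine$quality))
print(table(wine$quality))

# Going back to numbers: convert via character, because as.integer() on a
# factor returns the internal level codes rather than the original scores.
quality_back <- as.integer(as.character(wine$quality))
```

With quality stored as a factor, grouping and coloring by quality treat each score as its own category instead of a point on a continuous scale.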
I believe multivariate plots are better at showing how the relationships between different features affect quality.