Exploratory Data Analysis on Wine Data Set

Sep 7, 2019

What is EDA?

According to Wikipedia, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.

EDA helps us uncover the underlying structure of the data and get the most insight out of it. It is also useful for identifying important variables and for detecting outliers and anomalies, which makes it one of the most important steps in understanding a data set.

Some of the common plots used for Exploratory Data Analysis:

Histograms
Scatter plots
Pair plots
Box plots
Violin plots
Distribution plots

Let's do EDA on the wine data set. You can download the wine data set from

https://archive.ics.uci.edu/ml/datasets/wine+quality

EDA on the wine data set

First, import some essential Python libraries.
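The original import cell is not reproduced here, so the following is only a guess at a typical set of imports for this kind of analysis. The aliases (np, pd, plt, sns) are the usual community conventions, not something the data set requires.

```python
# A typical import cell for EDA (an assumption, not the author's exact cell).
import numpy as np                 # numerical helpers
import pandas as pd                # loading and summarizing tabular data
import matplotlib.pyplot as plt    # base plotting
import seaborn as sns              # statistical plots built on matplotlib
```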

Then load the data using the pandas library.
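A minimal sketch of the loading step could look like this. It assumes the downloaded file is semicolon-separated and named winequality-white.csv; both are assumptions about the download, so adjust them to your copy. To keep the sketch runnable without the file, it reads a few illustrative rows from a string. The helper name load_wine is made up for this example.

```python
import io
import pandas as pd

# Illustrative stand-in rows, laid out like the (assumed) semicolon-separated file.
SAMPLE_CSV = """fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality
7.0;0.27;0.36;20.7;0.045;45.0;170.0;1.0010;3.00;0.45;8.8;6
6.3;0.30;0.34;1.6;0.049;14.0;132.0;0.9940;3.30;0.49;9.5;6
8.1;0.28;0.40;6.9;0.050;30.0;97.0;0.9951;3.26;0.44;10.1;6
7.2;0.23;0.32;8.5;0.058;47.0;186.0;0.9956;3.19;0.40;9.9;6
"""

def load_wine(source):
    """Read a semicolon-separated wine file (path or file-like object)."""
    return pd.read_csv(source, sep=";")

df = load_wine(io.StringIO(SAMPLE_CSV))
# With the real download (assumed file name):
# df = load_wine("winequality-white.csv")
```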

The shape of the data is (4898, 12), which means there are 4898 rows and 12 columns.

To see the columns, we can use df.columns, which returns the names of all the features in the data.

Let's look at a few rows of the data.
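The three inspection steps above (shape, column names, first rows) can be sketched together. The rows below are illustrative stand-ins in the file's layout, so the shape printed here is (4, 12) rather than the (4898, 12) of the full file.

```python
import io
import pandas as pd

# Illustrative rows only; the full file has 4898 rows according to the text.
SAMPLE_CSV = """fixed acidity;volatile acidity;citric acid;residual sugar;chlorides;free sulfur dioxide;total sulfur dioxide;density;pH;sulphates;alcohol;quality
7.0;0.27;0.36;20.7;0.045;45.0;170.0;1.0010;3.00;0.45;8.8;6
6.3;0.30;0.34;1.6;0.049;14.0;132.0;0.9940;3.30;0.49;9.5;6
8.1;0.28;0.40;6.9;0.050;30.0;97.0;0.9951;3.26;0.44;10.1;6
7.2;0.23;0.32;8.5;0.058;47.0;186.0;0.9956;3.19;0.40;9.9;6
"""
df = pd.read_csv(io.StringIO(SAMPLE_CSV), sep=";")

n_rows, n_cols = df.shape           # (number of rows, number of columns)
feature_names = list(df.columns)    # names of all features
first_rows = df.head(3)             # head() returns 5 rows by default; 3 here

print(n_rows, n_cols)
print(feature_names)
print(first_rows)
```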

The pandas describe() method summarizes the data. It returns the count, mean, standard deviation, minimum and maximum values, and the quartiles of each numeric column.

As we can see here, the mean is greater than the median (the 50% row) in most columns, which points to right-skewed distributions.
There is a large difference between the 75th percentile and the max values of residual sugar, free sulfur dioxide and total sulfur dioxide, which suggests outliers in these features.
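To make these two checks concrete, here is a small sketch that compares the mean with the median and the 75th percentile with the maximum. The numbers are hypothetical toy values chosen to have a long right tail; they are not taken from the wine file.

```python
import pandas as pd

# Hypothetical toy values, picked only to show a long right tail.
df = pd.DataFrame({
    "residual sugar": [1.2, 1.6, 5.2, 6.9, 8.5, 9.9, 20.7, 65.8],
    "alcohol": [8.8, 9.5, 9.9, 10.1, 10.4, 11.0, 11.4, 12.8],
})

summary = df.describe()   # rows: count, mean, std, min, 25%, 50%, 75%, max

# Mean minus median: a positive gap hints at a right-skewed column.
mean_minus_median = summary.loc["mean"] - summary.loc["50%"]

# Max minus 75th percentile: a large gap flags possible outliers.
upper_tail_gap = summary.loc["max"] - summary.loc["75%"]

print(summary)
print(mean_minus_median)
print(upper_tail_gap)
```

In this toy frame the gap for residual sugar is far larger than for alcohol, which is the kind of pattern worth a closer look with a box plot.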

Let's check whether there are any missing values in the data.

There is no missing data.
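One common way to run this check is isnull().sum(), which counts missing entries per column. The sketch below uses a hypothetical toy frame with one value removed on purpose so the count is visible; on the wine data every count would be zero, per the result above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one deliberately missing pH value.
df = pd.DataFrame({
    "pH": [3.00, 3.30, np.nan, 3.19],
    "alcohol": [8.8, 9.5, 10.1, 9.9],
})

missing_per_column = df.isnull().sum()         # number of NaN values per column
has_missing = bool(df.isnull().values.any())   # True if any value is missing

print(missing_per_column)
print(has_missing)
```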

df.info() prints information about the data frame, including the data type of each column and the memory usage of the whole data frame.

The data has only float and integer values.
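As a rough illustration, the sketch below calls info() on a tiny hypothetical frame with one float column and one integer column. Note that info() prints its report and returns None, so the text is captured through the buf parameter when we want to inspect it.

```python
import io
import pandas as pd

# Hypothetical toy frame: one float column, one integer column.
df = pd.DataFrame({
    "density": [1.0010, 0.9940, 0.9951],
    "quality": [6, 6, 6],
})

df.info()   # prints the report to stdout and returns None

# Capture the same report as a string instead of printing it.
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()

# dtypes gives just the data type of each column.
dtypes = df.dtypes.astype(str).to_dict()
print(dtypes)
```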

Next, we print the number of unique values in each feature.

The feature with the most unique values is density.
The feature with the fewest unique values is quality.
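The original function is not reproduced here, so this is one possible version built on nunique(). The name print_unique_counts and the toy values are made up for illustration.

```python
import pandas as pd

def print_unique_counts(frame):
    """Print and return the number of distinct values in each column."""
    counts = frame.nunique()
    for name, n in counts.items():
        print(f"{name}: {n} unique values")
    return counts

# Hypothetical toy frame, not the real wine data.
df = pd.DataFrame({
    "density": [1.0010, 0.9940, 0.9951, 0.9956, 0.9949],
    "pH": [3.00, 3.30, 3.26, 3.19, 3.19],
    "quality": [6, 6, 5, 6, 5],
})

counts = print_unique_counts(df)
most_varied = counts.idxmax()    # column with the most unique values
least_varied = counts.idxmin()   # column with the fewest unique values
```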

seaborn.catplot shows the relationship between a numerical variable and one or more categorical variables using one of several visual representations.

Most wines fall into quality categories 5, 6 and 7.
Only a few observations fall into categories 3 and 9.
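A count plot of quality is one way to get this picture. The sketch assumes kind="count" was the representation used, which the text does not confirm, and the quality values are hypothetical. value_counts() gives the same information as numbers.

```python
import pandas as pd
import seaborn as sns

# Hypothetical quality scores, not the real distribution.
df = pd.DataFrame({"quality": [3, 5, 5, 5, 6, 6, 6, 6, 7, 7, 9]})

# One bar per quality category; catplot returns a seaborn FacetGrid.
grid = sns.catplot(data=df, x="quality", kind="count")
grid.set_axis_labels("quality", "number of wines")

# The same counts as a table, ordered by category.
counts = df["quality"].value_counts().sort_index()
print(counts)
```

In a notebook the figure renders inline; in a script, call matplotlib's plt.show() to display it.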

We can compute correlations with the pandas .corr() method and visualize the correlation matrix as a heatmap in seaborn.

density has a strong positive correlation with residual sugar and a strong negative correlation with alcohol.
pH and fixed acidity have a negative correlation.
density and fixed acidity have a positive correlation.
citric acid and fixed acidity have a positive correlation.
citric acid and volatile acidity have a negative correlation.
free sulfur dioxide and total sulfur dioxide have a positive correlation.
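A minimal sketch of the correlation step follows. The data is synthetic: density is constructed to rise with residual sugar and fall with alcohol, so the signs match the first observation above by design and say nothing about the real coefficients. The colormap and value range are just one reasonable choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic data: density is built to go up with sugar and down with alcohol.
rng = np.random.default_rng(0)
sugar = rng.uniform(1, 20, size=200)
alcohol = rng.uniform(8, 14, size=200)
density = 0.99 + 0.0004 * sugar - 0.001 * alcohol + rng.normal(0, 0.0003, size=200)

df = pd.DataFrame({
    "residual sugar": sugar,
    "alcohol": alcohol,
    "density": density,
})

corr = df.corr()   # pairwise Pearson correlation (the pandas default)

# Heatmap with the coefficient written in each cell.
fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
fig.tight_layout()
```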

A box-and-whisker plot displays the five-number summary of a set of data. The five-number summary is the minimum, first quartile, median, third quartile, and maximum.

In a box plot, we draw a box from the first quartile to the third quartile. A vertical line goes through the box at the median. The whiskers go from each quartile to the minimum or maximum.
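The five-number summary is easy to compute by hand for a small example. The sketch below uses the toy values 1 to 9 and the default linear interpolation of pandas quantile().

```python
import pandas as pd

# Toy values 1..9, chosen so the quartiles land on exact data points.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

five = values.quantile([0, 0.25, 0.5, 0.75, 1.0])
five.index = ["min", "Q1", "median", "Q3", "max"]

iqr = five["Q3"] - five["Q1"]   # interquartile range: the length of the box

print(five)
print(iqr)
```

Plotting libraries often shorten the whiskers: matplotlib, for example, by default ends them at the furthest data point within 1.5 times the IQR of the box and draws anything beyond that as individual points.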

The figure shown below is the box plot of various features.
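One way to draw such a figure is to give each feature its own subplot, since the features live on very different scales. The sketch uses randomly generated stand-in values for three columns, and the layout is a guess, not the author's original figure.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Randomly generated stand-in values, not the real wine measurements.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "fixed acidity": rng.normal(6.9, 0.8, 100),
    "pH": rng.normal(3.2, 0.15, 100),
    "alcohol": rng.normal(10.5, 1.2, 100),
})

# One box plot per feature, each with its own y-axis scale.
fig, axes = plt.subplots(1, len(df.columns), figsize=(9, 3))
for ax, column in zip(axes, df.columns):
    sns.boxplot(y=df[column], ax=ax)
    ax.set_title(column)
fig.tight_layout()
```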