Introduction to Exploratory Data Analysis (EDA)

Published in

The Startup

7 min read · Feb 20, 2021

To share my understanding of the concept and the techniques I know, I will use the House Prices dataset, which is available on Kaggle, and try to extract as many insights from it as possible using EDA.

Here is a quick overview of the things that you are going to learn in this article:

Descriptive Statistics
Outlier Treatment
Grouping of Data
Handling missing values in dataset
Correlation

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats

Descriptive Statistics

Descriptive statistics describe the basic features of a dataset and give a summary of the data, also known as the five-point summary.

Median: The middle value in a column, also called the 50th percentile or 2nd quartile.
1st Quartile: The 25th percentile.
3rd Quartile: The 75th percentile.
Minimum: The smallest observation in a column.
Maximum: The largest observation in a column.

The describe() method in Pandas gives a summary of all numerical columns in the dataset, excluding NaN (Not-a-Number, i.e. missing) values.

# include='all' gives an additional summary of the non-numeric columns
df.describe(include='all')

describe() gives us a brief overview of the data. For example, we can observe that LotFrontage has a count of 1460, which means there are no null values in this variable, and that its mean is 70 while its 50th percentile is 69.0. Since the mean is slightly above the median, we can also conclude that there is slight skewness in the variable; to verify this, one can use the .skew() method in Pandas.
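To make the five-point summary and the skewness check concrete, here is a minimal sketch on a tiny made-up Series. The values are hypothetical and are not taken from the House Prices data.

```python
import pandas as pd

# Hypothetical toy values; 100 pulls the mean far above the median
values = pd.Series([1, 2, 3, 4, 100])

# Five-point summary: minimum, 1st quartile, median, 3rd quartile, maximum
five_point = values.quantile([0, 0.25, 0.5, 0.75, 1.0])
print(five_point)

# A mean well above the median hints at a right (positive) skew
print(values.mean(), values.median())

# .skew() returns a positive number for right-skewed data
print(values.skew())
```

The same three calls can be run on any numeric column of df, for example df['LotFrontage'].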

Univariate Analysis Plots

Plots which can be used for numeric variable analysis

Histogram
KDEplot (kernel density estimate)
Distplot
Boxplot
Violin plot

Plots which can be used for categorical variable analysis

Barchart
Piechart

Box Plot

A boxplot is a standardized way of displaying the distribution of data based on the five-point summary. It can tell you about outliers and what their values are. It also gives an idea of the skewness of the data.

The upper and lower quartiles represent the 75th and 25th percentiles of the data.

sns.boxplot(x='GarageQual', y='SalePrice', data=df)
plt.show()

We can observe that houses with TA garage quality have multiple outliers, houses with poor garage quality fall in a price range of approximately $100,000 to $150,000, and houses with Ex sell for roughly $150,000 to $450,000. The plot also suggests that a house with good garage quality typically costs somewhere around $250,000.

Violin plot

A violin plot is a method of plotting numeric data and can be considered a combination of the box plot and a kernel density plot. In a violin plot, we can find the same information as in a box plot.

The advantage of the violin plot over the box plot is that aside from showing the statistics it also shows the entire distribution of the data. This is of interest, especially when dealing with multimodal data, i.e., a distribution with more than one peak.

plt.figure(figsize=(15,5))
sns.violinplot(x='KitchenQual', y='SalePrice', hue='CentralAir', data=df, split=True)
plt.show()

We can infer that most houses in the Gd category that have central air are priced at around $200,000, while houses without central air sell for less than approximately $180,000. The same pattern can be observed in every other kitchen quality category.

Outlier Treatment

An outlier is a data point that is distant from the other observations, that is, one that lies outside the general distribution of the dataset.

Formula for Z score = (Observation - Mean) / Standard Deviation

Outliers can also be detected with the interquartile range (IQR). First, find the 1st and 3rd quartiles:

q1, q3= np.percentile(dataset,[25,75])

Find the IQR which is the difference between 3rd and 1st quartile

iqr = q3 - q1

Find lower and upper bound

lower_bound = q1 -(1.5 * iqr) 
upper_bound = q3 +(1.5 * iqr)
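The bounds above become useful once they are applied as a filter. A minimal sketch on a small made-up array (the numbers are hypothetical, chosen so that one value is an obvious outlier):

```python
import numpy as np

# Hypothetical toy data; 100 is the obvious outlier
data = np.array([10, 12, 13, 15, 16, 18, 100])

# 1st and 3rd quartiles, then the interquartile range
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# Anything outside [lower_bound, upper_bound] is flagged as an outlier
lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = data[(data < lower_bound) | (data > upper_bound)]
print(lower_bound, upper_bound)  # 5.75 23.75
print(outliers)                  # [100]
```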

We will use z_score for the treatment of outliers.

Z-scores may be positive or negative, with a positive value indicating the score is above the mean and a negative score indicating it is below the mean.

Any value that lies more than 3 standard deviations above or below the mean (a z-score above +3 or below -3) is considered an outlier.
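As a quick check of the formula, the sketch below computes z-scores by hand on a made-up array and compares them with stats.zscore. It assumes the default behaviour of stats.zscore, which uses the population standard deviation (ddof=0), the same default as NumPy's std().

```python
import numpy as np
import scipy.stats as stats

# Hypothetical toy data with mean 5 and population standard deviation 2
data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

# Z score = (Observation - Mean) / Standard Deviation
manual = (data - data.mean()) / data.std()

# Same computation through SciPy
z = stats.zscore(data)

print(manual)
print(np.allclose(manual, z))  # True
```

Note that the pandas Series.std() method defaults to the sample standard deviation (ddof=1), so a hand-computed z-score using it will differ slightly from stats.zscore.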

Detect Outlier (z_score)

df['Zscore_SalePrice'] = stats.zscore(df['SalePrice'])
df[(df['Zscore_SalePrice'] < -3) | (df['Zscore_SalePrice'] > 3)]

The output will show 298 rows and 10 columns.

Remove Outlier

# keep only the rows within 3 standard deviations of the mean
df = df[(df['Zscore_SalePrice'] > -3) & (df['Zscore_SalePrice'] < 3)]

For the purpose of demonstration we will remove these outliers from the dataset, and do further analysis on our new dataset.

Scatter Plots

A scatter plot represents the association between two variables. If the variables tend to increase or decrease together, the association is said to be positive. If one variable increases while the other decreases, the association is said to be negative. If there is no pattern, there is no association.

sns.scatterplot(x=df['column_x'], y=df['column_y'])

# hue: grouping variable that will produce points with different colors.
# Can be either categorical or numeric, although color mapping will behave
# differently in the latter case.
sns.scatterplot(x=df['GarageArea'], y=df['SalePrice'], hue=df['Foundation'])

We can observe that the price of a house depends on the size of its garage area, but not much on its foundation. From the above graph we can also conclude that there is a roughly linear relationship between GarageArea and SalePrice: houses with a bigger garage sell for more than houses with either no garage or a small garage area.

Grouping of Data

Assume we want to know the average price of houses by garage type and central air conditioning, and observe how SalePrice differs between the groups. A nice way to do this is to group the data by GarageType and CentralAir and take the mean of SalePrice.

df.groupby(['GarageType', 'CentralAir'])['SalePrice'].mean()

From this output we can clearly see that houses with central air are more expensive than houses without it.

We can also visualise this in the form of a bar graph:

df.groupby(['GarageType','CentralAir'])['SalePrice'].mean().unstack(1).plot.barh()
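To see what the grouped result and its unstacked form look like, here is a small sketch on a made-up frame. The column names match the dataset, but the rows and prices are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows, not real House Prices records
toy = pd.DataFrame({
    'GarageType': ['Attchd', 'Attchd', 'Attchd', 'Detchd', 'Detchd', 'Detchd'],
    'CentralAir': ['Y', 'Y', 'N', 'Y', 'N', 'N'],
    'SalePrice': [200000, 220000, 120000, 150000, 90000, 110000],
})

# Mean SalePrice for every (GarageType, CentralAir) pair
grouped = toy.groupby(['GarageType', 'CentralAir'])['SalePrice'].mean()
print(grouped)

# unstack(1) moves the second index level (CentralAir) into columns,
# which is the shape that plot.barh() draws as grouped bars
wide = grouped.unstack(1)
print(wide)
```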

Handling missing values

Missing values are entries for which no data was recorded in a particular observation. Analysing them is important, because ignoring them may lead to weak or biased analysis.

We can handle missing values in many ways:

Delete: The dropna() method from the Pandas library can be used to delete rows or columns that contain missing values; one can delete entire rows with axis=0 or entire columns with axis=1.

# drop the rows in which column_n is missing
df = df.dropna(subset=['column_n'], axis=0)

df = df.dropna(subset=['Exterior2nd'], axis=0)

Impute: Deleting data might cause a huge amount of data loss, so replacing missing values might prove to be a better option than deleting them. For imputation one can use the fillna() method and replace the missing values with the mean or median of that particular column, as required.

df['column_n'] = df['column_n'].fillna(df['column_n'].mean())

df['FireplaceQu'] = df['FireplaceQu'].fillna(0)

0 is used for imputation here because houses that do not have a fireplace show NaN values, so it is logical to fill these with 0. For other variables, a measure of central tendency can be used to fill null values.

Predictive filling: The interpolate() method performs a linear interpolation by default in order to fill missing values; it supports multiple methods such as linear, time, index, values, nearest, zero, etc.

df.interpolate(method ='linear', limit_direction ='forward')
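The three strategies can be compared side by side on a tiny made-up frame. This is only a sketch: the values are hypothetical, and the column names are borrowed from the dataset for readability.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing LotFrontage value
toy = pd.DataFrame({
    'LotFrontage': [60.0, np.nan, 80.0, 100.0],
    'SalePrice': [150000, 160000, 180000, 210000],
})

# Delete: keep only the rows where LotFrontage is present
dropped = toy.dropna(subset=['LotFrontage'])

# Impute: replace the missing value with the column mean (80.0 here)
imputed = toy['LotFrontage'].fillna(toy['LotFrontage'].mean())

# Predictive filling: linear interpolation between the neighbouring rows
interpolated = toy['LotFrontage'].interpolate(method='linear')

print(len(dropped))           # 3
print(imputed.tolist())       # [60.0, 80.0, 80.0, 100.0]
print(interpolated.tolist())  # [60.0, 70.0, 80.0, 100.0]
```

Each call returns a new object and leaves toy unchanged, so the result has to be assigned back to keep it.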

HeatMap

We will generate a heatmap of the output of isnull() in order to detect missing values.

sns.heatmap(df.isnull())
plt.show()
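A heatmap shows where the gaps are, but exact numbers are often easier to read as a table. A minimal sketch, using a hypothetical toy frame, of counting missing values per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in columns 'a' and 'b'
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0],
    'b': ['x', None, None],
    'c': [1, 2, 3],
})

# Number of missing values in each column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Share of missing values in each column (between 0 and 1)
missing_share = toy.isnull().mean()
print(missing_share)
```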

Correlation

The correlation coefficient is a statistical measure of the strength of the relationship between the relative movements of two variables.

In other words, when we compare two variables: if one variable changes, how does the other variable change?

The values range between -1.0 and 1.0. A correlation of -1.0 shows a perfect negative correlation, while a correlation of 1.0 shows a perfect positive correlation.
A correlation of 0.0 shows no linear relationship between the movement of the two variables.
For example, smoking is known to be correlated with lung cancer, since smoking increases the chances of lung cancer.

Correlation does not imply Causation
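To illustrate the -1.0 to 1.0 range, the sketch below computes the Pearson correlation coefficient by hand on made-up data and compares it with the pandas Series.corr() method. The helper name pearson is hypothetical and is introduced only for this example.

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y_pos = 2 * x + 1     # rises with x: perfect positive relationship
y_neg = -3 * x + 10   # falls as x rises: perfect negative relationship

def pearson(a, b):
    """Pearson r: co-variation of a and b divided by their individual spreads."""
    a_c = a - a.mean()
    b_c = b - b.mean()
    return (a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())

print(pearson(x, y_pos), x.corr(y_pos))  # both close to 1.0
print(pearson(x, y_neg), x.corr(y_neg))  # both close to -1.0
```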

plt.figure(figsize=(15,7))
# Correlation matrix of the numeric columns
corr_matrix = df.select_dtypes(include='number').corr()
# Boolean matrix of False values with the same shape as corr_matrix
mask = np.zeros_like(corr_matrix, dtype=bool)
# Mark the upper triangle (including the diagonal) so the heatmap hides it
mask[np.triu_indices_from(mask)] = True
with sns.axes_style("white"):
    sns.heatmap(corr_matrix, mask=mask, square=True)

In our case, the dataset has too many variables to read each one clearly from a single heatmap. We can also print the correlation matrix itself and work on each variable independently for better analysis.

From the above plot we can see that GrLivArea and SalePrice have a positive correlation (score of 0.70862) with each other; in other words, as GrLivArea increases, the price of the house tends to increase.
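One way to work on each variable independently is to take a single column of the correlation matrix and sort it. A minimal sketch on a made-up frame (the column names come from the dataset, the numbers are hypothetical):

```python
import pandas as pd

# Hypothetical toy rows, not real House Prices records
toy = pd.DataFrame({
    'GrLivArea': [1000, 1500, 2000, 2500, 3000],
    'YearBuilt': [1950, 2005, 1970, 1990, 2010],
    'SalePrice': [100000, 160000, 190000, 260000, 300000],
})

# Correlation of every other column with SalePrice, strongest first
corr_with_price = (
    toy.corr()['SalePrice']
    .drop('SalePrice')
    .sort_values(ascending=False)
)
print(corr_with_price)
```

On the full dataset the same idea would apply to the numeric columns of df, which makes the strongest relationships easier to spot than on a large heatmap.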

plt.figure(figsize=(10,5))
sns.regplot(x='GrLivArea',y='SalePrice',data=df)

The above plot shows the positive correlation between GrLivArea size and SalePrice.

This was a brief introduction to Exploratory Data Analysis.