Exploratory Data Analysis

Vivek Menon
Published in Analytics Vidhya
6 min read · Sep 3, 2020


When we hear about data science or analytics, the first things that come to mind are modelling, tuning and so on. But one of the most important steps, and one that comes before all of these, is Exploratory Data Analysis, or EDA.

Exploratory data analysis (Machine learning process steps)

Why EDA

One of the major problems data scientists and analysts face today is data quality. Because we rely on multiple sources for data, its quality is often compromised. The quality of the data determines the quality of the models we build on it. As the adage goes: garbage in, garbage out. That holds very true in data science.

We cannot build the Empire State Building or the Burj Khalifa on a shaky foundation!

And that explains why data scientists are often said to spend 60–80% of their time on data gathering and data preparation.

When we work with data, EDA is the most important step. It is essential to gather as much information and insight from the data as we can before processing it, and EDA is how we do that. EDA also helps us to analyse the underlying trends and patterns in the data and to formulate our problem statement in a better way.

“Well begun is half done”

Exploratory Data Analysis helps us to understand the data better and to hear what the data is saying. This can be done through visual analysis as well as a few other kinds of analysis. EDA also helps to distinguish between what should be pursued further and what is not worth following up.

Exploratory Data Analysis

Let’s explore the steps of exploratory data analysis using a bank loan data set.

Import the Libraries:

To perform the initial analysis, we need libraries such as NumPy, Pandas, Seaborn and Matplotlib. NumPy is an array processing package for numerical computation. Pandas is used for data manipulation and analysis. Matplotlib is a plotting library, and Seaborn is a statistical visualization library built on top of Matplotlib.
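As a minimal sketch, the imports for the four libraries named above would typically look like this. The aliases (np, pd, plt, sns) are the community conventions; pd is the one this post itself uses in pd.read_csv.

```python
import numpy as np               # arrays and numerical computation
import pandas as pd              # data manipulation and analysis
import matplotlib.pyplot as plt  # base plotting library
import seaborn as sns            # statistical plots built on Matplotlib
```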

Import Dataset:

The data is stored in CSV format, so we import it using pd.read_csv.

The data imported from the file is stored in the bankloan_df dataframe.
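A rough sketch of the import step is shown here. The actual file is not available in this text, so a small in-memory CSV with hypothetical column names (age, income, education, personal_loan) stands in for it; with a real file you would pass its path to pd.read_csv instead.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file; the columns and values are made up
# for illustration and are not the real bank loan data.
csv_text = """age,income,education,personal_loan
25,49,1,0
45,34,1,0
39,11,2,0
35,100,3,1
"""

# With a file on disk this would be pd.read_csv("<path to your csv file>")
bankloan_df = pd.read_csv(io.StringIO(csv_text))

# Peek at the first rows to confirm the import worked
print(bankloan_df.head())
```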

Information of data set:

.info() displays information about the dataframe.

It shows the column names, the number of rows and columns, the data types and so on, which gives an idea of what kind of data we are dealing with. It is very important to understand whether a column represents a categorical or a numerical variable and, if it is categorical, whether it is ordinal or nominal. Each of these data types needs to be treated differently, which I will explain in another post. You can use .astype to change the data type of a column.
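Here is a small sketch of .info() and .astype on a toy dataframe. The column names and the choice to treat education as ordinal and personal_loan as nominal are assumptions made for illustration only.

```python
import pandas as pd

# Toy dataframe with hypothetical columns standing in for the bank loan data
bankloan_df = pd.DataFrame({
    "age": [25, 45, 39, 35],
    "income": [49, 34, 11, 100],
    "education": [1, 1, 2, 3],
    "personal_loan": [0, 0, 0, 1],
})

# Column names, non-null counts, dtypes and memory usage
bankloan_df.info()

# Assumed ordinal variable: levels 1 < 2 < 3 have a meaningful order
education_type = pd.CategoricalDtype(categories=[1, 2, 3], ordered=True)
bankloan_df["education"] = bankloan_df["education"].astype(education_type)

# Assumed nominal variable: two unordered classes
bankloan_df["personal_loan"] = bankloan_df["personal_loan"].astype("category")

print(bankloan_df.dtypes)
```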

If you only need the number of rows and columns, use .shape.

To see the data types, use bankloan_df.dtypes.

To count the null values in each column, use bankloan_df.isnull().sum().
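The three checks above can be sketched together as follows. The toy dataframe, including its deliberately missing values, is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy dataframe with hypothetical columns and two deliberately missing values
bankloan_df = pd.DataFrame({
    "age": [25, 45, np.nan, 35],
    "income": [49, 34, 11, np.nan],
    "personal_loan": [0, 0, 0, 1],
})

# .shape is an attribute (no parentheses) holding (rows, columns)
n_rows, n_cols = bankloan_df.shape
print(n_rows, n_cols)

# Data type of each column
print(bankloan_df.dtypes)

# isnull() marks missing cells as True; sum() counts them per column
missing_per_column = bankloan_df.isnull().sum()
print(missing_per_column)
```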

Descriptive Analysis:

.describe() is used for descriptive analysis. It provides details such as the count, mean, standard deviation, minimum, maximum and the quartiles, from which the interquartile range can be derived. This analysis helps to understand the skewness of the data, for example by comparing the mean with the median.

For categorical variables, we use groupby to check how well the different groups are represented, that is, whether any group is over-represented relative to the others. If the target variable is imbalanced in this way, we need to treat it with techniques such as SMOTE.
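As a small illustration, the sketch below runs .describe() on a hypothetical income column with one very large value, then derives the interquartile range and the skewness from it. The numbers are made up.

```python
import pandas as pd

# Hypothetical income column with one very large value
bankloan_df = pd.DataFrame({"income": [10, 20, 30, 40, 200]})

# count, mean, std, min, 25%, 50% (median), 75%, max
summary = bankloan_df["income"].describe()
print(summary)

# describe() reports the quartiles; the interquartile range is their difference
iqr = summary["75%"] - summary["25%"]

# A mean well above the median hints at right skew; .skew() quantifies it
skewness = bankloan_df["income"].skew()
print(iqr, skewness)
```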
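A minimal sketch of checking group representation with groupby follows, assuming a hypothetical binary target column called personal_loan. It only measures the imbalance; resampling techniques such as SMOTE are not shown here.

```python
import pandas as pd

# Toy data with a hypothetical, heavily imbalanced target column
bankloan_df = pd.DataFrame({
    "education": [1, 1, 2, 3, 1, 2, 3, 1, 2, 1],
    "personal_loan": [0, 0, 0, 1, 0, 0, 0, 0, 0, 0],
})

# Number of rows in each target class
group_counts = bankloan_df.groupby("personal_loan").size()
print(group_counts)

# The same information expressed as a share of all rows
group_share = bankloan_df["personal_loan"].value_counts(normalize=True)
print(group_share)
```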

Graphical analysis:

Graphs are a very important tool for understanding how the data is distributed. We use different graphs for univariate, bivariate and multivariate analysis. Seaborn is a very good library for exploring different graphs. I will explain a few very common graphs here and will write a detailed post about graphs later.

Univariate analysis: analysis where we consider only one variable. A few univariate graphs are the count plot and the box plot.

Countplot: a count plot shows the number of observations in each category using bars.

Boxplot: a box plot (or box-and-whisker plot) shows the distribution of quantitative data. The box shows the quartiles of the dataset, while the whiskers extend to show the rest of the distribution, except for points that are determined to be “outliers” using a method that is a function of the interquartile range.

We also use box plots to identify outliers.
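A count plot could be sketched like this, using a hypothetical education column. In a notebook the figure renders automatically; in a script you would call plt.show() at the end.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical categorical column
bankloan_df = pd.DataFrame({"education": [1, 1, 2, 3, 1, 2, 3, 1]})

fig, ax = plt.subplots(figsize=(5, 4))

# One bar per category; the bar height is the number of rows in that category
sns.countplot(x="education", data=bankloan_df, ax=ax)
ax.set_title("Observations per education level")
```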
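The sketch below draws a box plot for a hypothetical income column and then lists the outliers numerically. It assumes the common 1.5 × IQR rule, which is also the default whisker length seaborn uses; other cut-offs are possible.

```python
import pandas as pd
import seaborn as sns

# Hypothetical income column with one very large value
bankloan_df = pd.DataFrame({"income": [10, 20, 30, 40, 200]})

# Points beyond the whiskers are drawn individually as outliers
ax = sns.boxplot(x=bankloan_df["income"])

# The same idea in numbers: flag values more than 1.5 * IQR outside the quartiles
q1 = bankloan_df["income"].quantile(0.25)
q3 = bankloan_df["income"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (bankloan_df["income"] < lower) | (bankloan_df["income"] > upper)
outliers = bankloan_df[is_outlier]
print(outliers)
```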

In bivariate analysis, the relationship between two variables is plotted on a graph. In multivariate analysis, the relationships among more than two variables are represented using graphs.

Pairplot is a bivariate graph that plots the relationship between every pair of variables in a dataset. This is a very important step before model building.

Correlation

Correlation is another important step of EDA. While building a model, it is important to understand whether any correlation exists among the independent variables, and between each independent variable and the dependent variable. This also helps in feature selection and elimination.

Values close to +1 or -1 indicate the strongest positive or negative correlation, while values near 0 indicate little linear relationship. The values on the diagonal are the correlation of each variable with itself and are always +1.

Correlation graphs can be drawn from the correlation matrix of the dataframe, for example as a heatmap.
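The original snippet is not reproduced in this text, so what follows is only a minimal sketch of one common approach: compute the correlation matrix with .corr() and draw it as a seaborn heatmap. The column names and values are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy data with hypothetical numeric columns; experience closely tracks age
bankloan_df = pd.DataFrame({
    "age": [25, 45, 39, 35, 37, 53, 50, 29],
    "experience": [1, 19, 15, 9, 13, 27, 24, 5],
    "income": [49, 34, 11, 100, 45, 72, 22, 81],
})

# Pairwise Pearson correlation between the numeric columns
corr_matrix = bankloan_df.corr()
print(corr_matrix)

fig, ax = plt.subplots(figsize=(6, 5))

# annot=True writes each coefficient in its cell; the colour scale is fixed
# to the full range of possible correlation values, -1 to +1
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", vmin=-1, vmax=1, ax=ax)
ax.set_title("Correlation matrix")
```

Pairs of independent variables with coefficients close to +1 or -1, such as age and experience in this toy data, are the ones to look at first when considering feature elimination.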

These are the first few steps of exploratory data analysis. Based on the findings of each step, one can take appropriate action to improve data quality, analyse trends, or treat missing values, outliers and anomalies.

“Information is the oil of the 21st century, and analytics is the combustion engine.” — Peter Sondergaard, Gartner Research


Vivek Menon
Analytics Vidhya

An Enthusiastic Learner in the field of Artificial Intelligence, Machine Learning and Neural Networks. https://www.linkedin.com/in/viveksmenon/