Exploratory Data Analysis (EDA) in Python

Atanu Dan
7 min read · Oct 18, 2020


Introduction

Exploratory Data Analysis (EDA) is the process of understanding a data set by summarizing its main characteristics, often by plotting them. This step is especially important before modeling the data for machine learning. Typical EDA plots include histograms, box plots, scatter plots and many more. Exploring the data often takes considerable time, but through EDA we can define the problem statement for our data set, which is very important.

For data analysis, Exploratory Data Analysis (EDA) must be your first step. It helps us to:

· Gain insight into a data set.

· Understand the underlying structure.

· Extract important parameters and the relationships that hold between them.

· Test underlying assumptions.

It is good practice to understand the data first and to gather as many insights from it as possible. EDA is all about making sense of the data in hand before getting your hands dirty with it.

Understanding EDA using sample Data set

To understand EDA using Python, we can take sample data from any website. I’m using a housing data set as the sample. The data set and code are available at this GitHub link: https://github.com/atanudan/EDA

1. Importing the required libraries for EDA:

Below are the libraries used to perform EDA in this tutorial. The complete code can be found on my GitHub (https://github.com/atanudan/EDA/blob/main/EDA_City_Data).
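The original import cell is not reproduced here, so the following is only a minimal sketch of a typical import list for this kind of analysis. Which libraries the notebook actually imports is an assumption; seaborn is a common companion for statistical plots but is left out of this sketch.

```python
# Core libraries for loading and manipulating tabular data
import numpy as np
import pandas as pd

# Plotting library used for histograms, box plots and scatter plots
import matplotlib.pyplot as plt

# Print the versions so results can be reproduced later
print("pandas", pd.__version__, "| numpy", np.__version__)
```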

2. Loading the data into the data frame: Loading the data into a pandas data frame is one of the most important steps in EDA. Read the CSV file with the read_csv() function of the pandas library; the values in the given data set are separated by the delimiter “,”.

Return the first five observations of the data set with the head() function provided by pandas. Similarly, the tail() function returns the last five observations.
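As a rough illustration of the loading step, the sketch below reads a small inline CSV instead of the real file. The rows are made-up sample values and the file name mentioned in the comment is hypothetical; only the column names come from this article.

```python
import io

import pandas as pd

# Hypothetical stand-in for the CSV file. With the real data you would pass
# the file path instead, e.g. pd.read_csv("your_file.csv", sep=",").
csv_text = """price,room_bed,room_bath,living_measure
600000,4,1.75,3050
190000,2,1.00,670
735000,4,2.75,3040
257000,3,2.50,1740
450000,2,1.00,1120
245000,3,2.50,1610
"""

df = pd.read_csv(io.StringIO(csv_text), sep=",")

print(df.head())   # first five rows (the default)
print(df.tail(2))  # last two rows; tail() without an argument returns five
```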

3. We can get the total number of rows and columns of the data set using the “.shape” attribute, as below:
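A minimal sketch on a tiny made-up data frame (the values are illustrative, not taken from the housing data):

```python
import pandas as pd

df = pd.DataFrame({"price": [600000, 190000, 735000], "room_bed": [4, 2, 4]})

# shape is an attribute, not a method, so it takes no parentheses
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")
```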

4. Checking the types of data: The info() function shows which columns the data set contains, their types, and how many non-null values each one holds.

By observing the above data, we can conclude −

· The data contains 4 float columns, 18 integer columns and only one object column.

· All the columns are non-null (no empty or missing values).
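The kind of check behind these conclusions can be sketched as follows. The three-column frame is a made-up miniature, and the format of the dayhours strings is an assumption.

```python
import pandas as pd

# Illustrative miniature: one text column, one integer and one float column
df = pd.DataFrame({
    "dayhours": ["20150427T000000", "20150317T000000", "20140820T000000"],
    "price": [600000, 190000, 735000],
    "room_bath": [1.75, 1.0, 2.75],
})

df.info()  # prints column names, non-null counts and dtypes

# The same facts as objects you can work with programmatically
dtype_counts = df.dtypes.astype(str).value_counts()  # columns per dtype
missing_per_column = df.isnull().sum()               # missing values per column
print(dtype_counts)
print(missing_per_column)
```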

5. Dropping the duplicate rows: In this data set there are no duplicate rows.
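A small sketch of how duplicates can be counted and dropped. Unlike the housing data, this made-up sample deliberately contains one duplicate row so that the effect is visible.

```python
import pandas as pd

df = pd.DataFrame({"price": [600000, 190000, 190000], "room_bed": [4, 2, 2]})

# duplicated() marks every row that repeats an earlier row
n_duplicates = int(df.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row
deduped = df.drop_duplicates()

print(n_duplicates, "duplicate row(s);", len(deduped), "rows remain")
```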

6. To find the unique values of a selected column, use the unique() function.
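For example, on a made-up sample with the condition and coast columns (values are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "condition": [3, 4, 3, 5, 1, 3],
    "coast": [0, 0, 1, 0, 0, 0],
})

# unique() returns the distinct values in order of first appearance
unique_conditions = df["condition"].unique()

# nunique() gives the number of distinct values for every column at once
n_unique = df.nunique()

print(unique_conditions)
print(n_unique)
```

A column with only a handful of distinct values is a candidate for conversion to a categorical feature.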


From the above data, we can conclude that certain features are stored as integers but should be converted to categorical features, for example bedrooms, bathrooms, coast, sight, condition, quality and furnished. The basement feature, however, can’t be considered categorical.

7. Analyze the outlier to decide whether the whole row should be removed or only the value 33 should be replaced.
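The two options can be sketched as below. The text does not say which column holds the value 33, so placing it in room_bed is purely an assumption for illustration, as are the sample rows and the choice of the median as the replacement.

```python
import pandas as pd

# Assumption: the suspicious value 33 sits in room_bed (not stated in the text)
df = pd.DataFrame({
    "room_bed": [3, 4, 33, 2, 3],
    "price": [257000, 600000, 640000, 190000, 245000],
})
suspect = df["room_bed"] == 33

# Option 1: remove the whole row
dropped = df[~suspect]

# Option 2: keep the row and replace only the suspicious value, here with
# the median of the remaining rows (cast to int to keep an integer column)
median_beds = int(df.loc[~suspect, "room_bed"].median())
replaced = df.copy()
replaced.loc[suspect, "room_bed"] = median_beds

print(len(dropped), "rows after dropping")
print(replaced)
```

Dropping loses the other fields of that row, while replacing keeps them at the cost of an imputed value.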

8. Add more features: In this data set we can extract the yr_sold value from the dayhours column and add it to the data frame as a new column.
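A possible way to do this, assuming the dayhours values are strings such as "20150427T000000" (the exact format is an assumption; adjust the format string to the real data):

```python
import pandas as pd

# Assumed timestamp format: YYYYMMDD, a literal "T", then HHMMSS
df = pd.DataFrame({"dayhours": ["20150427T000000", "20140820T000000"]})

# Parse the strings into datetimes, then keep only the year as a new column
sold_at = pd.to_datetime(df["dayhours"], format="%Y%m%dT%H%M%S")
df["yr_sold"] = sold_at.dt.year

print(df)
```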

9. Now we change these features from integer to categorical using the pandas Categorical() function.

We check info() again to get the data types of all columns; it should now display the categorical features as well.
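A minimal sketch of the conversion on made-up values; the list of columns to convert is illustrative and would be longer for the real data.

```python
import pandas as pd

df = pd.DataFrame({
    "coast": [0, 1, 0],
    "condition": [3, 4, 3],
    "price": [600000, 190000, 735000],
})

# Convert the integer-coded features; price stays numeric
for col in ["coast", "condition"]:
    df[col] = pd.Categorical(df[col])

df.info()  # coast and condition are now listed with dtype "category"
```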

10. Another useful function provided by pandas is describe(), which provides the count, mean, standard deviation, minimum and maximum values, and the quartiles of the data.

From the above data, we can conclude that for some columns the mean is less than the median (the 50% row of the index), while for others the mean is greater than the median.

There is a huge difference between the 75% and max values of the predictors “room_bath”, “ceil_measure”, “living_measure15” and “lot_measure15”.

These four observations indicate that there are extreme values (deviations) in our data set. We need to find the outliers.
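The two comparisons described above (mean against median, and 75% against max) can be computed directly from the describe() output. The sketch below uses made-up values; the helper column names mean_minus_median and max_minus_q3 are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "room_bath": [1.0, 1.75, 2.0, 2.5, 8.0],
    "price": [190000, 257000, 450000, 600000, 735000],
})

# Transpose so each row is a column of df and each column is a statistic
summary = df.describe().T

# Positive: mean above median (pulled up by large values); negative: below
summary["mean_minus_median"] = summary["mean"] - summary["50%"]

# A large gap between max and the 75% quartile hints at extreme high values
summary["max_minus_q3"] = summary["max"] - summary["75%"]

print(summary[["mean", "50%", "75%", "max", "mean_minus_median", "max_minus_q3"]])
```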

11. Find the outliers: With Q1 and Q3 as the first and third quartiles and IQR = Q3 - Q1, any data point that is less than Q1 - 1.5 × IQR or greater than Q3 + 1.5 × IQR is considered an outlier.

Here is a function which returns the outlier values of a given column:
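The original function is not reproduced here; the following is a minimal sketch of the IQR rule just described. The function name find_outliers and the sample values are hypothetical.

```python
import pandas as pd


def find_outliers(series: pd.Series) -> pd.Series:
    """Return the values lying outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return series[(series < lower) | (series > upper)]


# Made-up sample with one clearly extreme value
living = pd.Series([670, 1120, 1610, 1740, 3040, 3050, 12000], name="living_measure")
outliers = find_outliers(living)
print(outliers)
```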

12. To analyze the continuous variable columns, get the outlier count.

As seen above, more than 500 rows were outliers.
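One way such counts could be obtained for several continuous columns at once is sketched below, using the same IQR rule. The sample values are made up, so the counts do not match the figure quoted for the real data.

```python
import pandas as pd

df = pd.DataFrame({
    "living_measure": [670, 1120, 1610, 1740, 3040, 3050, 12000],
    "ceil_measure": [9000, 600, 900, 1200, 1500, 1800, 2100],
})
continuous_cols = ["living_measure", "ceil_measure"]

# Per-column quartiles and IQR-based fences
q1 = df[continuous_cols].quantile(0.25)
q3 = df[continuous_cols].quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# True where a cell falls outside the fences of its own column
mask = (df[continuous_cols] < lower) | (df[continuous_cols] > upper)

outlier_counts = mask.sum()                       # outliers per column
rows_with_outlier = int(mask.any(axis=1).sum())   # rows with at least one

print(outlier_counts)
print(rows_with_outlier, "row(s) contain an outlier")
```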

13. Data Visualizations:

To check missing values:

From the above we can see there are no missing values in the data set. If there were any, they would appear in the figure as a different color shade on the purple background.
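The figure itself is not reproduced here. A plot like the one described is typically drawn from the boolean frame df.isnull(), and the same information can be read as plain counts. The sketch below uses a made-up sample that, unlike the housing data, deliberately contains one missing value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [600000, np.nan, 735000],
    "room_bed": [4, 2, 4],
})

# isnull() gives True for every missing cell; summing counts them per column
missing_counts = df.isnull().sum()

# Single yes/no answer for the whole data frame
has_missing = bool(df.isnull().values.any())

print(missing_counts)
print("Any missing values:", has_missing)
```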

· Analyze an individual column: Using the function below, we can easily visualize the distribution of the data and detect outliers.
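The original helper is not shown here; a minimal sketch of such a function, with a histogram for the distribution and a box plot for outliers, could look like this. The name plot_column and the sample values are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_column(series: pd.Series, bins: int = 10):
    """Draw a histogram and a box plot of one numeric column side by side."""
    values = series.dropna().to_numpy()
    fig, (ax_hist, ax_box) = plt.subplots(1, 2, figsize=(10, 4))
    ax_hist.hist(values, bins=bins)
    ax_hist.set_title(f"Distribution of {series.name}")
    ax_box.boxplot(values)  # points beyond the whiskers are drawn individually
    ax_box.set_title(f"Box plot of {series.name}")
    fig.tight_layout()
    return fig


living = pd.Series([670, 1120, 1610, 1740, 3040, 3050, 12000], name="living_measure")
fig = plot_column(living)
```

In a notebook you would drop the matplotlib.use("Agg") line and let the figure render inline.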

We can analyze 2 columns in a figure:

· Plot different features against one another (scatter), against frequency (histogram):

Histogram: a histogram shows the frequency of occurrence of a variable's values within intervals.
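A rough sketch of both plot types in one figure, on made-up values: a scatter plot of one feature against another and a histogram of a single feature. The choice of living_measure and price is illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "living_measure": [670, 1610, 1740, 1120, 3050, 3040],
    "price": [190000, 245000, 257000, 450000, 600000, 735000],
})

fig, (ax_scatter, ax_hist) = plt.subplots(1, 2, figsize=(10, 4))

# Scatter: one feature against another
ax_scatter.scatter(df["living_measure"], df["price"])
ax_scatter.set_xlabel("living_measure")
ax_scatter.set_ylabel("price")

# Histogram: how many values fall into each of three equal-width intervals
counts, edges, _ = ax_hist.hist(df["price"], bins=3)
ax_hist.set_xlabel("price")
ax_hist.set_ylabel("frequency")

print(counts)  # number of observations per interval
```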

14. Categorical variable analysis: Now we will look at how the data is distributed across a categorical feature. Let’s take coast as an example.

Plotting Bar Plot:
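The plot is not reproduced here. A bar plot of a categorical feature is usually built from its value counts, as in this sketch on made-up coast values:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({"coast": [0, 0, 1, 0, 0, 0]})

# Number of rows per category, ordered by category label
counts = df["coast"].value_counts().sort_index()

fig, ax = plt.subplots()
ax.bar(counts.index.astype(str), counts.to_numpy())
ax.set_xlabel("coast")
ax.set_ylabel("count")

print(counts)
```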

Check for Condition column:

Since the counts for conditions 1 and 2 are low, we can merge these two into one level; 4 and 5 are combined in the same way. That way we reduce the number of levels of condition.
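Merging levels like this can be sketched with a simple mapping. The new column name condition_grouped and the level labels are hypothetical, and the sample values are made up.

```python
import pandas as pd

df = pd.DataFrame({"condition": [1, 2, 3, 3, 4, 5, 3]})

# Map the five original levels onto three merged levels
level_map = {1: "1-2", 2: "1-2", 3: "3", 4: "4-5", 5: "4-5"}
df["condition_grouped"] = df["condition"].map(level_map).astype("category")

print(df["condition_grouped"].value_counts().sort_index())
```

The same approach applies to any other feature whose rare levels are merged.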

Now check for Quality Column:

So here 0–5 are merged into one level, and 10–13 are merged into another level.

Now Check for yr_built column:

15. Bi-Variate Analysis:

HeatMaps: A heat map is a type of plot that is useful when we need to find dependent variables. It is one of the best ways to find the relationships between features.


· Above, positive correlation is represented by dark shades and negative correlation by lighter shades.

· Set annot=True and the output will show, in the grid cells, the values by which features are correlated with each other.

· From the above we can see that there is a strong positive correlation of density with price and living_measure, lat. However, there is a strong negative correlation between price and age_sold.

· Also, there is no correlation between zip code and price.
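The numbers behind a heat map like this come from the correlation matrix of the data frame. The sketch below computes it with pandas on made-up values, so the coefficients are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [190000, 245000, 257000, 450000, 600000, 735000],
    "living_measure": [670, 1610, 1740, 1120, 3050, 3040],
    "age_sold": [60, 50, 45, 30, 10, 5],
})

# Pairwise Pearson correlation between all numeric columns
corr = df.corr()

# Correlation of every other feature with the target, strongest first
price_corr = corr["price"].drop("price").sort_values(ascending=False)

print(corr.round(2))
print(price_corr.round(2))

# With seaborn available, sns.heatmap(corr, annot=True) draws this matrix
# as a heat map with the coefficient printed in each cell.
```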

Plots between the independent variables and price, which is the target:

This is to understand how price changes with different values of room_bed and room_bath.
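Alongside the plots, the same relationship can be summarized numerically by grouping on the feature. A minimal sketch for room_bed with made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    "room_bed": [2, 2, 3, 3, 4, 4],
    "price": [190000, 450000, 257000, 245000, 600000, 735000],
})

# Number of sales plus mean and median price for each bedroom count
price_by_beds = df.groupby("room_bed")["price"].agg(["count", "mean", "median"])

print(price_by_beds)
```

Replacing "room_bed" with "room_bath" gives the corresponding summary for bathrooms.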

Find the correlation between living measure and price:

Bivariate analysis for independent variable being a category and dependent variable being a number:

Customize Bin Size:
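The original cell is not reproduced here, so this is only one possible reading of the step: turn a numeric feature into categories with custom bin edges, then compare the target across those categories. The edges, labels, the column name built_period and the sample values are all assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "yr_built": [1955, 1965, 1978, 1990, 2005, 2012],
    "price": [190000, 245000, 257000, 450000, 600000, 735000],
})

# Custom bin edges; each interval excludes its left edge and includes its right
bin_edges = [1950, 1975, 2000, 2015]
bin_labels = ["1951-1975", "1976-2000", "2001-2015"]
df["built_period"] = pd.cut(df["yr_built"], bins=bin_edges, labels=bin_labels)

# Category on one side, numeric target on the other
median_price = df.groupby("built_period", observed=True)["price"].median()

print(median_price)
```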

These are some of the steps involved in exploratory data analysis; they are general steps that you should follow in order to perform EDA.
