Analytics Vidhya

Analytics Vidhya is a community of Generative AI and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

The Why And How Of Exploratory Data Analysis In Python

--

EDA | Data Analysis | Python | Pandas |

Exploratory Data Analysis (EDA)

1. What Is Exploratory Data Analysis In Python?

Exploratory Data Analysis (EDA) in Python is the first step in your data analysis process developed by “John Tukey” in the 1970s. In statistics, exploratory data analysis is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. By the name itself, we can get to know that it is a step in which we need to explore the data set.

2. Need For Exploratory Data Analysis

Exploratory Data Analysis is a crucial step before you jump to machine learning or modeling of your data. By doing this you can get to know whether the selected features are good enough to model, are all the features required, are there any correlations based on which we can either go back to the Data Pre-processing step or move on to modeling.

Once Exploratory Data Analysis is complete and insights are drawn, its feature can be used for supervised and unsupervised machine learning modeling.

3. What Are The Steps In Exploratory Data Analysis In Python?

There are many steps for conducting Exploratory data analysis. I want to discuss regarding the below few steps using the Boston Data Set which can be imported from sklearn.datasets import load_boston

  • Description of data
  • Handling missing data
  • Handling outliers
  • Understanding relationships and new insights through plots

a) Description of data:

We need to know the different kinds of data and other statistics of our data before we can move on to the other steps. A good one is to start with the describe() function in python. In Pandas, we can apply describe() on a DataFrame which helps in generating descriptive statistics that summarize the central tendency, dispersion, and shape of a dataset’s distribution, excluding NaN values.

The result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default, the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.

Display first five rows of boston dataset
Descriptive Statistics

b) Handling missing data:

Data in the real-world are rarely clean and homogeneous. Data can either be missing during data extraction or collection due to several reasons. Missing values need to be handled carefully because they reduce the quality of any of our performance matrix. It can also lead to wrong prediction or classification and can also cause a high bias for any given model being used. There are several options for handling missing values. However, the choice of what should be done is largely dependent on the nature of our data and the missing values. Below are some of the techniques:

  • Drop NULL or missing values
  • Fill Missing Values
  • Predict Missing values with an ML Algorithm

Drop NULL or missing values:

This is the fastest and easiest step to handle missing values. However, it is not generally advised. This method reduces the quality of our model as it reduces sample size because it works by deleting all other observations where any of the variables is missing.

The above code indicates that there are no null values in our data set.

Fill Missing Values:

This is the most common method of handling missing values. This is a process whereby missing values are replaced with a test statistic like mean, median or mode of the particular feature the missing value belongs to. Let’s suppose we have a missing value of age in the boston data set. Then the below code will fill the missing value with the 30.

c) Handling outliers:

An outlier is something which is separate or different from the crowd. Outliers can be a result of a mistake during data collection or it can be just an indication of variance in your data. Some of the methods for detecting and handling outliers:

  • BoxPlot
  • Scatterplot
  • Z-score
  • IQR(Inter-Quartile Range)

BoxPlot:

A box plot is a method for graphically depicting groups of numerical data through their quartiles. The box extends from the Q1 to Q3 quartile values of the data, with a line at the median (Q2). The whiskers extend from the edges of the box to show the range of the data. Outlier points are those past the end of the whiskers. Boxplots show robust measures of location and spread as well as providing information about symmetry and outliers.

Box plot showing Outliers

Scatterplot:

A scatter plot is a mathematical diagram using Cartesian coordinates to display values for two variables for a set of data. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis. The points that are far from the population can be termed as an outlier.

Z-score:

The Z-score is the signed number of standard deviations by which the value of an observation or data point is above the mean value of what is being observed or measured. While calculating the Z-score we re-scale and center the data and look for data points that are too far from zero. These data points which are way too far from zero will be treated as the outliers. In most of the cases a threshold of 3 or -3 is used i.e if the Z-score value is greater than or less than 3 or -3 respectively, that data point will be identified as outliers.

We can see from the above code that the shape changes, which indicates that our dataset has some outliers.

IQR:

The interquartile range (IQR) is a measure of statistical dispersion, being equal to the difference between 75th and 25th percentiles, or between upper and lower quartiles.

IQR = Q3 − Q1.

Once we have IQR scores below code will remove all the outliers in our dataset.

d) Understanding relationships and new insights through plots :

We can get many relations in our data by visualizing our dataset. Let’s go through some techniques in order to see the insights.

  • Histogram
  • HeatMaps
  • pairplot
  • relplot

Histogram:

A histogram is a great tool for quickly assessing a probability distribution that is easy for interpretation by almost any audience. Python offers a handful of different options for building and plotting histograms.

HeatMaps:

The Heat Map procedure shows the distribution of a quantitative variable over all combinations of 2 categorical factors. If one of the 2 factors represents time, then the evolution of the variable can be easily viewed using the map. A gradient color scale is used to represent the values of the quantitative variable. The correlation between two random variables is a number that runs from -1 through 0 to +1 and indicates a strong inverse relationship, no relationship, and a strong direct relationship, respectively.

Correlation value lies between -1 to +1.

Highly correlated variables will have correlation value close to +1

Less correlated variables will have correlation value close to -1

The diagonal elements of the matrix value are always 1 as we are finding the correlation between the same columns.

That’s It!

Thanks for reading!

Found this article useful? Follow me (Anuganti Suresh) on Medium and check out my most popular articles! Please 👏 this article to share it!

References:

Clap if you liked the article!

--

--

Analytics Vidhya
Analytics Vidhya

Published in Analytics Vidhya

Analytics Vidhya is a community of Generative AI and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Anuganti Suresh
Anuganti Suresh

Written by Anuganti Suresh

Working as Automotive design engineer. Actively looking for change the domain into Data Science. Certified from Simplilearn as “Data Scientist”.

No responses yet