Data Preprocessing and Exploratory Data Analysis for Machine Learning.

Anandaram Ganpathi
Published in Analytics Vidhya · 6 min read · May 18, 2021

What is Exploratory Data Analysis?

Exploratory data analysis (EDA) is a term for certain kinds of initial analysis and findings done with data sets, usually early on in an analytical process. Some experts describe it as “taking a peek” at the data to understand more about what it represents and how to apply it. Exploratory data analysis is often a precursor to other kinds of work with statistics and data.

Credits: Edvicer

Preprocessing:

In EDA, we preprocess the data by determining whether each variable is categorical or numerical, visualizing the variables, and making some statistical decisions based on them.

Let’s see in detail,

First, let's start by importing the libraries and loading the dataset.

Let's check the data types using info(). This gives the data type of each column as well as the dimensions of the dataset, and describe() gives a statistical summary of the dataset.
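As a minimal sketch of this first step — the column names here are hypothetical stand-ins for the movies dataset in the notebook; in practice you would load your own file with pd.read_csv():

```python
import pandas as pd
import numpy as np

# A tiny stand-in for the movies dataset used in the notebook
# (in practice: df = pd.read_csv("your_dataset.csv")).
df = pd.DataFrame({
    "movie_title": ["A", "B", "C", "D"],
    "gross": [1.2e6, 3.4e6, np.nan, 2.2e6],
    "budget": [1.0e6, 2.5e6, 1.8e6, np.nan],
    "imdb_score": [7.1, 6.4, 8.0, 5.9],
})

df.info()             # dtypes, non-null counts, memory usage
print(df.shape)       # (rows, columns)
print(df.describe())  # summary statistics for the numeric columns
```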

Categorical variable:

Variables containing details such as name, gender, address, company, or job role are called categorical variables.

Data type: categorical, object

Numerical variable:

Variables that contain details such as id, salary, or class are called numerical variables.

Data type: integer, float

The picture above helps you decide what kind of variable you have.

Let's split the categorical and numerical variables in code:

i. numerical data: df.select_dtypes(include=np.number)

ii. categorical data: df.select_dtypes(include='object')
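A small runnable sketch of the split, using hypothetical column names on toy data:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "movie_title": ["A", "B", "C"],
    "gross": [1.2e6, 3.4e6, 2.2e6],
    "imdb_score": [7.1, 6.4, 8.0],
})

num_df = df.select_dtypes(include=np.number)  # gross, imdb_score
cat_df = df.select_dtypes(include="object")   # movie_title

print(num_df.columns.tolist())  # ['gross', 'imdb_score']
print(cat_df.columns.tolist())  # ['movie_title']
```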

After splitting the variables, keep them aside for future use.

Let’s deal with df again.

Now check for null values: df.isnull().sum()

The above code gives the count of null values in each column. If nulls are present, you should fill them to get good output; some operations cannot even run on data containing nulls. If a numerical variable has null values, replace them with the mean or median; if a categorical variable has null values, replace them with the mode.

Filling null values for numerical columns:

df[column_name] = df[column_name].fillna(df[column_name].mean())

Filling null values for categorical columns:

df[column_name] = df[column_name].fillna(df[column_name].mode()[0])

Some people use KNNImputer instead, but that is more commonly used for machine learning purposes.
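The two fills above can be sketched on toy data — note that mode() returns a Series, so we take its first entry:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "gross": [1.0, 3.0, np.nan, 4.0],
    "genre": ["Action", "Drama", "Drama", None],
})

# Numerical column: fill with the mean.
df["gross"] = df["gross"].fillna(df["gross"].mean())

# Categorical column: mode() returns a Series, so take its first entry.
df["genre"] = df["genre"].fillna(df["genre"].mode()[0])

print(df.isnull().sum().sum())  # 0 -- no nulls remain
```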

Outliers:

An outlier is a piece of data that is an abnormal distance from the other points. In other words, it is data that lies outside the other values in the set. There are many ways to find outliers; the most widely used technique is visualization with a boxplot.

After using a boxplot, we can now see that a lot of outliers are present in the dataset.

This picture explains the boxplot along with the math behind it.

Outliers treatment by code:

Q1 = df.quantile(.25)

Q3 = df.quantile(.75)

IQR = Q3-Q1

df = df[~((df < (Q1 - 1.5*IQR)) | (df > (Q3 + 1.5*IQR))).any(axis=1)]

The above code will help you to remove the outliers
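The same removal can be run end-to-end on a toy numeric frame with one obvious outlier:

```python
import pandas as pd

# Toy numeric frame with one obvious outlier in "gross".
df = pd.DataFrame({
    "gross": [10, 12, 11, 13, 500],
    "budget": [8, 9, 10, 11, 12],
})

Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1

# Drop rows where any column falls outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
mask = ((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)
clean = df[~mask]

print(len(clean))  # 4 -- the row with gross == 500 is dropped
```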

Population vs Sample data:

The population is the entire data; a sample is a subset of the population. A sample is not guaranteed to carry every characteristic of the population.
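Drawing a sample from a DataFrame is a one-liner — a sketch with hypothetical columns:

```python
import pandas as pd

population = pd.DataFrame({"id": range(100), "salary": range(100)})

# Draw a 20% random sample; random_state makes it reproducible.
sample = population.sample(frac=0.2, random_state=42)

print(len(sample))  # 20 rows out of 100
```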

After filling the null values and removing the outliers, let's check the normality of the data using a distplot.

For that, use df.skew(); it tells you whether each numerical column is normally distributed or skewed.
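A quick check on toy data — a column with a few very large values comes out positively skewed:

```python
import pandas as pd

# A right-skewed column: most values are small, one is very large.
df = pd.DataFrame({"gross": [1, 1, 2, 2, 3, 3, 4, 50]})

print(df.skew())  # positive value -> right (positively) skewed
```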

The following picture tells how the data are distributed:

Right skewed (Positively skewed)
Left skewed (Negatively skewed)

Now we can start working on data visualization.

Barplot:

A bar plot shows point estimates and confidence intervals as rectangular bars. It represents an estimate of central tendency for a numeric variable with the height of each rectangle and indicates the uncertainty around that estimate with error bars.
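A minimal sketch with seaborn, on hypothetical genre/gross columns:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "genre": ["Action", "Action", "Drama", "Drama"],
    "gross": [10, 14, 6, 8],
})

# Bar height = mean gross per genre; the error bar shows the uncertainty.
ax = sns.barplot(data=df, x="genre", y="gross")
```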

Histplot:

A histplot looks much like a bar plot, but the bars represent bins of a continuous variable, so there is no gap between adjacent bars.

Heatmap:

A heatmap is a graphical representation of data in which values are encoded as colors, which makes patterns easy to spot at a glance. In EDA, a heatmap is typically drawn over the correlation matrix, so it is used to show the correlation between the various variables.

Correlation between Multiple variables (Heatmap)

The figure above shows the correlation between the variables: highly correlated pairs are darker blue, and weakly correlated pairs are creamy white.
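A correlation heatmap can be sketched like this, on toy numeric columns:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "gross": [10, 14, 6, 8, 12],
    "budget": [9, 13, 7, 8, 11],
    "imdb_score": [7.0, 6.0, 8.0, 7.5, 6.5],
})

# Correlation matrix of the numeric columns, shaded darker for high values.
corr = df.select_dtypes("number").corr()
ax = sns.heatmap(corr, annot=True, cmap="Blues")
```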

Pairplot:

To plot multiple pairwise bivariate distributions in a dataset, you can use the pairplot() function. It shows the relationships for the (n, 2) combinations of variables in a DataFrame as a matrix of plots, and the diagonal plots are the univariate distributions.

Pairplot

Scatterplot:

A scatter plot is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data. The example scatterplot figure is shown below.

The scatterplot shows the relation between Gross and the number of users who voted for the movie, with the points separated by imdb_score.
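That relationship can be sketched as follows; the column names are hypothetical stand-ins for those in the notebook:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "num_voted_users": [100, 400, 250, 600],
    "gross": [10, 40, 20, 55],
    "imdb_score": [6.0, 7.5, 6.5, 8.0],
})

# Each point is one movie; hue separates the points by score.
ax = sns.scatterplot(data=df, x="num_voted_users", y="gross", hue="imdb_score")
```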

LMplot:

An lmplot is the same as a scatterplot but with a fitted regression line drawn across a FacetGrid. This plot is mostly used for machine learning purposes in supervised learning, to inspect the best fit.
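A minimal sketch on toy budget/gross data:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "budget": [1, 2, 3, 4, 5],
    "gross": [2, 4, 5, 8, 10],
})

# Scatter plus a fitted regression line, drawn on a FacetGrid.
grid = sns.lmplot(data=df, x="budget", y="gross")
```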

Gross vs Budget with best fit line

Distplot:

A distplot is effectively the combination of a kdeplot and a histplot. (Note that recent seaborn versions deprecate distplot; use histplot with kde=True, or displot, instead.)

Jointplot:

Jointplot displays a relationship between 2 variables (bivariate) as well as 1D profiles (univariate) in the margins. This plot is a convenience class that wraps JointGrid.

The relation between gross and budget is shown using a jointplot.

There are many more visualization plots available in seaborn and matplotlib, but these are the most widely used. I have also attached my Jupyter notebook on GitHub; have a look.

https://github.com/anand-lab-172/Data_Preprocessing_and_Exploratory_Data_Analysis_for_Machine_Learning

Thanks for reading. :)

And 💙 if this was a good read. Enjoy!



I’m a Big Data Engineer and Data Scientist with good programming knowledge and skills in Python, SQL, Machine Learning, Google Cloud, Tableau, EDA, Talend.