Data Preprocessing and Exploratory Data Analysis for Machine Learning.
What’s exploratory data analysis?
Exploratory data analysis (EDA) is a term for certain kinds of initial analysis and findings done with data sets, usually early on in an analytical process. Some experts describe it as “taking a peek” at the data to understand more about what it represents and how to apply it. Exploratory data analysis is often a precursor to other kinds of work with statistics and data.
Preprocessing:
In EDA, we preprocess the data by identifying whether each variable is categorical or numerical, visualizing the data, and making some statistical decisions.
Let’s look at each step in detail.
First, let’s start by importing the libraries and loading the dataset.
Let's check the data types using df.info(), which reports the data type of each column along with the dimensions of the dataset; df.describe() gives a statistical description of the data.
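As a minimal sketch of this first step (the CSV file name is hypothetical; a tiny inline frame stands in for a real dataset):

```python
import pandas as pd

# In practice you would load your own file, e.g. df = pd.read_csv("dataset.csv")
# ("dataset.csv" is a made-up name); a small inline frame stands in for it here:
df = pd.DataFrame({
    "name": ["Alice", "Bob", "Carol"],
    "salary": [40000.0, 60000.0, 55000.0],
})

df.info()             # data type of each column and non-null counts
print(df.shape)       # dimensions: (rows, columns)
print(df.describe())  # statistical description of numeric columns
```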
Categorical variable:
Variables containing details like name, gender, address, company, job role, etc. are called categorical variables.
Data type: categorical, object
Numerical variable:
Variables that contain details like id, salary, and class are called numerical variables.
Data type: integer, float
The data type of a column helps you decide which kind of variable you have.
Let’s split the categorical and numerical variables in code:
i. numerical data: df.select_dtypes(include=np.number)
ii. categorical data: df.select_dtypes(include="object")
After splitting the variables keep them aside for future purposes
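A runnable sketch of the split, using made-up example data:

```python
import numpy as np
import pandas as pd

# made-up example data with both variable kinds
df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "gender": ["F", "M"],
    "salary": [40000, 60000],
})

num_df = df.select_dtypes(include=np.number)  # numerical columns
cat_df = df.select_dtypes(include="object")   # categorical columns
print(list(num_df.columns))  # ['salary']
print(list(cat_df.columns))  # ['name', 'gender']
```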
Let’s deal with df again.
Now check for null values: df.isnull().sum()
The above code gives the count of null values per column. If the data contains null values, you should fill them to get good output, since some operations cannot run at all with missing values. If a numerical variable has null values, replace them with the mean or median; if a categorical variable has null values, replace them with the mode.
Filling null values for numerical columns:
df[column_name] = df[column_name].fillna(df[column_name].mean())
Filling null values for categorical columns:
df[column_name] = df[column_name].fillna(df[column_name].mode()[0])
(note that mode() returns a Series, so take its first element)
Some people use KNNImputer instead, though it is more commonly used for machine-learning purposes.
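A minimal KNNImputer sketch on made-up numeric data: each missing value is replaced by the average of the n_neighbors most similar rows, judged by distance on the observed features.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# made-up numeric data with two missing values
df_num = pd.DataFrame({
    "age": [25.0, 30.0, np.nan, 40.0],
    "salary": [40000.0, 60000.0, 65000.0, np.nan],
})

# Each NaN is filled with the mean of the 2 nearest rows
imputer = KNNImputer(n_neighbors=2)
filled = pd.DataFrame(imputer.fit_transform(df_num), columns=df_num.columns)
print(filled.isnull().sum().sum())  # 0 — no missing values remain
```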
Outliers:
An outlier is a piece of data that is an abnormal distance from other points. In other words, it’s data that lies outside the other values in the set. There are many ways to find outliers; the most widely used technique is visualization with a boxplot.
After drawing a boxplot, we are able to see the outliers present in the dataset.
The boxplot encodes the math behind this: the box spans the first and third quartiles, and points beyond the whiskers are flagged as outliers.
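A small sketch of spotting an outlier with a boxplot, on made-up salary data:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# made-up data: 120 sits far from the rest and shows up past the whisker
df = pd.DataFrame({"salary": [40, 42, 45, 44, 43, 41, 120]})
ax = sns.boxplot(data=df, x="salary")
plt.show()
```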
Outliers treatment by code:
Q1 = df.quantile(0.25)
Q3 = df.quantile(0.75)
IQR = Q3 - Q1
df = df[~((df < (Q1 - 1.5 * IQR)) | (df > (Q3 + 1.5 * IQR))).any(axis=1)]
The above code removes every row that contains at least one outlier under the 1.5 × IQR rule.
Population vs Sample data:
The population is the entire data; a sample is a subset of the population. A sample does not necessarily carry every characteristic of the population.
After filling the null values and removing the outliers, let's check the normality of the data using a distplot.
You can also use df.skew(); this tells you whether each column is normally distributed (skewness near 0) or skewed.
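A quick sketch of the skewness check, using made-up salaries where one large value pulls the tail to the right:

```python
import pandas as pd

# made-up data: the 90 drags the distribution's tail to the right
df = pd.DataFrame({"salary": [30, 32, 31, 33, 90]})
print(df.skew())  # positive value: right-skewed; near 0 would mean symmetric
```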
A distribution plot shows how the data are distributed.
So now we can start working on data visualisation.
Barplot:
A bar plot shows point estimates and confidence intervals as rectangular bars. It represents an estimate of central tendency for a numeric variable with the height of each rectangle, and provides some indication of the uncertainty around that estimate using error bars.
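A minimal barplot sketch with made-up department/salary data (column names are invented for illustration):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# made-up example data: bar height = mean salary per department
df = pd.DataFrame({
    "department": ["HR", "IT", "HR", "Sales"],
    "salary": [40000, 60000, 42000, 55000],
})
ax = sns.barplot(data=df, x="department", y="salary")
plt.show()
```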
Histplot:
A histplot looks much like a bar plot, but there is no gap between adjacent bars, and each bar counts the observations falling into a bin.
Heatmap:
A heatmap represents the values of a matrix as colours. In EDA, the heatmap is typically drawn over the correlation matrix, so it is used to visualize the correlation between the various variables at a glance.
The figure above shows the correlation between the various variables: highly correlated areas appear in darker blue, and weakly correlated areas in creamy white.
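A correlation-heatmap sketch on made-up movie-style columns (names invented for illustration):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# made-up numeric data; corr() gives the pairwise correlation matrix
df = pd.DataFrame({
    "budget": [10, 20, 30, 40],
    "gross": [12, 22, 28, 45],
    "runtime": [90, 110, 100, 120],
})
corr = df.corr()
ax = sns.heatmap(corr, annot=True, cmap="Blues")  # darker = stronger correlation
plt.show()
```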
Pairplot:
To plot multiple pairwise bivariate distributions in a dataset, you can use the pairplot() function. This shows the relationship for every pairwise combination of variables in a DataFrame as a matrix of plots, with univariate plots on the diagonal.
Scatterplot:
A scatter plot is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data. The example scatterplot figure is shown below.
The scatterplot relates gross earnings to the number of users who voted for the movie, with points separated by imdb_score.
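A sketch of that kind of plot on made-up data; the column names mirror the movie example in the text but the values are invented:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# made-up stand-in for the movie data described above
df = pd.DataFrame({
    "num_voted_users": [1000, 5000, 20000, 80000],
    "gross": [1.0, 3.5, 9.0, 25.0],
    "imdb_score": [5.1, 6.2, 7.4, 8.0],
})
# hue colours each point by its imdb_score
ax = sns.scatterplot(data=df, x="num_voted_users", y="gross", hue="imdb_score")
plt.show()
```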
LMplot:
The lmplot is the same as the scatterplot, but with a fitted regression line drawn across a FacetGrid. This plot is mostly used in supervised machine learning to inspect the best-fit line.
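An lmplot sketch on made-up, roughly linear data:

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# made-up data with a roughly linear relationship
df = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2.1, 3.9, 6.2, 8.1, 9.8]})
g = sns.lmplot(data=df, x="x", y="y")  # scatter points plus fitted regression line
plt.show()
```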
Distplot:
The distplot is effectively a combination of a kdeplot and a histplot. Note that distplot is deprecated in recent versions of seaborn; histplot with kde=True gives the same view.
Jointplot:
A jointplot displays the relationship between two variables (bivariate) in the centre, as well as their 1D profiles (univariate) in the margins. This plot is a convenience class that wraps JointGrid.
Many more visualization plots are available in seaborn and matplotlib, but these are the most widely used ones. I have also attached my Jupyter notebook on GitHub; have a look.
Thanks for reading. :)
And leave a 💙 if this was a good read. Enjoy!