DATA Exploration & Preparation | Data Science | Machine Learning

Jaya Raghavendra

Dec 30, 2019

When to start Data Exploration

We start with the Hypothesis.

Data Exploration

Understand your data first!

1.Variable Identification
2.Univariate Analysis
3.Bivariate Analysis

1.Variable Identification

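Given the business problem, identify the predictor (input) and target (output) variables, and the data type and category (categorical or continuous) of each variable.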

2.Univariate Analysis

Explore the variables one by one. The method depends on the variable type (categorical or continuous), and the analysis is also used to highlight missing and outlier values.

Continuous variable :

1.Central tendency
2.Spread of variable

Categorical variable :

1.Frequency table
2.Bar chart
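
To make the two cases concrete, here is a minimal sketch using only the Python standard library. The `ages` and `cities` data are invented for illustration and are not from any real dataset.

```python
import statistics
from collections import Counter

# Hypothetical sample data, invented for illustration only.
ages = [22, 25, 25, 31, 38, 41, 47, 52]                         # continuous
cities = ["Pune", "Delhi", "Pune", "Mumbai", "Pune", "Delhi"]   # categorical

# Continuous variable: central tendency and spread.
age_summary = {
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "min": min(ages),
    "max": max(ages),
    "stdev": statistics.stdev(ages),  # sample standard deviation
}

# Categorical variable: frequency table (count per category).
city_counts = Counter(cities)

for name, value in age_summary.items():
    print(f"{name}: {value:.2f}")

# A text-only stand-in for a bar chart, one '#' per observation.
for city, count in city_counts.most_common():
    print(f"{city:<8}{'#' * count} ({count})")
```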

3. Bivariate Analysis

To know the relationship between two variables

Continuous & Continuous: With Scatterplot

Correlation = Covariance(X,Y) / SQRT( Var(X)* Var(Y))
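
The formula above can be written out directly. This is a small sketch in plain Python; the function name `pearson_correlation` and the `hours` / `scores` data are made up for the example.

```python
import math

def pearson_correlation(x, y):
    """Correlation = Covariance(X, Y) / sqrt(Var(X) * Var(Y))."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample covariance and variances (divide by n - 1).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    var_x = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    var_y = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical data: hours studied and exam score.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
r = pearson_correlation(hours, scores)
print(f"correlation: {r:.3f}")
```

The result always lies between -1 (perfect negative linear relationship) and +1 (perfect positive linear relationship), with 0 meaning no linear relationship.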

Categorical & Categorical: Chi-Square Test

To derive the statistical significance of the relationship between the categorical variables.

•Probability (p-value) close to 0: the two categorical variables are likely dependent.
•Probability (p-value) close to 1: the data are consistent with the two variables being independent.
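
As a rough sketch of how the test works, the code below computes the chi-square statistic for a 2x2 table by hand. The shortcut used for the probability is valid only for a 2x2 table (1 degree of freedom) and no continuity correction is applied; the function name and the counts are assumptions made for this example.

```python
import math

def chi_square_2x2(table):
    """Chi-square test of independence for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count we would expect if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected

    # With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Hypothetical counts: rows = gender, columns = bought / did not buy.
observed = [[30, 10],
            [10, 30]]
chi2, p_value = chi_square_2x2(observed)
print(f"chi-square = {chi2:.2f}, probability = {p_value:.6f}")
```

For larger tables the probability has to come from the chi-square distribution with (rows - 1) x (columns - 1) degrees of freedom, which a statistics library would normally provide.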

Categorical & Continuous: Z-Test/ T-Test
To assess whether the means of two groups are statistically different from each other.

•The smaller the probability of Z, the more significant the difference between the two averages.
•The T-test is very similar to the Z-test, but it is used when the number of observations in each category is less than 30.
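
Here is a minimal sketch of a two-sample Z-test with the standard library, assuming each group has at least 30 observations. The function name and both groups are invented for illustration.

```python
import math
import statistics
from statistics import NormalDist

def two_sample_z_test(a, b):
    """Return (z, two-sided probability) for the difference of two means."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # Standard error of the difference between the two means.
    se = math.sqrt(statistics.variance(a) / len(a)
                   + statistics.variance(b) / len(b))
    z = (mean_a - mean_b) / se
    # Two-sided probability under the standard normal distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical spend per visit for two customer groups (40 values each).
group_a = [50 + (i % 7) for i in range(40)]
group_b = [53 + (i % 7) for i in range(40)]

z, p_value = two_sample_z_test(group_a, group_b)
print(f"z = {z:.2f}, probability = {p_value:.6f}")
```

A T-test uses the same kind of statistic but compares it with the t distribution instead of the normal distribution, which matters for small samples. That part is not shown here.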

Data Preparation

How should your data look for modeling?

1.Missing values treatment
2.Outlier treatment
3.Variable transformation
4.Variable creation

1.Missing Values Treatment

How do missing values affect the analysis?

Reasons for missing values :
Missing completely at random
Missing that depends on unobserved predictors

Handling missing values:

1. Deletion: as a rule of thumb, if the missing values make up less than 5% of your data, delete those observations.
2. Mean/ Mode/ Median Imputation: replace missing values with the mean or median (numeric attribute) or the mode (categorical attribute).
3. KNN Imputation: a missing value is filled in from the observations most similar to the one with the gap, chosen with a distance function. It is very time-consuming on a large database.
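
A small sketch of mean, median, and mode imputation in plain Python. Missing values are represented as `None`; the `impute` helper and the sample columns are assumptions for this example, and KNN imputation is not covered.

```python
import statistics

def impute(values, strategy):
    """Replace None entries using the mean, median, or mode of the rest."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

# Hypothetical columns with gaps.
incomes = [40, None, 55, 90, None, 45]   # numeric attribute
segments = ["A", "B", None, "A"]         # categorical attribute

# Share of missing values, useful when deciding between deletion and imputation.
missing_share = incomes.count(None) / len(incomes)
print(f"missing: {missing_share:.0%}")

print(impute(incomes, "mean"))
print(impute(incomes, "median"))
print(impute(segments, "mode"))
```

The median is less affected by extreme values than the mean, which is why the two results differ for `incomes`.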

Imputation of Missing Data, How?

2.Outlier Treatment

An outlier is an observation that lies far away from, and diverges from, the overall pattern in a sample.

•Outliers increase the error variance and reduce the power of statistical tests
•If the outliers are non-randomly distributed, they can decrease normality

How to detect outliers? Visualizations such as box plots, histograms, and scatter plots are the best methods to identify outliers.
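
Alongside a plot, a numeric check can help. The sketch below uses the 1.5 x IQR rule, which is a common convention and an assumption here, not something this article prescribes. The function name and the data are made up.

```python
import statistics

def find_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartiles
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical delivery times in minutes, with one suspicious value.
delivery_minutes = [12, 13, 14, 15, 15, 16, 17, 18, 95]
print(find_outliers(delivery_minutes))
```

Different tools compute quartiles in slightly different ways, so borderline points may be flagged by one tool and not by another.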

How to treat outliers?

1. Deleting observations: if only a very small number of observations are outliers, delete them.
2. Treating separately: if there is a significant number of outliers, handle them as a separate group.
3. Transforming: apply a transformation to reduce the variation caused by extreme values.
4. Binning values: group a number of more or less continuous values into a smaller number of “bins”.

Example: if you have data about a group of people, you might want to arrange their ages into a smaller number of age intervals
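
The age example can be sketched in a few lines. The bin edges and labels below are arbitrary choices made for illustration.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bin edges: each edge is the lower bound of the next bin.
BIN_EDGES = [18, 30, 45, 60]
BIN_LABELS = ["<18", "18-29", "30-44", "45-59", "60+"]

def age_bin(age):
    """Return the label of the interval that contains `age`."""
    return BIN_LABELS[bisect_right(BIN_EDGES, age)]

ages = [15, 18, 22, 29, 30, 44, 59, 60, 95]
binned = [age_bin(a) for a in ages]
print(Counter(binned))
```

Note how the extreme age of 95 simply falls into the "60+" bin, so it no longer stands out as an extreme value.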

3.Feature Engineering

The predictive power of many models improves significantly with the application of feature engineering.

What is Feature Engineering?
Extracting more information from existing data: no new data is added, but the data we already have is made more useful.

Business Problem: Footfall reduction in a shopping mall?

Process of Feature Engineering

•Variable transformation
•Variable / Feature creation
*Spend most of your time on these two processes*

Variable transformation

Why Transform Your Data?
Some machine learning algorithms require the data to be in a specific form.

Transformations Varieties:

How to transform?

Data Pre-Processing Methods

Which algorithms need data transformation:

Instance-based methods: need scaled data
(KNN and LVQ; support vector machines and neural networks are also sensitive to scale)

Regression methods: benefit from standardized data
Transformation is not so useful for tree and rule-based methods.

1.Scale Data
The scale transform calculates the standard deviation of an attribute and divides each value by that standard deviation.

Scaling alone gives the attribute a standard deviation of 1. The values become z-scores only when the data is also centered, which is the standardization described under Standardize Data.

2.Center Data
The center transform calculates the mean for an attribute and subtracts it from each value.

Find the mean of Sepal length
Subtract that mean from each record of Sepal length

3.Standardize Data

Combining the scale and center transforms will standardize your data. Attributes will have a mean value of 0 and a standard deviation of 1.

Each standardized value tells you how many standard deviations the original value lies from the average.
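
The scale, center, and standardize transforms can be sketched as three small functions. The sample standard deviation (n - 1) is used, which is an assumption; the five Sepal length values are only an illustrative sample.

```python
import statistics

def center(values):
    """Subtract the mean from each value."""
    mean = statistics.mean(values)
    return [v - mean for v in values]

def scale(values):
    """Divide each value by the standard deviation."""
    sd = statistics.stdev(values)  # sample standard deviation
    return [v / sd for v in values]

def standardize(values):
    """Center, then scale: the results are z-scores."""
    return scale(center(values))

# Illustrative Sepal length values.
sepal_length = [5.1, 4.9, 4.7, 4.6, 5.0]

z_scores = standardize(sepal_length)
print([round(z, 3) for z in z_scores])
print("mean:", round(statistics.mean(z_scores), 6))
print("stdev:", round(statistics.stdev(z_scores), 6))
```

Centering does not change the standard deviation, so it is safe to compute the standard deviation after centering.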

4.Normalize Data

Data values can be scaled into the range 0 to 1: subtract the attribute's minimum from each value and divide by the range (maximum minus minimum).
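
A minimal sketch of scaling values into the range 0 to 1, using the minimum and maximum of the attribute (often called min-max normalization). The function name and data are assumptions for this example.

```python
def normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal, the range is zero")
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical attribute values.
print(normalize([10, 20, 30, 50]))
```

Because the minimum and maximum drive the result, a single extreme value can squeeze all other values into a narrow part of the range.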

Power Transforms

1. Box-Cox Transform

When it is used:
When the data is skewed and you want to bring its distribution closer to normal.

Note: it assumes all values are positive.
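
For reference, the Box-Cox formula itself is short. In this sketch the power parameter `lam` is picked by hand, which is a simplification: in practice it is usually estimated from the data. The function name and values are made up.

```python
import math

def box_cox(values, lam):
    """Box-Cox transform: (x**lam - 1) / lam, or log(x) when lam == 0."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1) / lam for v in values]

# Hypothetical right-skewed values.
skewed = [1, 4, 9, 100]
print(box_cox(skewed, 0.5))  # square-root-like transform
print(box_cox(skewed, 0))    # log transform
```

Smaller values of `lam` pull in the long right tail more strongly.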

2. Yeo-Johnson Transform

Same as the Box-Cox transform, but it also accepts zero and negative values.
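
A sketch of the Yeo-Johnson formula for a single value, again with a hand-picked `lam` as a simplifying assumption. Non-negative and negative inputs use different branches, which is what lets it handle zero and negative values.

```python
import math

def yeo_johnson(x, lam):
    """Yeo-Johnson transform of one value for a given power parameter."""
    if x >= 0:
        if lam == 0:
            return math.log1p(x)                 # log(x + 1)
        return ((x + 1) ** lam - 1) / lam
    if lam == 2:
        return -math.log1p(-x)                   # -log(-x + 1)
    return -(((-x + 1) ** (2 - lam)) - 1) / (2 - lam)

# Hypothetical values including zero and negatives.
values = [-3.0, -1.0, 0.0, 1.0, 8.0]
print([round(yeo_johnson(v, 0.5), 3) for v in values])
```

With `lam` equal to 1 the transform leaves every value unchanged, which is a handy sanity check.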

Important points to keep in mind
1. Don’t skip this step, as it has a huge impact on the accuracy of your final models.
2. First view your data with the summary function summary().
3. It is a good idea to visualize the distribution of your data.

Variable Creation

Creation of new variables based on existing variable(s) that may have a better relationship with the target variable.

Creating dummy variables:
Used for regression analysis with categorical variables.

Convert categorical variables into numerical variables in a way that gives every category equal importance.
For a categorical variable with n classes (more than two), create n or n-1 dummy variables.
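
A short sketch of creating n or n-1 dummy variables in plain Python. The `make_dummies` helper, the `drop_first` flag, and the `is_` column prefix are names invented for this example.

```python
def make_dummies(values, drop_first=False):
    """Turn one categorical column into 0/1 dummy columns.

    With drop_first=True the first category (in sorted order) is dropped,
    giving n-1 dummies; a row of all zeros then means that category.
    """
    categories = sorted(set(values))
    if drop_first:
        categories = categories[1:]
    return [{f"is_{c}": int(v == c) for c in categories} for v in values]

# Hypothetical categorical variable with three classes.
colors = ["red", "green", "blue", "green"]

for row in make_dummies(colors):                   # n dummy variables
    print(row)
for row in make_dummies(colors, drop_first=True):  # n-1 dummy variables
    print(row)
```

Using n-1 dummies avoids a redundant column in regression, since the dropped category is implied when all the others are 0.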