Part-2 : Data preparation made easy with python!!

Neha Kushwaha
Published in Analytics Vidhya
7 min read · Mar 21, 2020

In continuation of my earlier article, Part-1 : Data preparation made easy with python!!, let's dig a bit more into EDA and see how we can process numerical features and outlier values in a dataset. So what are we waiting for? Let's get started.

Step 5 : Handling the numerical data

  1. Scaling

Well, let me first explain why scaling the data is important. You understand that 1 kg is the same as 1000 g, or that 1 km is the same as 1000 m, but your machine doesn't. Suppose one feature is denoted in kg while another is in g: your system just sees them as numbers and processes them as such, so it's our job to make the machine understand the importance of units. Since many of our algorithms rely on distance calculations, scaling puts every feature on an equal footing. After this topic, go back to your data, check whether it is scaled, and if not, scale it and observe the difference in algorithm performance. Let's see the types of scaling with Python code.

Data comparison using various scales.

Along with the theory, let's also practice the scaling techniques below on a dataset of the top 50 popular songs. It has features like Beats.Per.Minute, which ranges from 85 to 190, while a feature like Acousticness ranges from 1 to 75; if we are going to study these two features together, a better comparison would be on the same scale.

Top 10 rows of dataset

#Method 1: Standardization

Standardization transforms the data by centering it, removing the mean value of each feature, and then scaling it by dividing (non-constant) features by their standard deviation. The result is that each feature has a mean of 0 and a standard deviation of 1; it works best when the data within each feature is roughly normally distributed.

You can check the data's distribution using the seaborn library, but there are many other ways as well, which you are free to explore. Below is a plot made with seaborn.

normal distribution plot

The diagonal of the plot shows histogram distribution plots; most of the features don't look normally distributed. You can apply a log transform to the data, which sometimes helps reshape it; do explore other options for reshaping a distribution and making it closer to normal before you standardize. For now, let's just standardize the numeric data.

Standard scaling
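A minimal sketch of the standardization step with scikit-learn's StandardScaler, using made-up numbers in place of the songs data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales, as in the songs data:
# beats per minute (85-190) and acousticness (1-75).
X = np.array([[117.0, 12.0],
              [190.0, 75.0],
              [ 85.0,  1.0],
              [136.0, 33.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

# Each column now has mean ~0 and standard deviation ~1.
print(X_std.mean(axis=0))  # ~[0, 0]
print(X_std.std(axis=0))   # ~[1, 1]
```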

#Method 2: Normalization

Normalization, in scikit-learn's terminology, is the process of scaling individual samples (rows) to have unit norm. The term is also often used for Min-Max scaling, which shrinks the range of each feature's values to between 0 and 1. Min-Max scaling works well when the distribution is not Gaussian or the standard deviation is quite small.

It is highly sensitive to outliers, so before using this scaling technique, your data is recommended to be outlier-free.

You can also normalize around the mean by simply replacing X − minimum in the numerator with X − mean.

Formula: normalization using min-max, X_scaled = (X - X_min) / (X_max - X_min)

Let's normalize the data in Python using the normalize function and the MinMaxScaler class from sklearn.preprocessing.
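For instance, a small sketch with both MinMaxScaler (feature-wise scaling to [0, 1]) and normalize (row-wise scaling to unit norm), again on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, normalize

X = np.array([[117.0, 12.0],
              [190.0, 75.0],
              [ 85.0,  1.0],
              [136.0, 33.0]])

# Min-Max scaling: each feature (column) is rescaled to [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)
print(X_minmax.min(axis=0))  # [0. 0.]
print(X_minmax.max(axis=0))  # [1. 1.]

# normalize: each sample (row) is rescaled to unit L2 norm.
X_unit = normalize(X, norm="l2")
print(np.linalg.norm(X_unit, axis=1))  # [1. 1. 1. 1.]
```

Note the difference in direction: MinMaxScaler works column by column, while normalize works row by row.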

#Method 3: Robust Scaling

Its scaling is similar to Min-Max normalization, but it uses the interquartile range instead, which makes it robust to outliers. It removes the median and scales by the interquartile range, so extreme values have little influence and the scaling focuses on the part where the bulk of the data lies.

Python implementation for robust-Scaler.
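A minimal sketch with scikit-learn's RobustScaler; the made-up data includes one extreme value to show that the median/IQR-based scaling is barely influenced by it:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# The last sample is an extreme outlier; median/IQR scaling barely reacts to it.
X = np.array([[117.0], [190.0], [85.0], [136.0], [10000.0]])

# Center on the median, scale by the IQR (75th - 25th percentile).
X_robust = RobustScaler().fit_transform(X)
print(X_robust.ravel())  # the median value (136) maps to exactly 0
```

Had we used MinMaxScaler here, the outlier at 10000 would squash all the other values into a tiny sliver near 0.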

Step 6: Outliers removal

Outliers are one of the most commonly faced issues: they can easily produce misleading statistical results and knock down model performance if not taken care of. They are an ever-present part of data that data scientists see frequently, yet many aren't sure how to deal with them.

Outlier

According to Wikipedia, in statistics an outlier is a data point that differs significantly from other observations.

Moore and McCabe say an outlier is an observation that lies outside the overall pattern of a distribution.

Another statistical definition is an outlier is an observation that lies an abnormal distance from other values in a random sample from a population.

An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses.

Types of outliers

  1. Global outlier : A data point can be considered anomalous with respect to the entire dataset if its value lies far outside the rest of the data. For example, intrusion detection in computer networks.

2. Contextual outlier : When an individual data instance is anomalous in a specific context or condition (but not otherwise), e.g., time & location, temperature.

Global and contextual outlier

3. Collective outlier : When a collection of data points is anomalous with respect to the entire dataset, it is termed a collective outlier.

A human ECG report showing collective outlier

Outlier detection methods

  1. Statistical methods
  • Gaussian distribution method

If we know that the distribution of the data is Gaussian or Gaussian-like (in simple words: is your data normally distributed?), then we can use the standard deviation of the data values to identify outliers by applying a simple property of the normal distribution.

Gaussian distribution or normal distribution

For recap of normal distribution data coverage :

  • 1 Standard Deviation (1 SD) from the Mean: 68%
  • 2 Standard Deviations (2 SD) from the Mean: 95%
  • 3 Standard Deviations (3 SD) from the Mean: 99.7%

Most of the sample values are covered within 3 SD; a value falling outside it can be considered an outlier, an unlikely or rare event occurring in roughly 1 of 400 samples. The 3-SD cutoff is a general rule and can be increased or decreased depending on the problem at hand: if 99.9% data coverage is needed, consider 4 SD instead. Let's also see how we can code it.
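A sketch of the 3-SD rule on synthetic data (the normal sample and the two injected extreme values below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
# Mostly normal data, plus two injected extreme values.
data = np.concatenate([rng.normal(loc=50, scale=5, size=1000),
                       [5.0, 120.0]])

mean, std = data.mean(), data.std()
cutoff = 3 * std           # the 3-SD rule; use 4 * std for ~99.9% coverage
lower, upper = mean - cutoff, mean + cutoff

outliers = data[(data < lower) | (data > upper)]
cleaned = data[(data >= lower) & (data <= upper)]
print(f"found {outliers.size} outliers: {np.sort(outliers)}")
```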

This was a one-dimensional implementation, but what if the data lives in a hyperplane (n-dimensional data)? Well, the answer is simple: we follow the same approach. Imagine 2-D data forming an ellipse shape; choose a cut-off for the boundary, and any points lying outside the boundary are considered outliers.

  • Box-plot or Interquartile range(IQR) method

Not all data follows a normal distribution; in that case we can use the IQR method on the sample data. For the theory, refer to my Descriptive statistics article. Let's see a Python implementation example as well.
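A minimal sketch of the IQR method on made-up numbers, using the usual 1.5 * IQR fences that box plots draw:

```python
import numpy as np

data = np.array([85, 95, 100, 105, 110, 115, 120, 130, 300])  # 300 looks suspicious

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1

# The standard fence: 1.5 * IQR beyond the quartiles
# (the same rule box-plot whiskers use).
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower) | (data > upper)]
print(outliers)  # [300]
```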

2. Machine learning algorithms

  • DBSCAN

It is a density-based clustering approach, very commonly used on unlabeled data, and not an outlier detection method per se, but its approach lends itself to one: it forms clusters based on a radius and a neighbor count, and any point outside every cluster is considered an outlier. Core points (points that have a minimum number of points in their surrounding) and points that are close enough to those core points together form a cluster.

  • The algorithm has two parameters (eps: the length scale, and min_samples: the minimum number of samples required in a point's neighborhood for it to be a core point). Finding a good eps is critical.
DBSCAN algorithm pictorial representation
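A sketch of DBSCAN-based outlier flagging with scikit-learn, on synthetic data; points the algorithm cannot assign to any cluster receive the label -1:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Two dense blobs plus two far-away stragglers.
blob1 = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(30, 2))
blob2 = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(30, 2))
stragglers = np.array([[10.0, -10.0], [-8.0, 9.0]])
X = np.vstack([blob1, blob2, stragglers])

db = DBSCAN(eps=1.0, min_samples=5).fit(X)

# Points DBSCAN could not assign to any cluster are labeled -1 (noise).
outlier_mask = db.labels_ == -1
print(X[outlier_mask])
```

The eps and min_samples values above were chosen by hand for this toy data; on real data, finding a good eps (e.g. from a k-distance plot) is the hard part.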

Coming up next: the final Part-3

  • Handling categorical data with 8 different encoding techniques like label, one-hot, target, and many more.

Stay tuned for the final Part-3. I hope you enjoyed this, and any suggestions are most welcome. Happy learning till then.
