Techniques to Transform Data Distributions!!!

Abhigyan · Analytics Vidhya · 5 min read · Sep 20, 2020

In machine learning, many algorithms work on the assumption that the data is normally distributed. Not all algorithms make such assumptions, however; some do not need to know the type of data distribution beforehand and instead learn it directly from the training data.

Content:

  1. The Need for and Importance of Gaussian Distribution
    → What is Gaussian Distribution?
    → Need for Gaussian (Normal) Distribution?
    → Importance of Normality in Machine Learning!
  2. Need for Data Transformation!!
  3. Importance of Data Distribution Transformation.
  4. Different methods to Transform the Distribution.
    → The ladder of powers.
    → Box-Cox Transformation Method.
    → Yeo-Johnson Transformation Method.

Let’s have a look at the importance of Normality and ways to transform the distribution of our data.

The Need for and Importance of Gaussian Distribution

What is Gaussian Distribution?

  • Gaussian distribution, also known as the “Normal Distribution”, is a probability distribution that is symmetric about the mean, meaning that data near the mean occur more frequently than data far from the mean.
    → 68.2% of all values fall within the Mean ± 1 Standard Deviation.
    → 95.5% of all values fall within the Mean ± 2 Standard Deviations.
    → 99.7% of all values fall within the Mean ± 3 Standard Deviations.
  • A Gaussian (Normal) distribution is the proper term for a probability bell curve.
  • In the standard Gaussian (Normal) distribution, the mean is 0 and the standard deviation is 1. It has zero skew and a kurtosis of 3.
  • Gaussian (Normal) distributions are symmetrical, but not all symmetrical distributions are normal.
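
For reference, the whole distribution is pinned down by just the mean μ and the variance σ², via the density:

f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)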

Need for Gaussian (Normal) Distribution?

  • The normal distribution model is motivated by the Central Limit Theorem: when independent random variables are added, their properly normalized sum tends toward a normal distribution (informally, a bell curve), even if the original variables themselves are not normally distributed.
  • The nature of the distribution often leads to extremely efficient computation.
  • Because it makes the math simpler.
    The probability density function of the normal distribution is the exponential of a quadratic. Taking the logarithm (as you often do, because you want to maximize the log-likelihood) gives you a quadratic. Differentiating it (to find the maximum) gives you a set of linear equations, which are easy to solve analytically, as sketched below.
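
As a quick sketch of that algebra: the log-density is quadratic in μ, so maximizing the log-likelihood of n i.i.d. samples reduces to a linear equation:

\log p(x \mid \mu, \sigma^2) = -\frac{(x - \mu)^2}{2\sigma^2} - \log\left(\sigma\sqrt{2\pi}\right)

\frac{\partial}{\partial \mu} \sum_{i=1}^{n} \log p(x_i \mid \mu, \sigma^2) = \sum_{i=1}^{n} \frac{x_i - \mu}{\sigma^2} = 0 \;\Longrightarrow\; \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i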

Importance of Normality in Machine Learning!

  1. Gaussian distribution is found everywhere because sums and averages of independent samples with finite variance tend toward a Gaussian as the sample size grows.
  2. Datasets with Gaussian distributions are amenable to a variety of methods that fall under parametric statistics.
    Methods such as propagation of uncertainty and least-squares parameter fitting, which make a data scientist’s life easy, apply only to datasets with normal or normal-like distributions.
  3. Since the Gaussian (Normal) distribution is easy to explain, the intuition behind the conclusion and summary of a test can be easily conveyed to people with little statistical knowledge.
  4. The entire distribution is described by two numbers, the mean and the variance.
  5. Unlike many other distributions that change their nature under transformation, a Gaussian tends to remain a Gaussian (an empirical check of the second property below follows this list):
    * The product of two Gaussian density functions is proportional to a Gaussian
    * The sum of two independent Gaussian random variables is a Gaussian
    * The convolution of a Gaussian with another Gaussian is a Gaussian
    * The Fourier transform of a Gaussian is a Gaussian
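
As promised, a quick empirical check (a sketch, not a proof) that the sum of two independent Gaussians is Gaussian: here X ~ N(1, 2²) and Y ~ N(−3, 1²), so X + Y should be close to N(−2, 5):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)   # X ~ N(1, 2^2)
y = rng.normal(loc=-3.0, scale=1.0, size=100_000)  # Y ~ N(-3, 1^2)
s = x + y

# For independent Gaussians the means add and the variances add: X + Y ~ N(-2, 5)
print(f"sample mean: {s.mean():+.3f}  (expected -2.000)")
print(f"sample variance: {s.var():.3f}  (expected 5.000)")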

Need for Data Transformation!!

  • To more closely approximate a theoretical distribution that has nice statistical properties
  • To spread data out more evenly, making data distributions more symmetrical
  • To make relationships between variables more linear
  • To make data have more constant variance (homoscedastic)

Importance of Data Distribution Transformation

  • Machine learning algorithms tend to give better results (predictions) when the numerical data is normally distributed.
  • Normally distributed inputs help the cost function minimize the error of the predictions more effectively.

Different methods to Transform the Distribution

→ The ladder of powers

Data transformations are commonly power transformations, x′ = x^θ (where x′ is the transformed x).

  • If the data are right-skewed (clustered at lower values), move down the ladder of powers (that is, try square root, cube root, logarithmic, etc. transformations).
  • If the data are left-skewed (clustered at higher values), move up the ladder of powers (square, cube, etc.). The sketch after this list walks down the ladder for a right-skewed sample.
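
A minimal sketch of moving down the ladder, assuming a log-normal sample as an illustrative stand-in for right-skewed data:

import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=0.8, size=10_000)  # right-skewed sample

# Move down the ladder of powers and watch the skewness shrink toward 0
for name, x_t in [("raw", x),
                  ("square root", np.sqrt(x)),
                  ("cube root", np.cbrt(x)),
                  ("log", np.log(x))]:
    print(f"{name:12s} skewness = {skew(x_t):+.3f}")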

Box-Cox Transformation Method

  • The Box-Cox method is a data transform method that can perform a range of power transforms, including the log and the square root. The method is named for George Box and David Cox.
  • It can be configured to evaluate a suite of transforms automatically and select the best fit.
  • The resulting data sample may be more linear and may better approximate a target distribution such as the Gaussian.
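
Formally, the Box-Cox transform of a strictly positive value y with parameter λ is:

y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\
\ln y, & \lambda = 0
\end{cases}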

In SciPy:

The boxcox() SciPy function implements the Box-Cox method. It takes an argument, called lmbda (lambda), that controls the type of transform to perform; when it is left as None, the function picks the lambda that maximizes the log-likelihood and returns it alongside the transformed data.

  • lambda = -1.0 is a reciprocal transform.
  • lambda = -0.5 is a reciprocal square root transform.
  • lambda = 0.0 is a log transform.
  • lambda = 0.5 is a square root transform.
  • lambda = 1.0 is no transform.
from scipy.stats import boxcox

# y must contain strictly positive values; with lmbda=None the
# best-fitting lambda is estimated and returned as well
y, fitted_lambda = boxcox(y, lmbda=None)
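
A minimal end-to-end sketch, assuming an exponential sample as an illustrative stand-in for strictly positive, right-skewed data:

import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.default_rng(42)
y = rng.exponential(scale=2.0, size=1_000)  # strictly positive, right-skewed

y_bc, fitted_lambda = boxcox(y)  # lmbda=None by default
print(f"fitted lambda: {fitted_lambda:.3f}")
print(f"skewness before: {skew(y):+.3f}, after: {skew(y_bc):+.3f}")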

In scikit-learn:

from sklearn.preprocessing import PowerTransformer

# every value in data must be strictly positive for method='box-cox'
pt = PowerTransformer(method='box-cox')
data = pt.fit_transform(data)
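
Note that PowerTransformer also z-scores its output by default (standardize=True), so the transformed data comes back with roughly zero mean and unit variance; pass standardize=False to get the raw power transform.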

A limitation of the Box-Cox transform is that it assumes that all values in the data sample are strictly positive.

Yeo-Johnson Transformation Method

  • Unlike the Box-Cox transform, it does not require the values for each input variable to be strictly positive.
  • It supports zero values and negative values. This means we can apply it to our dataset without scaling it first.
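
For reference, the Yeo-Johnson transform extends Box-Cox with separate branches for negative inputs:

y^{(\lambda)} =
\begin{cases}
\dfrac{(y + 1)^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\ y \geq 0 \\
\ln(y + 1), & \lambda = 0,\ y \geq 0 \\
-\dfrac{(1 - y)^{2 - \lambda} - 1}{2 - \lambda}, & \lambda \neq 2,\ y < 0 \\
-\ln(1 - y), & \lambda = 2,\ y < 0
\end{cases}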

In SciPy:

from scipy.stats import yeojohnson

# y may contain zero and negative values; with lmbda=None the
# best-fitting lambda is estimated and returned as well
y, fitted_lambda = yeojohnson(y, lmbda=None)
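
And a minimal sketch on data that includes negative values, assuming a Gumbel sample as an illustrative right-skewed stand-in:

import numpy as np
from scipy.stats import yeojohnson, skew

rng = np.random.default_rng(7)
y = rng.gumbel(loc=-1.0, scale=2.0, size=1_000)  # right-skewed, with negative values

y_yj, fitted_lambda = yeojohnson(y)
print(f"fitted lambda: {fitted_lambda:.3f}")
print(f"skewness before: {skew(y):+.3f}, after: {skew(y_yj):+.3f}")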

In scikit-learn:

We can apply the transform by defining a PowerTransformer object and setting the “method” argument to “yeo-johnson”.

from sklearn.preprocessing import PowerTransformer
pt = PowerTransformer(method='yeo-johnson')
data = pt.fit_transform(data)
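
Worth noting: 'yeo-johnson' is PowerTransformer’s default method, so the argument could be omitted here, and as with Box-Cox the output is z-scored unless standardize=False is passed.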

Like my article? Do give me a clap and share it, as that will boost my confidence. Also, I post new articles every Sunday, so stay connected for future articles in the basics of data science and machine learning series.

Also, do connect with me on LinkedIn.

Photo by Markus Spiske on Unsplash
