Feature Engineering in Machine Learning (Part 2)

Feature Scaling or Normalization

Sogo Ogundowole
Hacktive Devs
3 min read · Mar 22, 2019


Intro

In part one of this series, we looked at Handling Numeric Data with Binning. Kindly check it out if you have not read it yet. Feature engineering, as introduced in the first part, is the process of tweaking the data to suit what we plan to achieve.

In this article, we will be looking at another feature engineering process called Feature Scaling or Normalization.

Feature Scaling

Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data preprocessing step. It involves changing the scale of features in a dataset.

The reason for this is to bring features that sit on very different scales onto a comparable range, so that large values or outliers do not dominate the model.

An outlier is a value that is very different from the rest of the data in your dataset. Outliers can skew your results: they often have a significant effect on the mean and standard deviation.
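
As a quick illustration (a hypothetical sketch with made-up numbers, using NumPy), adding a single extreme value to a small sample noticeably shifts the mean and inflates the standard deviation:

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12])        # typical measurements (made-up)
with_outlier = np.append(values, 100)          # the same data plus one outlier

print(values.mean(), values.std())             # ~11.6 and ~1.0
print(with_outlier.mean(), with_outlier.std()) # ~26.3 and ~33.0
```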

Listed below are several common scaling operations, each resulting in a different distribution of feature values:

  • Min-max Scaling
  • Standardization (Variance Scaling)
  • L² Normalization

Min-max Scaling

The formula for min-max scaling is:

X̄ = (X - min(X)) / (max(X) - min(X))

Scaling occurs one feature at a time. Let X be an individual feature value (i.e. a value of the feature in some data point), let min(X) and max(X) be the minimum and maximum of this feature over the entire dataset, and let X̄ be the scaled value. Min-max scaling squeezes (or stretches) all feature values into the range [0, 1].

Source: Mastering Feature Engineering
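
Below is a minimal sketch of min-max scaling for a single feature column, using NumPy; the values are invented purely for illustration:

```python
import numpy as np

# one feature column with made-up values
x = np.array([50.0, 75.0, 100.0, 150.0, 200.0])

# min-max scaling: squeeze (or stretch) the values into [0, 1]
x_scaled = (x - x.min()) / (x.max() - x.min())

print(x_scaled)  # approximately [0.  0.167  0.333  0.667  1.]
```

scikit-learn's MinMaxScaler applies the same transformation column by column.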

Standardization (Variance Scaling)

Feature standardization is defined as:

X̄ = (X - mean(X)) / √var(X)

It subtracts the mean of the feature (over all data points) and divides by the standard deviation (the square root of the variance), hence it is also called “variance scaling.” The resulting scaled feature has a mean of 0 and a variance of 1. If the original feature has a Gaussian distribution, then the scaled feature is a standard Gaussian.

A Gaussian distribution (also known as a normal distribution) is a bell-shaped curve; during measurement it is often assumed that values follow a normal distribution, with measurements falling symmetrically above and below the mean value.

Source: Mastering Feature Engineering
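
As with min-max scaling, a minimal NumPy sketch (same made-up values) shows the effect; the scaled feature ends up with mean 0 and variance 1:

```python
import numpy as np

x = np.array([50.0, 75.0, 100.0, 150.0, 200.0])

# standardization: subtract the mean, divide by the standard deviation
x_standardized = (x - x.mean()) / x.std()

print(x_standardized.mean())  # ~0.0
print(x_standardized.var())   # 1.0
```

scikit-learn offers the same transformation as StandardScaler.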

L² Normalization

This technique normalizes by dividing the original feature value by what’s known as the L₂ norm, also known as the Euclidean norm. The formula for this method is:

X̄ = X / ∥X∥₂

The L² norm measures the length of the vector in coordinate space and can be computed from the following equation:

∥X∥₂ = √(X₁² + X₂² + … + Xₘ²)

That is, the square root of the sum of the squared values of the feature across all data points.

Source: Mastering Feature Engineering
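
A minimal sketch of L² normalization for the same made-up feature column; after scaling, the feature has unit Euclidean length:

```python
import numpy as np

x = np.array([50.0, 75.0, 100.0, 150.0, 200.0])

# L2 (Euclidean) norm of the feature across all data points
l2_norm = np.sqrt(np.sum(x ** 2))   # equivalent to np.linalg.norm(x)

x_normalized = x / l2_norm

print(np.sum(x_normalized ** 2))    # 1.0 — the scaled feature has unit L2 norm
```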

Conclusion

No matter which method is used, feature scaling divides the feature by a constant (known as the normalization constant); therefore, the shape of the single-feature distribution does not change. When the ranges of feature values vary widely, scaling helps make our features fit for good modeling.

Thanks for reading! 😊
