Min Max Scaler

Ranjit maity
Jul 24, 2021

Introduction →

Many machine learning algorithms perform better when numerical input variables are scaled to a standard range.

This includes algorithms that use a weighted sum of the input, like linear regression, and algorithms that use distance measures, like k-nearest neighbours.

The two most popular techniques for scaling numerical data prior to modeling are normalization and standardization. Normalization scales each input variable separately to the range 0–1, which is the range for floating-point values where we have the most precision. Normalization is nothing but min-max scaling.

In this tutorial, you will discover how to use scaler transforms to normalize numerical input variables for classification and regression.

After completing this tutorial, you will know:

  • Data scaling is a recommended pre-processing step when working with many machine learning algorithms.
  • Data scaling can be achieved by normalizing or standardizing real-valued input and output variables.
  • How to apply normalization to improve the performance of predictive modelling algorithms.

The Scale of Your Data Matters

Machine learning models learn a mapping from input variables to an output variable.

As such, the scale and distribution of the data drawn from the domain may be different for each variable.

Input variables may have different units (e.g. feet, kilometres, and hours) that, in turn, may mean the variables have different scales.

Differences in the scales across input variables may increase the difficulty of the problem being modelled. An example of this is that large input values (e.g. a spread of hundreds or thousands of units) can result in a model that learns large weight values. A model with large weight values is often unstable, meaning that it may suffer from poor performance during learning and sensitivity to input values resulting in higher generalization error.

Numerical Data Scaling Methods

Both normalization and standardization can be achieved using the scikit-learn library.

Let’s take a closer look.

Data Normalization

Normalization is a rescaling of the data from the original range so that all values are within the new range of 0 and 1.

Normalization requires that you know or are able to accurately estimate the minimum and maximum observable values. You may be able to estimate these values from your available data.

A value is normalized as follows:

  • y = (x - min) / (max - min)

Here:

  • y → the output value after scaling
  • x → the input value
  • min → the minimum value of the column
  • max → the maximum value of the column

The minimum and maximum values pertain to the column containing the value x being normalized.

For example, for a dataset, we could guesstimate the min and max observable values as -10 and 30. We can then normalize any value, like 18.8, as follows:

  • y = (x - min) / (max - min)
  • y = (18.8 - (-10)) / (30 - (-10))
  • y = 28.8 / 40
  • y = 0.72

You can see that if an x value is provided that is outside the bounds of the minimum and maximum values, the resulting value will not be in the range of 0 and 1. You could check for these observations prior to making predictions and either remove them from the dataset or limit them to the pre-defined maximum or minimum values.
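
For example, a minimal sketch of limiting (clipping) such a value, assuming the min and max of -10 and 30 from the example above:

# clip a normalized value into [0, 1] when x falls outside the fitted bounds
from numpy import clip
x = 42.0                          # outside the [-10, 30] range used for scaling
y = (x - (-10)) / (30 - (-10))    # = 1.3, which falls outside [0, 1]
y = clip(y, 0.0, 1.0)             # limited to the pre-defined maximum -> 1.0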

You can normalize your dataset using the scikit-learn object MinMaxScaler.

Good practice usage with the MinMaxScaler and other scaling techniques is as follows:

  • Fit the scaler using available training data. For normalization, this means the training data will be used to estimate the minimum and maximum observable values. This is done by calling the fit() function.
  • Apply the scale to training data. This means you can use the normalized data to train your model. This is done by calling the transform() function.
  • Apply the scale to data going forward. This means you can prepare new data in the future on which you want to make predictions, again by calling the transform() function (see the sketch after this list).
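
A minimal sketch of these three steps, assuming hypothetical X_train and X_test arrays:

# fit the scaler on training data only, then reuse it on new data
from numpy import asarray
from sklearn.preprocessing import MinMaxScaler
X_train = asarray([[4.0], [50.0], [100.0]])   # hypothetical training data
X_test = asarray([[8.0], [88.0]])             # hypothetical future data
scaler = MinMaxScaler()
scaler.fit(X_train)                           # estimate min and max from training data
X_train_scaled = scaler.transform(X_train)    # scale the training data
X_test_scaled = scaler.transform(X_test)      # apply the same scale going forward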

The default scale for the MinMaxScaler is to rescale variables into the range [0,1], although a preferred range can be specified via the “feature_range” argument as a tuple containing the min and the max for all variables.
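
For example, a minimal sketch (with made-up data) of rescaling into the range [-1, 1]:

# normalization into a custom range via feature_range
from numpy import asarray
from sklearn.preprocessing import MinMaxScaler
data = asarray([[100, 0.001], [8, 0.05], [4, 0.1]])
scaler = MinMaxScaler(feature_range=(-1, 1))  # min and max of the target range
print(scaler.fit_transform(data))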

We can demonstrate the usage of this class by converting two variables to a range 0-to-1, the default range for normalization. The first variable has values between about 4 and 100; the second has values between about 0.001 and 0.1.

The complete example is listed below (Python 3).

# example of a normalization
from numpy import asarray
from sklearn.preprocessing import MinMaxScaler
# define data
data = asarray([[100, 0.001],
                [8, 0.05],
                [50, 0.005],
                [88, 0.07],
                [4, 0.1]])
print(data)
# define min max scaler
scaler = MinMaxScaler()
# transform data
scaled = scaler.fit_transform(data)
print(scaled)

Running the example first reports the raw dataset, showing 2 columns with 5 rows. The values are in scientific notation, which can be hard to read if you’re not used to it.

Next, the scaler is defined, fit on the whole dataset and then used to create a transformed version of the dataset with each column normalized independently. We can see that the largest raw value for each column now has the value 1.0 and the smallest value for each column now has the value 0.0.

output ↓

[[1.0e+02 1.0e-03]
[8.0e+00 5.0e-02]
[5.0e+01 5.0e-03]
[8.8e+01 7.0e-02]
[4.0e+00 1.0e-01]]
[[1. 0. ]
[0.04166667 0.49494949]
[0.47916667 0.04040404]
[0.875 0.6969697 ]
[0. 1. ]]

You can also implement the same transform manually on your own, in the following way (Python 3); the input DataFrame df2 used here is a hypothetical example chosen to reproduce the output shown.

import numpy as np
import pandas as pd

def normalize(df):
    # rescale each column to the range [0, 1] independently
    for col in df.columns:
        min_ = np.min(df[col])
        max_ = np.max(df[col])
        df[col] = (df[col] - min_) / (max_ - min_)
    return df

# hypothetical input, chosen so that the output below is reproduced
df2 = pd.DataFrame({'a': [116, 4, 13], 'b': [0.40, 0.05, 0.01]})
print(normalize(df2))

output →
          a         b
0  1.000000  1.000000
1  0.000000  0.102564
2  0.080357  0.000000

Common Question

Q. When should I normalize or standardize?

Ans → Whether input variables require scaling depends on the specifics of your problem and of each variable.

You may have a sequence of quantities as inputs, such as prices or temperatures.

If the distribution of the quantity is normal, then it should be standardized, otherwise, the data should be normalized. This applies if the range of quantity values is large (10s, 100s, etc.) or small (0.01, 0.0001).
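
As an illustration of this rule of thumb (not code from the article), a hypothetical pick_scaler helper using scipy’s normaltest:

# choose a scaler per variable based on a normality test (illustrative heuristic)
from numpy.random import default_rng
from scipy.stats import normaltest
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def pick_scaler(values, alpha=0.05):
    # null hypothesis of normaltest: the sample is drawn from a normal distribution
    stat, p = normaltest(values)
    return StandardScaler() if p > alpha else MinMaxScaler()

rng = default_rng(1)
print(pick_scaler(rng.normal(size=100)))    # likely StandardScaler()
print(pick_scaler(rng.uniform(size=100)))   # likely MinMaxScaler()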

If the quantity values are small (near 0–1) and the distribution is limited (e.g. standard deviation near 1), then perhaps you can get away with no scaling of the data. Predictive modelling problems can be complex, and it may not be clear how to best scale input data.

If in doubt, normalize the input sequence. If you have the resources, explore modelling with the raw data, standardized data, and normalized data and see if there is a beneficial difference in the performance of the resulting model.


Ranjit maity

Certified Data Scientist; Data Scientist by profession; part time → Data Engineer || ETL || Big Data || AI