Data Science: Scaling of Data in Python

Data plays a major role in data analytics and data science; it is the basis of every process in this ecosystem. This blog is going to talk about feature scaling: what is it? why do we need it? and how do we use it? I'll be using Python to show how it is used. To better understand this concept, I would recommend you take a look at my previous blog on Linear regression, as I will be using that concept as part of my explanation. So let's get started.
What is Feature Scaling?
Features are basically your column names, and the data in a given column all share the same kind of measurement. In your everyday conventional datasets, the features usually come in different quantitative measurements and different magnitudes. For example, a column named height will have data in cm (centimetres) while a column named weight will have data in kg (kilograms).
Scaling data is the process of increasing or decreasing its magnitude according to a fixed ratio; in simpler words, you change the size but not the shape of the data.
Why do we need to use feature scaling?
It is not mandatory to use feature scaling, but it definitely is a good practice. It helps handle disparities in units, and over long training processes it helps reduce computational expense.
In the machine learning ecosystem, it helps improve the performance of the model and keeps estimates from varying wildly simply because of differences in scale.
How do we perform feature scaling?
There are different types of feature scaling, which are:
- Centering
- Standardization
- Normalization
Let's work on them individually.
Centering :
The primary benefit of centering your predictor data in linear modeling is that the intercept then represents the estimate of the target when all predictors are at their mean value.
Without centering, the prediction at x = 0 is equal to the intercept, and that is sometimes not the most useful interpretation. This will become clearer from the Python example below.
Example:
We are going to use the baseball dataset, where the y value will be the weight and the x value will be the height. Now let's try fitting a linear model to this data.
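Here is a minimal sketch of that fit, assuming the data is loaded from a file called baseball.csv into a pandas DataFrame with height (in inches) and weight columns (the file and column names are assumptions, not the exact code from the original post):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical file name for the baseball dataset
baseball = pd.read_csv("baseball.csv")

X = baseball[["height"]]   # predictor: height in inches
y = baseball["weight"]     # target: weight

model = LinearRegression()
model.fit(X, y)

print("intercept:", model.intercept_)   # estimated weight when height = 0
print("slope:", model.coef_[0])         # change in weight per extra inch of height
```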
Let's try interpreting this: the intercept is -154. Does that mean that when a baseball player has a height of 0 inches, his weight will be -154? That doesn't sound right, does it? To eliminate this misinterpretation we use centering.
Now let's see how that works. Below is the equation that describes centering: Xc = x - mean(x), i.e. each individual value of x minus the mean of all the x values.

Let's apply that to our previous example and see if that helps.
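Continuing from the sketch above, centering could look something like this:

```python
from sklearn.linear_model import LinearRegression

# Center the predictor: subtract the mean so that x = 0 now means "average height"
baseball["height_centered"] = baseball["height"] - baseball["height"].mean()

model_centered = LinearRegression()
model_centered.fit(baseball[["height_centered"]], baseball["weight"])

# The slope stays the same; only the intercept moves.
# It is now the estimated weight at the average height.
print("intercept:", model_centered.intercept_)
print("slope:", model_centered.coef_[0])
```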


Let's try interpreting this: the intercept is 201, which is the estimated weight of a baseball player of average height (73.69 inches). That feels like a better interpretation than the previous one.
Note: In ordinary linear regression, centering and scaling your variables does not impact the amount of variance you can account for. This is because we are only shifting the distribution and adjusting its magnitude: the shape of the distribution does not change.
Standardization:
The most common method of scaling is standardization. In this method we center the data and then divide by the standard deviation, which enforces that the standard deviation of the variable is one:

Xstd = (x - mean(x)) / std(x)
Benefits (some of these concepts will be discussed later):
- Intercepts are interpreted as the estimate when all predictors are at their mean value.
- Coefficients are in units of standard deviations of the original predictors. This allows for direct comparison of the magnitude of impact between different predictors.
- Optimization methods (minimizing loss functions) are faster and more stable.
- It is required for regularization penalties where the magnitude of coefficients for different predictors must have the same meaning.
- In K-Nearest Neighbors methods it is necessary if you want features to contribute equally since these models use the distance between observations calculated from the features.
- K-means clustering is affected by the scale of the data, and standardizing the features will prevent variables from dominating simply based on their scale (see the short sketch after this list).
- In logistic regression, neural networks, and support vector machines unscaled data can result in a disproportionate effect of some data points over others.
Most of these benefits come into play in more advanced machine learning models.
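To make the distance-based points above concrete, here is a tiny illustration with made-up numbers of how a feature with a larger magnitude dominates Euclidean distance until the data is standardized:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical players described by height (inches) and weight (pounds)
X = np.array([
    [70.0, 170.0],
    [71.0, 230.0],   # close to the first player in height, far in weight
    [78.0, 172.0],   # far in height, close in weight
])

# Raw Euclidean distances from the first player: the weight column,
# with its larger numbers and larger spread, dominates the distance.
print(np.linalg.norm(X[1:] - X[0], axis=1))

# After standardizing, both features contribute on a comparable scale,
# and the ranking of nearest neighbours can change.
X_scaled = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_scaled[1:] - X_scaled[0], axis=1))
```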
Now let's try this in Python, using StandardScaler from sklearn:
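Here is a sketch of how that could look, assuming the same baseball.csv now also has an age column used as the target (again, the file and column names are assumptions):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

baseball = pd.read_csv("baseball.csv")   # hypothetical file name

X = baseball[["height", "weight"]]
y = baseball["age"]                      # target: age in years

# StandardScaler centers each column and divides by its standard deviation,
# so every predictor ends up with mean 0 and standard deviation 1.
X_scaled = StandardScaler().fit_transform(X)

model = LinearRegression()
model.fit(X_scaled, y)

# Each coefficient is the estimated change in age
# for a one-standard-deviation increase in that predictor.
print(dict(zip(X.columns, model.coef_)))
print("intercept:", model.intercept_)    # estimated age when all predictors are at their means
```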


Let's interpret this:
- For a 1 standard deviation increase in height, the model estimates a -9.5 year change in age.
- For a 1 standard deviation increase in weight, the model estimates a 1.18 year change in age.
This makes it easier to compare predictors measured in different units with each other.
Normalization:
Normalization most often refers to the process of "normalizing" a variable to lie between 0 and 1. Think of this as squishing the variable into a specific range. This is also called min-max scaling:

Xnorm = (x - min(x)) / (max(x) - min(x))
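A quick sketch with scikit-learn's MinMaxScaler, reusing the assumed baseball columns:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

baseball = pd.read_csv("baseball.csv")   # hypothetical file name

# MinMaxScaler applies (x - min(x)) / (max(x) - min(x)) to each column,
# squashing every feature into the [0, 1] range.
scaled = MinMaxScaler().fit_transform(baseball[["height", "weight"]])

print(scaled.min(axis=0))   # each column's minimum is now 0
print(scaled.max(axis=0))   # each column's maximum is now 1
```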
Benefits:
Typically standardization is preferred to min-max normalization. However, there are some applications where min-max scaling would be preferable:
- Neural networks often require their inputs to be bounded between 0 and 1.
- In images, for example, where pixels can only take on a specific range of RGB values, data may have to be normalized.

Concluding:
So these were some of the scaling methods used in data science. In my next blog I will get into a bit of the train-test split and classification in machine learning.

