Standardize or Normalize? — Examples in Python
A common misconception is between what it is — and when to —
standardize data versus
Let’s start with
Here your data
Z is rescaled such that any specific
z will now be
0 ≤ z ≤ 1, and is done through this formula:
Let’s to do this with python on a dataset you can quickly access.
from sklearn import preprocessing
import numpy as np
# Get dataset
df = pd.read_csv("https://storage.googleapis.com/mledudatasets/california_housing_train.csv", sep=",")
# Normalize total_bedrooms column
x_array = np.array(df['total_bedrooms'])
normalized_X = preprocessing.normalize([x_array])
Why would we normalize in the first place?
1. Normalization makes training less sensitive to the scale of features, so we can better solve for coefficients.
Consider the dataset above of housing prices in California, which have features such as the
number of bedrooms and the
median household income. Each have different units and scales, so consider these feature attributes when going through this article.
Let’s start by looking at both features without normalization.
We can see that there are some odd behaviours with both features (how can we have the number of
bedrooms be over 1000?) as well as massive outliers and binning issues. We also have a clustering of
income at $500,000 so the dataset probably puts anyone over that bracket into that bin. It’s going to be hard to equate both these features as they are right now.
Let see what
All the values are all now between 0 and 1, and the outliers are gone, but still remain visible within our normalized data. However, our features are now more consistent with each other, which will allow us to evaluate the output of our future models better.
2. The use of a normalization method will improve analysis from multiple models.
Additionally, if we were to use any algorithms on this data set before we normalized it would be hard (potentially not possible) to converge the vectors because of the scaling issues. Normalization makes the data better conditioned for convergence.
3. Normalizing will ensure that a convergence problem does not have a massive variance, making optimization feasible.
But wait … sometimes you might not want to
normalize your data.
The data provided is proportional, so
normalizing might not provide correct estimators. Or, the scale between your data features does matters so you want to keep in your dataset. You need to think about your data, and understand if the transformations you’re applying are in line with the outcomes you’re searching for.
Keep in mind, there is some debate stating it is better to have the input values centred around 0 — standardization — rather than between 0 and 1. So doing your research is important as well, so you understand what type of data is needed by your model.
So now that we know how and why to normalize, let’s move on to standardization.
Here your data
Z is rescaled such that
μ = 0 and
𝛔 = 1, and is done through this formula:
Why would we do this?
1. Compare features that have different units or scales.
Consider our data above with
income, both have different scales and units. We can start to compare these features and use them in our models once we have
Later, when you’re running models (logistic regression, SVMs, perceptrons, neural networks etc.) the estimated weights will update similarly rather than at different rates during the build process. This will give you more accurate results when the data has been first
Let see it in python:
from sklearn import preprocessing
# Get column names first
names = df.columns
# Create the Scaler object
scaler = preprocessing.StandardScaler()
# Fit your data on the scaler object
scaled_df = scaler.fit_transform(df)
scaled_df = pd.DataFrame(scaled_df, columns=names)
Looks like we’ve adjusted for all the outlier values in
income, and we have a much more normal distribution for each feature. It’s not perfect, but the data is in much better shape than it was when we ran our
normalization. It seems because of the large difference in scales and units,
standardizing is a better transformation for this data set.
2. Standardizing tends to make the training process well behaved because the numerical condition of the optimization problems is improved.
Consider if you’re doing PCA, the output can only be interpreted correctly when the features have first been centred around their means. Again, understanding what you want to achieve and the model you’ll be using, are necessary conditions to understanding different transformations decisions.
However, if you do standardize your data be warned you might be discarding some information. If that information is not needed, the process can be helpful else it will impede your results.
Bonus Thing: Binning
Let’s take a look at one more thing before we leave …
latitude feature in our dataset, which has a geo point of the area in question. How would we think about
normalizing this? We could do either, but there is a third alternative:
We’re going to made new columns for each
latitude range, and encode each value in our dataset with a
0 or 1 to see if it is within that
# Create range for your new columns
lat_range = zip(xrange(32, 44), xrange(33, 45))
new_df = pd.DataFrame()
# Iterate and create new columns, with the 0 and 1 encoding
for r in lat_range
new_df["latitude_%d_to_%d" % r] = df["latitude"].apply(
lambda l: 1.0 if l >= r and l < r else 0.0)
Now that we can
binned values, we have a binary value for each
latitude in California. With this additional approach, you have another way to clean your data and get it ready for modelling.
As always, I hoped this cleared up a few things and provided some concrete examples for you to work on.
At work we were discussing this as my boss has never heard of normalization. In Linear Algebra, Normalization seems to…stats.stackexchange.com