Tan Moy
Sep 8, 2018 · 2 min read

Ok, I see.
The point of normalization is to scale the features so that they all share a common nominal range. If one feature has a much broader range than the others, it will dominate the cost function, and the theta values found by gradient descent will be skewed toward it.
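To make the domination concrete, here is a small sketch (the data and coefficients are made up for illustration): with one feature in roughly [0, 1] and another in roughly [0, 1000], a single gradient-descent step on the mean-squared error is driven almost entirely by the wide-range feature.

```python
import numpy as np

# Hypothetical two-feature sample: x1 in roughly [0, 1], x2 in roughly [0, 1000]
rng = np.random.default_rng(0)
x1 = rng.uniform(0, 1, 100)
x2 = rng.uniform(0, 1000, 100)
X = np.column_stack([x1, x2])
y = 3 * x1 + 0.5 * x2 + rng.normal(0, 1, 100)

# One gradient step on MSE from theta = 0: gradient = (2/m) * X.T @ (X @ theta - y)
theta = np.zeros(2)
grad = (2 / len(y)) * X.T @ (X @ theta - y)
print(grad)  # the x2 component is orders of magnitude larger, so it dominates the update
```

Because the gradient component for x2 dwarfs the one for x1, any single learning rate either moves x2's theta too fast or x1's theta too slowly, which is exactly why a common range matters.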
Let’s say you divide the x2 values by 4 or 4.5; that only shrinks the range of x2. What happens to the other columns? Do you divide them by 4 or 4.5 too? If not, what do you divide them by so that all columns end up in a common range? And what if you have a large number of features? Will you find a common range by trial and error?
Remember that all the features will be taken into account while calculating the theta value, so you cannot just divide one column by 4 and another by 2. That will destroy the data.
There are several ways to deal with this. One is what we used here: standardization, which gives each feature zero mean (i.e. centered on zero) and unit variance. It brings every column of the data set into a comparable range, and it yields a particular set of theta values. In fact, any of the applicable feature-scaling methods in the Wikipedia article I linked will give similar values.
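Standardization itself is a one-liner; a minimal sketch with made-up numbers:

```python
import numpy as np

# Toy data: the second column has a far wider range than the first.
X = np.array([[1.0,  400.0],
              [2.0,  800.0],
              [3.0, 1200.0]])

# Standardize: subtract each column's mean, divide by its standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
print(X_std.mean(axis=0))  # ~[0, 0]
print(X_std.std(axis=0))   # [1, 1]
```

After this, both columns are centered on zero with unit variance, so neither one can dominate the gradient.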
That’s because we didn’t change the relationships in the data (which dividing only one column by a constant would do); we just rescaled everything in our sample so the features share a common range. If you can find numbers by hand that do the same, you will get similar theta values.
