Normalize before Training

Sunner Li
Sunner Li
Jul 30, 2017 · 2 min read

In the previous article, I normalize the training data manually. However, after I check some kernels on Kaggle, some interesting method can be learned. This article records the methods.

Now we want to predict if the country has nuclear weapon. The above table is the data CSV file whose name is nuclear.csv. Each row represents one country. The first attribute is the number of soldiers in the particular country, and the second one is the total number of population. The last attribute represent if the country has nuclear weapon. 1 represents it has.

The first function that I think is very useful is train_test_split. This function can split the whole x and y into training data and testing data. The shape of x is [row_num, feature_num] while the shape of y is [row_num].

The second idea that is worth to learn is MinMaxScaler. This class can scale the column into the range between 0~1. The advantage of MinMaxScaler is that it can make the variance be more obvious.

The left side is a simple example. The different of 0.5 is relatively smaller than the mean (seven hundred thousand). However, the different will be obvious after MinMaxScaler works.

The MinMaxScaler is a common class that I can see on the Kaggle kernel. On the other hand, StandardScaler is another class that can help us arrange the data. it can makes the mean of each column become 0, and the variance of each column become 1.

However, both scalier have their own disadvantage. MinMaxScaler will control the value to be in the range of 0~1. If the testing data is out of range, it might get the incorrect predicted result. StandardScaler just normalize the column value. On the contrary, some result will be negative after the work. Some architecture (ex. ReLU) might be sensitive toward this property.

The last part is train and test the model! I use random forest to make the prediction. Even though the result is not good enough.

For the conclusion, we can use train_test_split to get the corresponding sets of data. MinMaxScaler and StandardScaler can help us normalize the feature, and use these data to do the further work!

Sunner Li

Written by

Sunner Li

我不是二次宅,但我因夢想而熱血! 雖然我不是天才,但我將繼續用我的毅力、熱血和能力,解決一些問題,幫助周遭朋友,甚至改變世界的一點點!

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade