Feature Extraction and Standardization

Scale your data for models that NEED IT!

Gaurika Tyagi
2 min read · Jul 20, 2020

Categorical Data

We cannot use textual data as-is for machine learning; we have to convert it to a numerical representation. This can be done in two ways:

  1. Label encoding: Assume a column of “yes”/“no” values. These can be mapped to 1 and 0. A column of 4 different values can be encoded as 0, 1, 2, 3, and so on. Be careful: this imposes an artificial ordering on categories that may have none.
  2. Getting dummies (one-hot encoding): for every value in a categorical feature, we create a new column with a value of 0 or 1. For example, Gender: “male”, “female”, “other” can be changed to Gender_male (with values 0/1) and so on. Here, we can create just 2 columns, for “male” and “female”, and implicitly let the model learn that “other” means Gender_male=0 and Gender_female=0. But always be careful about what you are trying to achieve.
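As a quick sketch of the dummy-encoding idea above, pandas provides `get_dummies`, and its `drop_first` option drops one category so the remaining columns imply it (the `Gender` data here is a hypothetical example):

```python
import pandas as pd

# Hypothetical data: a small frame with one categorical column
df = pd.DataFrame({"Gender": ["male", "female", "other", "female"]})

# drop_first=True keeps only 2 of the 3 dummy columns; the dropped
# category ("female", first alphabetically) is implied when the
# remaining dummies are all 0
dummies = pd.get_dummies(df["Gender"], prefix="Gender", drop_first=True)
print(dummies.columns.tolist())  # ['Gender_male', 'Gender_other']
```

With `drop_first=False` you would get one column per category instead, which is the full one-hot representation.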

Know your Data!

All preprocessing is fit only on the train data; the fitted transformation is then applied, unchanged, to the test data.
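A minimal sketch of that fit-on-train-only rule, using scikit-learn's `StandardScaler` on made-up arrays (`x_tr`/`x_te` are hypothetical placeholders for your own split):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test splits
x_tr = np.array([[1.0], [2.0], [3.0]])
x_te = np.array([[4.0], [5.0]])

scaler = StandardScaler()
scaler.fit(x_tr)                      # learn mean/std from TRAIN data only
train_scaled = scaler.transform(x_tr)
test_scaled = scaler.transform(x_te)  # reuse the train statistics; never refit on test
```

Refitting on test data would leak information from the test set into the preprocessing and make evaluation unrealistic.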

Image by Author: Change categorical data to numeric representation

If you noticed, I dropped some dummy columns that I knew would not add much information; it is okay for the model to infer those values implicitly.

Numerical Data- Standardization

I did not apply log normalization, even though we care about the relative changes in these columns, because my data contains 0s. The domain of log is strictly greater than 0: there is a vertical asymptote at x = 0, and as x approaches 0, log(x) approaches negative infinity. In other words, 0 is excluded from the domain.

But I can add 1 to all my data and then take the log (the log1p transform). This keeps 0s valid while preserving the relative spread. Cool, right?
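The add-1-then-log trick has a direct NumPy equivalent, `np.log1p`, sketched here on a hypothetical column that contains a zero:

```python
import numpy as np

# Hypothetical column containing a 0, where plain log would be undefined
x = np.array([0.0, 1.0, 9.0, 99.0])

# log1p computes log(1 + x), so 0 maps cleanly to 0
shifted = np.log1p(x)
print(shifted[0])  # log(1 + 0) = 0.0
```

`np.log1p` is also more numerically accurate than `np.log(x + 1)` for values of x very close to 0.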

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import PowerTransformer, StandardScaler

numeric = x_train.select_dtypes(include=['int', 'float'])
standardizer = PowerTransformer()
normalizer = StandardScaler()
# The +1 shift must be applied consistently in both fit and transform
standardizer.fit(numeric + 1)
transformed_interim = standardizer.transform(numeric + 1)
normalizer.fit(transformed_interim)
transformed = pd.DataFrame(normalizer.transform(transformed_interim),
                           columns=numeric.columns)
# Plot the data before and after the transformation
fig = plt.figure(figsize=(18.5, 7.5))
ax = fig.gca()
numeric.hist(ax=ax)
plt.show()
Image by Author: Original data Spread
fig = plt.figure(figsize = (18.5, 7.5))
ax = fig.gca()
transformed.hist(ax=ax)
plt.show()
Image by Author: Normalized Data Spread

Did you see how the spread of the data changed? Now you can use this dataset for machine learning. All the best!

