Feature Engineering Methods with Code Examples

Haitian Wei · Published in Analytics Vidhya · Oct 31, 2019 · 4 min read

Introduction

Feature engineering is a very important aspect of machine learning and a solid way to get the most from your data. It’s usually more effective than searching for the best model and hyperparameters.

I recently found a very good tutorial on feature engineering provided by Kaggle. So this article is pretty much what I learned from that tutorial and some other sources, which I list in the reference section.

Categorical Features

Encode Categorical Features

To convert categorical features to numerical values, we can use either factorization (label encoding) or one-hot encoding.

  • Factorize

Factorizing means replacing each unique value with an integer. We can use either Pandas factorize or scikit-learn’s LabelEncoder to achieve the same result.
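With Pandas it looks like this (a minimal sketch; the df DataFrame and the category column name are just for illustration):

import pandas as pd

# Replace each unique value in the column with an integer code
df['category_factorized'], uniques = pd.factorize(df['category'])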

Or, with scikit-learn’s LabelEncoder:

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# categorical_cols is a list of categorical column names;
# fit_transform is applied to each column in turn
df[categorical_cols] = df[categorical_cols].apply(encoder.fit_transform)
  • One Hot Encoding

In the case of one-hot encoding, each feature is expanded into as many binary columns as it has unique values.
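A minimal sketch with pandas get_dummies (the column names are assumptions for illustration):

import pandas as pd

# Each column is expanded into one 0/1 indicator column per unique value
one_hot = pd.get_dummies(df[['category', 'country']])
df = df.join(one_hot)

Keep in mind that one-hot encoding can blow up the number of columns when a feature has many unique values, so it is usually reserved for low-cardinality features.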

Interactions

One simple method to generate new categorical features is to ‘add’ two categorical features, i.e. concatenate their values, as in the sketch below.
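A minimal sketch, assuming the same ks DataFrame and the category and country columns used elsewhere in this article:

from sklearn.preprocessing import LabelEncoder

# Concatenate the string values of two categorical columns,
# then label-encode the combined feature
interactions = ks['category'] + "_" + ks['country']
ks['category_country'] = LabelEncoder().fit_transform(interactions)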

Count Encoding of Categorical Features

Count encoding replaces each categorical value with the number of times it appears in the dataset.

We can simply use a groupby-transform-count approach or the category_encoders package to get this encoding.

import category_encoders as ce
cat_features = ['category', 'currency', 'country']
count_enc = ce.CountEncoder()
count_encoded = count_enc.fit_transform(ks[cat_features])

data = baseline_data.join(count_encoded.add_suffix("_count"))

# Train a model on the data with the count-encoded features added
train, valid, test = get_data_splits(data)
bst = train_model(train, valid)

Or, without the extra dependency, a plain groupby-transform-count does the same job. A minimal sketch, assuming the same ks DataFrame and cat_features list as above:
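# For each categorical column, map every value to how often it appears
for col in cat_features:
    ks[col + "_count"] = ks.groupby(col)[col].transform("count")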

Target Encoding

Target encoding replaces a categorical value with the average value of the target for that value of the feature.

However, with this method we need to be very careful about the target-leakage problem: the encodings should be learned from the training dataset only and then applied to the other datasets.

import category_encoders as ce
cat_features = ['category', 'currency', 'country']

# Create the encoder itself
target_enc = ce.TargetEncoder(cols=cat_features)

train, valid, _ = get_data_splits(data)

# Fit the encoder using the categorical features and target
target_enc.fit(train[cat_features], train['outcome'])

# Transform the features, rename the columns with _target suffix, and join to dataframe
train = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))
valid = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_target'))

train.head()
bst = train_model(train, valid)

Or we can use beta target encoding, which shrinks each category’s target mean toward the global mean, to further reduce leakage.
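A rough sketch of that idea for a binary target (the n_prior strength, column names, and helper name are assumptions for illustration, not a full implementation):

import pandas as pd

def smoothed_target_encode(train, col, target, n_prior=100):
    # Global positive rate acts as the prior mean
    prior = train[target].mean()
    stats = train.groupby(col)[target].agg(['sum', 'count'])
    # Posterior mean under a Beta prior of strength n_prior:
    # rare categories are pulled strongly toward the prior
    smoothed = (stats['sum'] + prior * n_prior) / (stats['count'] + n_prior)
    return train[col].map(smoothed)

# Hypothetical usage, fit on the training split only as with target encoding
train['category_target_smooth'] = smoothed_target_encode(train, 'category', 'outcome')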

CatBoost Encoding

According to this notebook:

CatBoost encoding is similar to target encoding in that it’s based on the target probability for a given value. However, with CatBoost encoding, the target probability for each row is calculated only from the rows before it.

And we can learn about it from the official document here. However, I haven’t figured out the example in the document, so I will leave it for now. If you understand it, please leave a note.
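For what it’s worth, category_encoders also ships a CatBoostEncoder that follows the same fit/transform pattern as the TargetEncoder above. A minimal sketch, assuming the same splits and columns and that the encoder is available in your version of the package:

import category_encoders as ce

cat_features = ['category', 'currency', 'country']
cb_enc = ce.CatBoostEncoder(cols=cat_features)

# Fit on the training split only, then transform both splits
cb_enc.fit(train[cat_features], train['outcome'])
train = train.join(cb_enc.transform(train[cat_features]).add_suffix('_cb'))
valid = valid.join(cb_enc.transform(valid[cat_features]).add_suffix('_cb'))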

There are lots of other methods provided in the category_encoders package which I haven’t gone through yet.

Group Features by Categorical Features

Another commonly used method is to group the data by a categorical feature and calculate aggregate values of other features. The pandas transform method can be really helpful in this case.
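A minimal sketch (the goal column and the new column names are assumptions for illustration):

# For every row, attach the mean and spread of 'goal' within that row's category
ks['category_goal_mean'] = ks.groupby('category')['goal'].transform('mean')
ks['category_goal_std'] = ks.groupby('category')['goal'].transform('std')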

Continuous Features

Transform

One very common technique for continuous features is to transform them with a function. Some models work better when the features are normally distributed, so it might help to transform skewed values such as the goal amounts in this dataset.

Common choices are the square root and the natural logarithm. These transformations can also help constrain outliers. However, they will not affect tree-based models, which only depend on the ordering of feature values.
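A minimal sketch on the goal column (log1p is used instead of log so that zero values don’t break; the column name is an assumption):

import numpy as np

# Compress the long right tail of 'goal'
ks['goal_sqrt'] = np.sqrt(ks['goal'])
ks['goal_log'] = np.log1p(ks['goal'])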

Rolling

For values with a time-series index, it’s a good idea to try some rolling features, such as the rolling mean, rolling count, and so on.
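A minimal sketch, assuming a DataFrame with a DatetimeIndex and a numeric value column (both names are illustrative):

# 7-day rolling mean and count over a datetime-indexed series
df = df.sort_index()
df['value_roll_mean_7d'] = df['value'].rolling('7D').mean()
df['value_roll_count_7d'] = df['value'].rolling('7D').count()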

Finally, Feature Selection

After we have created tons of new features, it’s crucial to select those that work best, both to increase computational efficiency and to reduce overfitting. I have written two articles about feature selection:

Next Level Feature Selection Method with Code Examples

Feature Selection Methods with Code Examples

Hopefully we will get a good model ^-^

References
