Feature Engineering Cookbook for Machine Learning


When it comes to classic ML, feature engineering is one of the most important factors, if not the most important, for improving your scores and speeding up your model, before you even bother to tune or get fancy with the model itself.

There are not many resources or books out there that cover feature engineering in depth, so I wanted to compile a list of code snippets covering most of the techniques I found online and used over time that were critical to most of the projects I worked on. These techniques mostly apply to decision-tree and regression-type models (not deep learning).

My goal here was to give the coding examples, not to cover the ins and outs of how each technique works and how each impacts various model metrics. I am assuming you have already heard of most of these techniques and that you will experiment to discover what works best for each project you're working on.

If I missed anything, please mention it in the comments, and I will update the post with a mention.

The datasets I use in this post (primarily the Titanic `train.csv`), along with all of the code, are in this GitHub repo: https://github.com/michaelabehsera/feature-engineering-cookbook

Getting Started

1. Replacing NaN (or null) Values

Let's take a look at how many null values there are in each column. `Age` and `Cabin` are the columns with the most nulls. We will focus on replacing the null values in `Age`.

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

Mean/Median Imputation
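A minimal sketch using pandas' `fillna`; mean for roughly symmetric variables, median for skewed ones:

df['Age'] = df['Age'].fillna(df['Age'].mean())
# For a skewed variable, the median is usually the safer choice:
# df['Age'] = df['Age'].fillna(df['Age'].median())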

Replacing with 0 or -1
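This is just `fillna` with a constant; the sentinel value is a convention, not a requirement:

df['Age'] = df['Age'].fillna(-1)  # or 0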

Replacing with a Random Number / Random Sampling Imputation

We can replace `Age` with random numbers between 1 and 100 as follows.

import numpy as np
import matplotlib.pyplot as plt

null_rows = df['Age'].isnull()
num_null_rows = null_rows.sum()
rand = np.random.randint(1, 101, size=num_null_rows)
df.loc[null_rows, 'Age'] = rand
df['Age'].plot.hist(title='Distribution of Age - replace null with random value')
plt.show()

A smarter approach would be to replace Age with random samples from the non-null distribution of Age (as, with the previous approach, we would have generated just as many 99-year-olds as 25-year-olds).
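A minimal sketch of random sampling imputation, drawing (with replacement) from the observed `Age` values:

null_rows = df['Age'].isnull()
sampled = df['Age'].dropna().sample(n=null_rows.sum(), replace=True, random_state=0)
sampled.index = df.loc[null_rows].index
df.loc[null_rows, 'Age'] = sampled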

Indicating Missingness

We could also use an additional 0/1 variable to indicate to our model when Age is missing.
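In pandas this is a one-liner (the new column name here is my own choice):

df['Age_missing'] = df['Age'].isnull().astype(int)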

Imputation of NA by Values at the End of the Distribution
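A sketch of end-of-tail imputation, assuming the common choice of mean + 3 standard deviations as the "far" value:

far_value = df['Age'].mean() + 3 * df['Age'].std()
df['Age_end_tail'] = df['Age'].fillna(far_value)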

Replacing With Values of Your Choosing Based on an Assumption
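For example, if we assume missing ages cluster around a typical adult passenger (28 here is an arbitrary assumption, not a rule):

df['Age_assumed'] = df['Age'].fillna(28)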

Using Regression to Impute Attribute Missing Values

We will use a few numeric variables to predict the missing values of `Age` (we could use the text columns too, but they would need to be encoded first).

We fit a regression model on the rows where `Age` is present, then predict `Age` for the rows where it is null.
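A minimal sketch with scikit-learn; the choice of predictor columns is my own assumption (they are numeric and have no nulls):

from sklearn.linear_model import LinearRegression

features = ['Pclass', 'SibSp', 'Parch', 'Fare']  # assumed predictors
not_null = df['Age'].notnull()
reg = LinearRegression().fit(df.loc[not_null, features], df.loc[not_null, 'Age'])
df.loc[~not_null, 'Age'] = reg.predict(df.loc[~not_null, features])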

2. Feature Scaling

Standard Scaler

from sklearn.preprocessing import StandardScaler
x = StandardScaler().fit_transform(x)

MinMax Scaler

from sklearn.preprocessing import MinMaxScaler
x = MinMaxScaler().fit_transform(x)

Robust Scaler

from sklearn.preprocessing import RobustScaler
x = RobustScaler().fit_transform(x)

3. Engineering Outliers in Numerical Variables

Mean/median imputation or random sampling

If we have reason to believe that the outliers are due to mechanical error or problems during measurement, i.e. that they are similar in nature to missing data, then any of the methods discussed for missing data can be applied to replace them. Because the number of outliers is by definition small (otherwise they would not be outliers), it is reasonable to use mean/median imputation to replace them.

Identify Outliers with Quantiles

q25, q75 = np.percentile(df['Age'].dropna(), [25, 75])
IQR = q75 - q25
ulimit = q75 + 1.5*IQR
llimit = q25 - 1.5*IQR
print(ulimit, llimit, 'are the ulimit and llimit')
print('Implied Age outliers:')
df['Age'][(df['Age'] > ulimit) | (df['Age'] < llimit)]

Identify Outliers with Mean

Using the mean and standard deviation to detect outliers should only be done with data that is not very skewed. Age is somewhat skewed, so this could be an issue.

# Bounds as mean +/- 3 standard deviations (the two lines below are my reconstruction of the missing computation)
age = df['Age'].dropna().values
print((age.mean() + 3 * age.std(), age.mean() - 3 * age.std()))
#out: (73.248081099510756, -13.849845805393123)

Discretization

import numpy as np
import pandas as pd

df = pd.read_csv('train.csv')
not_null = ~df['Age'].isnull()
# Get the bin edges using np.histogram
num_bins = 10
_, bin_edges = np.histogram(df['Age'][not_null], bins=num_bins)
# Optionally create string labels instead:
# labels = ['Bin_{}'.format(i) for i in range(1, len(bin_edges))]
labels = [i for i in range(num_bins)]
# Create the new feature with pd.cut
df['discrete_Age'] = pd.cut(df['Age'], bins=bin_edges, labels=labels, include_lowest=True)

Trimming

# Get the outlier values
index_of_high_age = df[df.Age > 70].index
# Drop them
df = df.drop(index_of_high_age, axis=0)

Winsorization (Top Coding / Bottom Coding)

# Get the values of the 1st and 99th percentiles (bottom and top 1%)
llimit = np.percentile(df.Age.dropna(), 1)
ulimit = np.percentile(df.Age.dropna(), 99)  # ulimit was undefined here; the 99th percentile is the natural top-coding bound
# Create a copy of the age variable
df['Age_truncated'] = df.Age.copy()
# Replace all values above ulimit with value of ulimit
df.loc[df.Age > ulimit, 'Age_truncated'] = ulimit
# Replace all values below llimit with value of llimit
df.loc[df.Age < llimit, 'Age_truncated'] = llimit

Rank Transformation (When the Distances Don’t Matter so Much)

from scipy.stats import rankdata

# This is like sorting the variable and then assigning an index starting from 1 to each value
rankdata(df['Age'], method='dense')

4. Engineering Labels, Categorical Variables

One-Hot-Encoding and Pandas Get Dummies

One-Hot-Encoding

from sklearn.preprocessing import OneHotEncoder

# Fit and transform the Sex column; this line is my reconstruction, since `one` was otherwise undefined
one = OneHotEncoder().fit_transform(df[['Sex']]).toarray()
# Convert the transformed array to a DataFrame and merge the new columns with the original
onecol = pd.DataFrame(one)
results = pd.merge(onecol, df, left_index=True, right_index=True)

Get Dummies

pd.get_dummies(df['Sex']).head()

Dropping First

In models that use all features at once (most models apart from tree ensembles), it is wise to drop the first dummy variable, as it can be derived from the others (e.g. here we know someone is Female based on whether they are Male, so we can drop Female).
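`get_dummies` supports this directly via `drop_first`:

pd.get_dummies(df['Sex'], drop_first=True).head()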

Mean Encoding

We calculate the mean of the target variable for each class of the variable we wish to encode, and replace that variable with these means.

means_pclass = df[['Survived']].groupby(df['Pclass']).apply(np.mean)
means_pclass.columns = ['Mean Encoding']
means_pclass
# Merge the encoding into our dataframe (by matching Pclass to the index of the means dataframe)
df = pd.merge(df, means_pclass, left_on=df.Pclass, right_index=True)

Probability Ratio Encoding
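The code below relies on a `probability_ratio` helper that isn't shown in the post; a minimal sketch under the usual definition, P(target = 1) / P(target = 0) within each group:

def probability_ratio(x):
    # x is one group's DataFrame holding the 'Survived' column
    p = x.mean()
    return p / (1 - p)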

# Calculate the probability ratio encoding
prob_ratios_pclass = df[['Survived']].groupby(df['Pclass']).apply(lambda x: probability_ratio(x))
prob_ratios_pclass.columns = ['Prob Ratio Encoding']
prob_ratios_pclass
# Merge the encoding into our dataframe (by matching Pclass to the index of our prob. ratio dataframe)
df = pd.merge(df, prob_ratios_pclass, left_on=df.Pclass, right_index=True)
df.head()

Weight of Evidence Encoding
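Likewise, `weight_of_evidence` isn't shown; a minimal sketch under one common formulation, ln(P(target = 1) / P(target = 0)):

def weight_of_evidence(x):
    # x is one group's DataFrame holding the 'Survived' column
    p = x.mean()
    return np.log(p / (1 - p))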

# Calculate the weight of evidence encoding
woe_pclass = df[['Survived']].groupby(df['Pclass']).apply(lambda x: weight_of_evidence(x))
woe_pclass.columns = ['WOE Encoding']
woe_pclass
df = pd.merge(df, woe_pclass, left_on=df.Pclass, right_index=True)

Label Encoding

Cat.codes

df['Sex'] = df['Sex'].astype('category')
# cat.codes only works if the dtype is 'category'
df['Sex'].cat.codes.head()

Factorize

`pd.factorize` achieves a similar end to `cat.codes`, but also gives us the labels of each category.
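A quick example (the new column name is my own choice):

codes, labels = pd.factorize(df['Sex'])
df['Sex_encoded'] = codes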

Binary Encoding

# Can also use np.where to specify which values should be 1 or 0, as follows
# (the new column name is my own choice)
df['Sex_binary'] = np.where(df['Sex'] == 'male', 1, 0)

5. Engineering Dates

Creating Columns Based on Hour/Min…

time = pd.to_datetime(df['time'])

Pandas comes packed with datetime properties, which you can check out here: https://pandas.pydata.org/pandas-docs/stable/api.html#datetimelike-properties. You can even get a column with microseconds. Here are some of the ones I use most, with a quick extraction example after the list.

  • Month
  • Min
  • Seconds
  • Quarter
  • Semester
  • Day (number)
  • Day of the week
  • Hr
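A minimal sketch of pulling these out as columns (the column names are my own; `semester` isn't built in, so it is derived from the quarter):

time = pd.to_datetime(df['time'])
df['month'] = time.dt.month
df['minute'] = time.dt.minute
df['second'] = time.dt.second
df['quarter'] = time.dt.quarter
df['semester'] = np.where(time.dt.quarter <= 2, 1, 2)
df['day'] = time.dt.day
df['day_of_week'] = time.dt.dayofweek
df['hour'] = time.dt.hour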

Creating an isweekend Column
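A one-liner sketch, treating Saturday (5) and Sunday (6) as the weekend:

df['is_weekend'] = (time.dt.dayofweek >= 5).astype(int)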

6. Engineering Mixed Variables

We've seen that mixed variables are those whose values contain both numbers and labels.

How can we engineer this type of variable to use it in machine learning?

What we need to do in these cases is extract the categorical part in one variable and the numerical part in a different variable. Therefore, we obtain 2 variables from the original one.

There is not much to cover here besides giving one toy example.

# Toy data mixing numbers (stored as strings) with labels
data = ['Apple', '2', 'Banana', '6']

lst_strings = []
lst_int = []
for i in data:
    if i.isdigit():
        lst_int.append(int(i))
    else:
        lst_strings.append(i)

7. Engineering Rare Labels in Categorical Variables

Let's say a feature has many categories and only a small percentage of the rows fall into a long tail of rare categories. You can group those rare values together (e.g. as "Other") to reduce the number of distinct values in the dataset and potentially reduce overfitting.

These observations can be re-categorized by:

  • Replacing the rare label by most frequent label
  • Grouping the observations that show rare labels into a unique category (with a new label like ‘Rare’, or ‘Other’)

Replacing the Rare Label by Most Frequent Label
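A minimal sketch, assuming a 5% frequency threshold for "rare" and using the high-cardinality `Cabin` column (both choices are my own):

freq = df['Cabin'].value_counts(normalize=True)
rare_labels = freq[freq < 0.05].index
most_frequent = freq.index[0]
df['Cabin_grouped'] = df['Cabin'].replace(list(rare_labels), most_frequent)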

Grouping the Observations that Show Rare Labels Into a Unique Category (With a New Label like ‘Rare’, or ‘Other’)
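The same sketch, reusing `rare_labels` from above but mapping the rare labels to a new 'Rare' category instead:

df['Cabin_rare'] = df['Cabin'].where(~df['Cabin'].isin(rare_labels), 'Rare')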

8. Transforming Variables

Particularly for linear models, it can be useful to transform input features (or the target) prior to fitting the model, such that their distribution appears normal, or approximately normal (i.e. symmetric and bell-shaped).

Gaussian Transformation

df['Age Log'] = df['Age'].apply(np.log)
df['Age Log'].plot.hist()

Reciprocal Transformation

df['Age Reciprocal'] = 1 / df['Age']  # the transformation line was missing; this is its standard definition
df['Age Reciprocal'].plot.hist()

Square Root Transformation

df['Age Sqrt'] = np.sqrt(df['Age'])  # the transformation line was missing; this is its standard definition
df['Age Sqrt'].plot.hist()

Exponential Transformation

df['Age Exp'] = df['Age'] ** (1 / 1.2)  # the transformation line was missing; any fractional power works, 1/1.2 is one common choice
df['Age Exp'].plot.hist()

Boxcox Transformation

from scipy.stats import boxcox

# boxcox requires positive, non-null values; this assumes Age was already imputed (see section 1)
df['Age BoxCox'] = boxcox(df['Age'])[0]
df['Age BoxCox'].plot.hist()

9. Interaction Features

Perhaps, for example, older passengers who also paid a higher fare had a *particularly* high chance of not surviving on the Titanic. In such a case we would call that an *interaction* effect between `Age` and `Fare`. To help our model take account of this interaction effect, we can add a new variable `Age * Fare`.

df['Age_x_Fare'] = df['Age'] * df['Fare']

This is of course just one example of an interaction feature. It is a powerful technique, and it rewards some creativity on your part based on the dataset you're dealing with.
