Getting Started with Feature Engineering for Supervised Learning


By Phil Azar, 84.51° Data Scientist

“Feature Engineering for Supervised Learning” sounds like a mouthful, but if we break down the title into its parts we can move past the buzzwords into proven tactics that improve machine learning models and lead to more accurate business decisions.

Supervised learning is a branch of machine learning that seeks to predict a specific target by learning the relationship between that target variable and a set of features. These features can be raw or derived from statistical techniques and subject matter expertise, and your model will use them to predict the target variable.

Take an example problem: predicting the sales of Kroger stores. In this example, your target variable is a store's sales and your features will include anything that affects sales. Features might include a store's square footage, nearby competitors, and sales from the previous year. Feature engineering is the step beyond thinking up those features: preparing the feature set so the model can learn from it as effectively as possible. As tree-based and deep learning methods become more accessible, it is important that our training data is engineered to maximize the strengths of these models.
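
To make the later examples concrete, here is a hypothetical slice of what that training data might look like in pandas. The column names and values are invented for illustration, not taken from real Kroger data.

```python
import pandas as pd

# Hypothetical training data: one row per store. SALES is the target;
# everything else is a candidate feature.
stores = pd.DataFrame({
    "STORE_ID": [101, 102, 103, 104],
    "SQUARE_FOOTAGE": [45_000, 62_000, 38_000, 51_000],
    "NUM_COMPETITORS": [3, 1, 5, 2],
    "PRIOR_YEAR_SALES": [4.2e6, 6.8e6, 3.1e6, 5.0e6],
    "REGION": ["East", "West", "South", "East"],
    "SALES": [4.5e6, 7.1e6, 3.0e6, 5.4e6],  # target variable
})
```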

Below I will discuss some starting points for feature engineering and how each method can be best implemented for different types of learners.

From Categories to Numbers

Machine learning models see the world in terms of numbers, but often our data is categorical. This requires us to transform categorical data into numbers that our model can understand.

Some simple strategies to turn categories into usable numbers include one-hot encoding, ordinal encoding, frequency encoding, and mean encoding.

One-hot Encoding and Dummy Encoding

One-hot encoding creates K new features for a categorical variable with K categories, each encoded as 1 if the observation belongs to category k and 0 if not. A close relative of one-hot encoding is dummy encoding, in which one of the resulting columns is dropped, leaving K-1 new features. No information is lost in dummy encoding because the presence of the dropped category can be inferred: all of the remaining columns will be zero for records in that category. Dummy encoding is preferred in models where perfect multicollinearity is a concern, such as linear regression with an intercept.

Both approaches add several new columns to your data. As an example of dummy encoding, take a feature REGION with three categories: East, West, and South. Dummy encoding produces two new features, and an observation in the South region is encoded with 0s for both the East and West features.
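
Here is a minimal sketch of both encodings with pandas, using the illustrative REGION feature. Note that pandas drops the first category alphabetically, so the omitted column may differ from the example above.

```python
import pandas as pd

df = pd.DataFrame({"REGION": ["East", "South", "West", "East"]})

# One-hot encoding: K indicator columns, one per category.
one_hot = pd.get_dummies(df["REGION"], prefix="REGION").astype(int)

# Dummy encoding: K-1 columns. The dropped category is implied whenever
# all remaining indicators are 0 (pandas drops the first level here).
dummy = pd.get_dummies(df["REGION"], prefix="REGION", drop_first=True).astype(int)
```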

In one-hot encoding your dataset will become wider, in proportion to the number of categories in the feature. For that reason, avoid using one-hot encoding for features with a high number of categories, or high cardinality. This increase in features may drown out the signal in the feature and cause your decision tree to form many independent splits on each encoded feature, thus increasing both training and scoring time.

Binning the categories in a column, thereby reducing the total number of values (K), will help mitigate the issue of high cardinality. For example, if we had a column for a store's "state," one-hot encoding it would create 50 new features. Binning each state into a coarser grouping, like region, reduces the cardinality of the feature and speeds up convergence. However, for higher-cardinality categorical features where binning is undesirable (perhaps because it would drop important information), frequency encoding and mean encoding are better-suited solutions.
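
As a sketch, binning a hypothetical STATE column into regions before encoding might look like this; the state-to-region lookup is abbreviated for illustration.

```python
import pandas as pd

df = pd.DataFrame({"STATE": ["OH", "MT", "NM", "KY", "OH"]})

# Hypothetical lookup collapsing 50 states into a few regions
# (only a handful of entries shown).
STATE_TO_REGION = {"OH": "Midwest", "KY": "South", "MT": "West", "NM": "West"}

df["REGION"] = df["STATE"].map(STATE_TO_REGION)

# Encoding REGION instead of STATE yields far fewer indicator columns.
region_dummies = pd.get_dummies(df["REGION"], prefix="REGION", drop_first=True)
```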

Ordinal Encoding

Ordinal encoding preserves the order of a feature's categories as information for your model. Whereas one-hot encoding focuses only on category membership, ordinal encoding focuses on order, so the categorical feature must be naturally ordered. Consider a feature ASSORTMENT_SCORE — a ranking of each store's assortment from high to low — which meets our criterion of discrete categories with a measurable order. Observations with HIGH assortment are encoded as 3 and LOW assortment as 1.

This encoding preserves the order of high to low as 3 to 1 and maintains the positive correlation with our target variable, sales: as the ordinally-encoded feature increases, so does the target. Maintaining an ordinal encoding's natural correlation with your target will speed up convergence in both linear and decision tree methods.
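
A minimal sketch of that mapping in pandas, with an illustrative MEDIUM level added between HIGH and LOW:

```python
import pandas as pd

df = pd.DataFrame({"ASSORTMENT_SCORE": ["HIGH", "LOW", "MEDIUM", "HIGH"]})

# Explicit mapping so the encoded values preserve the natural ordering.
order = {"LOW": 1, "MEDIUM": 2, "HIGH": 3}
df["ASSORTMENT_SCORE_ENC"] = df["ASSORTMENT_SCORE"].map(order)
```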

Frequency Encoding

Frequency encoding derives a new feature based on the relative occurrence of each category in your training set. Categories that occur an equal number of times will be encoded to the same value. This method speeds up the learning of a decision tree model and is easy to implement; however, it comes with some shortcomings. If the frequency of a category has no relationship with the target variable, or if categories with similar frequencies do not behave similarly, frequency encoding may add little information to your model. If you are considering this method, first explore the relationship of category frequencies to the target variable and the relationship between similarly frequent categories.

In our store sales example, suppose we have more stores in Ohio than in Montana. We would need to answer whether the number of stores in a state has any relationship with sales. Likewise, New Mexico and Montana have the same encoded frequency: do stores in these states behave similarly? Like many feature engineering questions, this can only be answered by exploring your data and applying univariate tests between the frequency feature and the mean target value. If these relationships exist, a frequency-encoded feature can drive informative splits when training decision trees.

However, frequency encoding may lose information if the encoded feature shares an important interaction with other columns in your training data. Keeping with the example above, Montana and New Mexico share the same frequency, so encoding them identically discards any interaction with, say, the number of competitors, which may differ greatly between the two states. That interaction between state and number of competitors disappears from the model and, if it matters, accuracy will suffer.
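
Here is a minimal frequency-encoding sketch, with the kind of quick univariate check described above. The STATE and SALES values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "STATE": ["OH", "OH", "OH", "MT", "NM", "KY", "KY"],
    "SALES": [5.1, 4.8, 5.5, 2.9, 3.1, 4.0, 4.2],  # target, in $M
})

# Frequency encoding: map each category to its share of the training data.
freq = df["STATE"].value_counts(normalize=True)
df["STATE_FREQ"] = df["STATE"].map(freq)

# Quick univariate check: does category frequency relate to the target at all?
print(df["STATE_FREQ"].corr(df["SALES"]))
```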

Mean Encoding

Mean encoding is a darling of Kaggle competitions, but it can be difficult to implement properly. Mean encoding creates a new feature containing, for each category, the mean of the target variable over that category in your training data. This approach embeds the importance of the category directly into the feature, which can speed up convergence, decrease bias, and potentially produce a robust model.

But tread lightly. Mean encoding can easily cause overfitting in your model, which will make your learner appear very strong on the training set but eventually perform poorly in the real world. High-cardinality features run the risk of having low counts per category and are prime candidates for leakage and overfitting. Consider a store model with a mean-encoded feature for ZIP code. If there are only one or two stores (that is, observations) per ZIP code, the mean-encoded feature will overfit to each observation and not be generalizable.

While mean encoding is considered one of the best feature engineering methodologies in use today, there are some steps to ensure you are performing it correctly:

  • Always use k-fold cross-validation and compute the mean encodings from your training folds only, applying them to the held-out fold. This prevents a row's own target value from leaking into its encoding and overfitting to the training set.
  • Regularize each category's target mean toward the training data's global target mean to decrease bias against low-incidence categories and prevent target leakage.

Mean encoding is a powerful feature engineering method for decision tree learners because splits can occur on a feature directly correlated with the target value. While it may produce a highly informative feature, it is a difficult feature to engineer well. For more information on how to properly embed mean-encoded features, see this Kaggle research study.
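
Here is one way to sketch out-of-fold mean encoding with smoothing toward the global mean, along the lines of the safeguards above. The smoothing weight `m` and the toy data are illustrative, not a canonical implementation.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def mean_encode(df, cat_col, target_col, n_splits=5, m=10.0, seed=0):
    """Out-of-fold mean encoding, smoothed toward the global target mean."""
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(df):
        train, val = df.iloc[train_idx], df.iloc[val_idx]
        global_mean = train[target_col].mean()
        stats = train.groupby(cat_col)[target_col].agg(["mean", "count"])
        # Shrink each category mean toward the global mean;
        # rare categories get pulled in more strongly.
        smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
        encoded.iloc[val_idx] = val[cat_col].map(smoothed).fillna(global_mean).values
    return encoded

df = pd.DataFrame({
    "STATE": ["OH", "OH", "MT", "NM", "KY", "OH", "KY", "MT", "NM", "OH"],
    "SALES": [5.1, 4.8, 2.9, 3.1, 4.0, 5.5, 4.2, 3.0, 3.3, 5.2],
})
df["STATE_MEAN_ENC"] = mean_encode(df, "STATE", "SALES")
```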

Standardization of Numerical Data for Deep Learning

Standardizing numerical features can speed up learning and reduce the risk of overfitting to outliers within our feature set. For decision tree learners, standardization is not a prerequisite and could hurt the local interpretability of your model; however, it can often speed up convergence and reduce compute time.

In neural networks and linear methods, however, standardizing numerical features before learning is strongly recommended, particularly if the magnitudes of your input features vary significantly. Scaling features to a common range, with either z-score standardization or min-max scaling, weights all features similarly in the cost function and reduces the chance of a single feature dominating the model's learning.

For example, if our store data includes both distance to the city center in miles and store size in square feet, their vastly different scales can make the relationships they carry almost impossible to learn. Normalizing them onto the same scale helps the model learn from the feature set and helps a neural net generalize to unseen data.
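
A short sketch with scikit-learn, using the hypothetical distance and square-footage columns:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

df = pd.DataFrame({
    "DIST_TO_CITY_CENTER_MI": [0.5, 2.3, 12.0, 7.8],
    "SQUARE_FOOTAGE": [45_000, 62_000, 38_000, 51_000],
})

# z-score scaling: each column ends up with mean 0 and unit variance.
z_scaled = StandardScaler().fit_transform(df)

# min-max scaling: each column is squeezed into [0, 1].
mm_scaled = MinMaxScaler().fit_transform(df)

# Fit scalers on training data only, then reuse the fitted scaler on
# validation/test data to avoid leaking information across splits.
```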

Target Leakage in our Features

Leakage occurs when a feature contains information about an observation's target value, even indirectly. Sometimes this information comes from an explicitly included raw column, but more often it hides inside an engineered feature and compromises the generalizability of our model. When engineering features, it is important to check each one for target leakage as a safety net before training. A simple univariate statistical check between your target value and each feature can reveal leakage — especially when it is not obvious!
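
One simple screen, sketched below, is to rank numeric features by their absolute correlation with the target and investigate anything suspiciously close to 1. The helper name here is mine, not a standard API, and it is only a first pass rather than a complete leakage audit.

```python
import pandas as pd

def leakage_screen(df: pd.DataFrame, target: str) -> pd.Series:
    """Rank numeric features by |correlation| with the target."""
    corr = df.corr(numeric_only=True)[target].drop(target)
    # Features correlated suspiciously close to 1 deserve a closer look.
    return corr.abs().sort_values(ascending=False)
```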

Context and Expertise Matter

The above examples focus on feature engineering from a statistical perspective, much of which requires knowing your data, performing univariate tests, and trial and error. I made a point earlier that feature engineering extracts important information after deriving features based on the business problem and context. However, this process should not be perfectly linear: first business knowledge, then engineering skill. Instead, it is ideally an iterative flow back and forth between the two. I will explain with an example I experienced in a recent project, in which I was tasked with finding the peak sales times for Starbucks within Kroger.

In this project, my model failed on a certain set of stores with high sales spikes in the afternoon and late at night. “Who in their right mind buys coffee at 6pm?” I thought. When I presented this to a stakeholder, he knew right away. College students. The store that I thought was a bizarre outlier was really just located near a college campus and frequented by students looking for a late night study boost. Exploring the data alone, I had missed this, but for an expert who knows these stores and their behaviors well, it was obvious.

When I returned to work on my model, encoding this information into it increased its accuracy, and eventually its business value as well. Context and expertise are equally important in finding the signal in the noise when engineering features.

Conclusion

Feature engineering is the confluence of art and science. It requires knowing your data, understanding the business problem and solution, and a command of statistical methods to transform raw data into knowledge that a model can learn from. This is by no means an exhaustive view of all the methods out there, but enough to get you started in training models and solving problems.
