All About That Feature Engineering

tanta base
6 min read · Dec 7, 2023


This article will go over some feature engineering techniques that you can use to improve and prepare your data. Feature engineering is essentially manipulating the data in ways that refine its overall quality. It is a common practice, since most real-world data sets are not ready for machine learning as they are.

Applied machine learning is basically feature engineering — Andrew Ng

[Image: What it would be like if the data could engineer itself]

The first step of feature engineering is to know your data. This knowledge is usually gained through exploratory data analysis. Once you understand your data, you can improve upon it! So, let’s jump into these techniques.

Dimensionality Reduction

It’s true, you can have too much of a good thing. In machine learning, as you increase your features you increase your dimensional space. This is known as the Curse of Dimensionality, and the solution is dimensionality reduction.

The Curse of Dimensionality means that as the amount of features (dimensions) in your data set increases, the amount of data that is needed to generalize grows exponentially.

Wow, there is a lot to unpack there, so let’s break it down. A feature is a specific column of data points in your data set, for example: age, height, weight, income, car model, etc. Generalization is a model’s ability to make accurate predictions on new, unseen data.

In order for your model to make predictions on new data, it has to be trained with large amounts of high quality data. As you increase the number of features, the dataset becomes more sparse, making learning and pattern recognition more difficult. In addition, you can also run into issues with not having enough compute resources for training.

[Image: High-dimensional spaces mean more chaos]

Feature Selection

If you have the domain knowledge, you can simply remove features that aren’t relevant to the target you are predicting, essentially subsetting your original features and reducing the dimensional space.
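As a minimal sketch, feature selection can be as simple as dropping the columns you believe are irrelevant (the column names below are made up for illustration):

```python
import pandas as pd

# Hypothetical data set; the column names are made up for illustration
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40_000, 52_000, 88_000, 91_000],
    "favorite_color": ["red", "blue", "green", "red"],  # assume domain knowledge says this is irrelevant
    "defaulted": [0, 0, 1, 1],  # target
})

# Subset the original features, reducing the dimensional space
selected = df.drop(columns=["favorite_color"])
print(selected.columns.tolist())
```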

Feature Extraction

You can use unsupervised learning techniques to distill many features into condensed new ones. For example, you can use PCA or K-means Clustering to transform your many features (a high-dimensional space) into a condensed set of new features (a lower-dimensional subspace) that represents the original information.
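Here is a rough sketch of feature extraction with PCA in scikit-learn (the random data and the choice of 3 components are just placeholders):

```python
import numpy as np
from sklearn.decomposition import PCA

# Placeholder data: 100 samples with 20 features (a high-dimensional space)
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 20))

# Distill the 20 original features into 3 new components (a lower-dimensional subspace)
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (100, 3)
print(pca.explained_variance_ratio_.sum())  # share of the variance the 3 components keep
```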

Unbalanced Datasets

In real world data sets, unbalanced data is extremely common; I guess you could say it’s too much of one good thing. Unbalanced data sets can lead to model bias and skewed accuracy metrics, and your model may not be able to generalize. This is most often an issue with neural networks.

For example, if you are building a fraud detection model but fraud rarely happens, then your data set is imbalanced and your model may not be able to accurately predict fraud when it occurs in production data.

When doing classification, I personally don’t like any of my classes to be more than 15% larger or smaller than the other classes.

[Image: An imbalance can be pretty to look at, but hard to generalize]

Oversampling

Duplicate samples from the minority class; this can be done at random. Essentially, fabricate more of your minority class. In the world of text classification, you can use Markov Chains to generate new sentences for minority classes.
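If you happen to use the imbalanced-learn library (an assumption on my part, and not the only way to do this), random oversampling might look like this sketch:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import RandomOverSampler

# Placeholder imbalanced dataset: roughly 90% majority class, 10% minority class
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))  # class counts before

# Randomly duplicate minority-class samples until the classes are balanced
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(X, y)
print(Counter(y_resampled))  # class counts after
```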

Undersampling

Oof, I gotta say this, but generally undersampling won’t be the right solution. However, it is a technique, so I have to review it anyway.

Undersampling is when you throw away (*gasp*) rows of data from the majority class. This is typically not the right approach because most of the time data is scarce and removing data can bias your model.

Usually the only reason to remove data is when there is a huge disparity among the classes and/or there are constraints on compute resources or scaling issues.
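For completeness, here is a sketch of random undersampling done with plain pandas (the toy data and the column names are just for illustration):

```python
import pandas as pd

# Hypothetical imbalanced dataset: 900 majority rows, 100 minority rows
df = pd.DataFrame({
    "feature": range(1000),
    "label": [0] * 900 + [1] * 100,
})

# Randomly keep only as many rows of each class as the smallest class has
minority_count = df["label"].value_counts().min()
undersampled = df.groupby("label").sample(n=minority_count, random_state=0)

print(undersampled["label"].value_counts())  # both classes now have 100 rows
```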

SMOTE

Also known as the Synthetic Minority Oversampling Technique. It artificially generates new samples of the minority class using nearest neighbors. It’s the same idea as running K-nearest neighbors for imputation; more on that here.

Essentially, you run K-nearest neighbors on each sample of the minority class and create new samples by interpolating between that sample and its neighbors. SMOTE is typically used for continuous data; however, there are variations of SMOTE that work for categorical data.
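A minimal SMOTE sketch with imbalanced-learn (synthetic data, and the default of 5 nearest neighbors, both assumptions for illustration):

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Placeholder imbalanced dataset with continuous features
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))  # class counts before

# Generate synthetic minority samples by interpolating between nearest neighbors
smote = SMOTE(k_neighbors=5, random_state=0)
X_resampled, y_resampled = smote.fit_resample(X, y)
print(Counter(y_resampled))  # class counts after
```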

Adjusting Thresholds

In order to compensate for imbalanced data, you can set a probability threshold at which your model accepts or rejects a label when it is making predictions.

For example, if you have a balanced data set and a binary classification problem, your threshold can be set to 0.5. However, if your data set has a majority of one class, 0.5 is no longer an appropriate threshold. There are various techniques that can be used to find the optimal threshold, such as the ROC curve, G-mean, etc.
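As a sketch, adjusting the threshold just means comparing predicted probabilities against your own cutoff instead of the default 0.5 (the 0.3 below is a hypothetical value you would tune with one of the techniques above):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Placeholder imbalanced binary classification problem
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
model = LogisticRegression().fit(X, y)

# Instead of the default 0.5 cutoff, accept the minority label at a lower probability
probabilities = model.predict_proba(X)[:, 1]  # probability of the positive (minority) class
threshold = 0.3                               # hypothetical cutoff; tune with ROC curve, G-mean, etc.
predictions = (probabilities >= threshold).astype(int)

print(predictions.sum(), "positive predictions at threshold", threshold)
```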

Outliers

First, to spot an outlier you have to define an outlier. An outlier is an extremely high or extremely low data point relative to its neighboring data points. An outlier can also be defined by how many standard deviations from the mean it is, where the standard deviation is the square root of the variance and the variance is the average of the squared differences from the mean. Some techniques for handling outliers are:

  • Removing them (proceed with caution and verify that they really are outliers and not necessary for model building)
  • Capping them by defining a min or max point and setting your outliers to that point (sketched after this list)
  • Normalization
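Here is a rough sketch of capping at three standard deviations from the mean (the cutoff of three and the data are illustrative choices, not a rule):

```python
import numpy as np

# Placeholder feature: mostly well-behaved values plus a couple of extreme ones
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(10, 1, size=200), [95.0, -80.0]])

# Define the min/max points as 3 standard deviations from the mean
mean, std = values.mean(), values.std()
lower, upper = mean - 3 * std, mean + 3 * std

# Cap: anything beyond the min/max point is set to that point
capped = np.clip(values, lower, upper)
print(values.min(), values.max(), "->", capped.min(), capped.max())
```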

Transforming

You can apply a function to a feature to make it better for training. Some techniques (sketched after this list) are:

  • Rescaling is adding/subtracting a constant and then multiplying/dividing by a constant; it can be used for converting measurements (for example, Celsius to Fahrenheit).
  • Normalizing brings all the variables to a common scale. For example, if you are working with huge numbers, you can normalize them onto a logarithmic scale.
  • Standardizing rescales the data so that the mean is 0 and the standard deviation is 1, which converts your data to the standard normal distribution.

Transforming can improve model training!
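A quick sketch of normalizing to a log scale and then standardizing with scikit-learn (the income values are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder feature spanning several orders of magnitude (e.g. incomes)
incomes = np.array([[20_000.0], [45_000.0], [120_000.0], [2_500_000.0]])

# Normalizing onto a logarithmic scale tames the huge values
log_incomes = np.log(incomes)

# Standardizing rescales to mean 0 and standard deviation 1
scaler = StandardScaler()
standardized = scaler.fit_transform(log_incomes)

print(round(standardized.mean(), 3), round(standardized.std(), 3))  # ~0.0 and ~1.0
```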

Encoding

Changing your data into a numerical representation that can be used by the model, for example, converting text to a bag of words. You can also use one-hot encoding to convert categorical data into vectors of zeros and ones. There are various techniques depending on the data type and the model you are using.
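A minimal one-hot encoding sketch with pandas (the categories below are made up):

```python
import pandas as pd

# Placeholder categorical feature
df = pd.DataFrame({"car_model": ["sedan", "truck", "sedan", "hatchback"]})

# One-hot encoding: each category becomes its own column of zeros and ones
encoded = pd.get_dummies(df, columns=["car_model"])
print(encoded)
```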

Binning

This is bucketing similar observations together and can be used to turn numerical data into ordinal data, for example, binning test scores into letter grades. Two techniques (sketched after this list) are:

  • binning data based on common features, useful when there is uncertainty in your data
  • binning based on quantile category, which ensures evenly sized bins.
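Here is a sketch of both approaches with pandas: fixed ranges for letter grades, and quantile bins for evenly sized buckets (the scores and cut points are illustrative):

```python
import pandas as pd

# Placeholder test scores
scores = pd.Series([55, 67, 72, 81, 88, 93, 99])

# Bin by fixed ranges (e.g. grade letters), useful when the ranges have meaning
grades = pd.cut(scores, bins=[0, 60, 70, 80, 90, 100], labels=["F", "D", "C", "B", "A"])

# Bin by quantile so each bucket holds roughly the same number of observations
quartiles = pd.qcut(scores, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(pd.DataFrame({"score": scores, "grade": grades, "quartile": quartiles}))
```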

Binning can also be used for assessing recency, frequency and monetary value (RFM) in business analysis.

Shuffling

Typically, shuffling your data before training your model is good practice. This way your model doesn’t learn residual patterns that result from the way your data was collected. Keep that model on its toes!
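A tiny shuffling sketch with pandas (the toy data is sorted by label on purpose, to mimic a residual collection pattern):

```python
import pandas as pd

# Placeholder dataset that happens to be sorted by the target (an artifact of how it was collected)
df = pd.DataFrame({"feature": range(10), "label": [0] * 5 + [1] * 5})

# Shuffle the rows before training; frac=1 returns all rows in random order
shuffled = df.sample(frac=1, random_state=0).reset_index(drop=True)
print(shuffled)
```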

[Image: Make sure your data is properly engineered!]

Well, that’s it for this article! Want more? Check out this article on imputing missing data and this series on built-in algorithms in SageMaker:

  • 1/5 for Linear Learner, XGBoost, Seq-to-Seq and DeepAR here
  • 2/5 for BlazingText, Object2Vec, Object Detection, Image Classification and DeepAR here
  • 3/5 for Semantic Segmentation, Random Cut Forest, Neural Topic Model and LDA here
  • 4/5 for KNN, K-Means, PCA and Factorization Machines here
  • 5/5 for IP insights and reinforcement learning here

tanta base

I am a data and machine learning engineer. I specialize in all things natural language, recommendation systems, information retrieval, chatbots and bioinformatics.