Feature Engineering for Machine Learning (2/3)

Part 2: Feature Generation

Wing Poon
Towards Data Science

--

Image by Pete Linforth from Pixabay

In this second installment of the three-part series on Feature Engineering (Part I: Data Preprocessing), we'll see that there is an almost infinite number of ways to build new features from existing ones. Once you're aware of the basic techniques described below, the art of Feature Generation really lies in developing an intuition for what to try in a given problem domain.

“Good features allow a simple model to beat a complex model”

— Peter Norvig

2. Feature Generation

For this article, we’ll be jointly describing both Feature Extraction, which generally refers to domain-specific methods of dimensionality reduction, as well as Feature Generation, usually accomplished via i. mapping existing features into a new space, ii. combining multiple features into a composite, iii. aggregating data to find patterns or iv. merging auxiliary data. We’ll be grouping methods by their applicability to the underlying data type.

2.1 Date & Time

Events often exhibit periodicity or seasonality. The periodicity may manifest at more than one time-scale so, depending on your data, you may wish to decompose a timestamp column into multiple columns, such as: Minute, Hour, Day of week, Weekday-or-Weekend, Day of Month, Month, Season or Year. Doing so will also let you use pd.DataFrame.groupby() to perform aggregations, which is in itself one of the most powerful ways to generate new features. One easy way to break up a timestamp is to use Pandas' string methods, e.g. pd.Series.str.split():

# the 'date' column in transactions is encoded as "01.04.2022" (DD.MM.YYYY)
# expand=True returns one column per split component and preserves the index,
# so the three new columns can be assigned directly to the dataframe
transactions[['day', 'month', 'year']] = transactions['date'].str.split('.', expand=True)
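
If the strings parse cleanly, an alternative sketch (assuming the same 'date' column and day-first format as above) is to convert the column to a true datetime and use the .dt accessor, which also gives you day-of-week and weekend flags for free:

# parse once, then derive as many calendar columns as needed
transactions['date'] = pd.to_datetime(transactions['date'], format='%d.%m.%Y')
transactions['day_of_week'] = transactions['date'].dt.dayofweek               # Monday = 0
transactions['is_weekend'] = (transactions['date'].dt.dayofweek >= 5).astype(int)
transactions['month'] = transactions['date'].dt.month
transactions['year'] = transactions['date'].dt.year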

Once these columns are created, to predict future sales of a product for example, you can calculate and add the following new features to your model: i. the number of items sold on the same day last week, ii. the average daily sales over the last month, and iii. sales from the same month last year.

Another very useful type of feature is the time delta, i.e. the difference between two dates. For example, when predicting customer churn: the number of months since the start of the service/subscription, since contract expiration, or since the last call to customer support. In fraud detection, it could be the days since the customer's last credit-card transaction. In consumer spending, the days since the last bi-weekly pay period (the most common pay period according to the U.S. Bureau of Labor Statistics). In medical diagnosis, subtracting the date-of-birth gives age, and further subtracting the 20th-percentile target-positive age gives age in excess of typical disease onset.
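
As a minimal sketch (the 'customer_id' column is hypothetical, and 'date' is assumed to already be a datetime as above), the days since each customer's previous transaction can be computed with a grouped diff:

# sort within each customer, then difference consecutive transaction dates;
# a customer's first transaction gets NaN
transactions = transactions.sort_values(['customer_id', 'date'])
transactions['days_since_prev_txn'] = transactions.groupby('customer_id')['date'].diff().dt.days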

When events occur across multiple time zones, it may be beneficial to encode the local time for each sample if their corresponding location is known. E.g. a U.S. online retailer serves customers across 4 time zones (Pacific, Mountain, Central and Eastern), and when predicting click-through rates, whether it’s 8pm or 11pm for the customer may make a difference in how they respond to recommendations or ads.

2.2 Geolocation

For geolocations, say if predicting home prices, one way to add new features is to supplement with auxiliary data. You can add distances to relevant landmarks: city-center, places of worship, good schools, etc. If the geolocation data has been anonymized (i.e. non-absolute coordinates), you can still perform clustering and then calculate the distance to cluster centroids.
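
A sketch of the centroid-distance idea, assuming hypothetical 'lat'/'lon' columns in a homes dataframe (for real coordinates you would likely use haversine rather than Euclidean distance):

import numpy as np
from sklearn.cluster import KMeans

# cluster the (possibly anonymized) coordinates, then measure each home's
# distance to its assigned cluster centre
coords = homes[['lat', 'lon']].to_numpy()
kmeans = KMeans(n_clusters=20, n_init=10, random_state=0).fit(coords)
centroids = kmeans.cluster_centers_[kmeans.labels_]
homes['dist_to_cluster_centre'] = np.linalg.norm(coords - centroids, axis=1)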

Local minima/maxima are also useful points of reference for calculating distances from, serving as proxies for desirable or undesirable geographical entities. Yet another feature is surrounding building density, with high densities indicating popularity, a confluence of transportation options, etc.

2.3 Pricing

Image by 200degrees from Pixabay

To an algorithm, there is hardly any difference between an item priced at $5.00 versus $4.99, but we know humans aren't rational. Therefore, if predicting sales for example, you could add a flag for prices that end in $0.49 or $0.99 (or add a column for the fractional part of the price and let the decision tree figure out the patterns by itself). In the same vein, it is well known that certain price points have psychological resonance and serve as resistance levels, e.g. $19, $49, $199, so add a (modulo-x) feature to help your algorithm detect them.
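
A small sketch of both ideas, assuming a hypothetical items dataframe with a 'price' column:

import numpy as np

# fractional part of the price, a flag for "charm" endings, and a modulo
# feature so that $19, $49 and $199 all land on the same value (9)
items['price_cents'] = (items['price'] * 100).round().astype(int) % 100
items['is_charm_price'] = items['price_cents'].isin([49, 99]).astype(int)
items['price_mod_10'] = np.floor(items['price']) % 10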

2.4 Numerical

Linear Regression isn't limited to modeling straight-line relationships in the raw inputs: the model only needs to be linear in its coefficients, while the features themselves can be nonlinear transformations of the inputs. So adding polynomial features allows Linear Regression to be much more expressive and reduces bias (underfitting). sklearn.preprocessing.PolynomialFeatures() makes adding them a cinch, and even supports interaction features, as discussed next.
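
For instance (a toy sketch with two made-up feature columns):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0],
              [1.0, 5.0]])
poly = PolynomialFeatures(degree=2, include_bias=False)
print(poly.fit_transform(X))
# output columns: x1, x2, x1^2, x1*x2, x2^2
# interaction_only=True would keep just x1, x2 and the x1*x2 cross-term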

Numerical features are good candidates for interaction features (also called interaction terms). These are formed by taking sums, differences, products or ratios of two or more features. They are especially beneficial for tree-based models, which have difficulty extracting such dependencies on their own.

The difficulty for us, of course, is that there's an infinite number of ways to create such interactions, and only some are helpful. E.g. calculating price per square foot given price and floor area, average household size given census block group size and number of households, or multiplying pressure by temperature to help predict the rate of a chemical reaction.

t-SNE is usually used as a visualization aid, but you can also use its projections as features. Be warned though that t-SNE is extremely sensitive to hyperparameter (perplexity) values, so you'll need to experiment to find what works best. If you do use t-SNE as a feature, do not use scikit-learn's implementation (sklearn.manifold.TSNE) as it does not have a .transform() method. Instead, use openTSNE, which is not only faster but, crucially, supports embedding new points via its .transform() method. If projecting to only 2 dimensions is too limiting for your data, you can give UMAP a try.
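
A minimal sketch of how this might look with openTSNE (X_train / X_valid are placeholder arrays; sweep perplexity for your data):

import numpy as np
from openTSNE import TSNE

# fit on the training split only, then embed unseen points with .transform()
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
train_embedding = tsne.fit(X_train)          # a TSNEEmbedding (ndarray subclass)
train_2d = np.asarray(train_embedding)       # two extra feature columns
valid_2d = train_embedding.transform(X_valid)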

Bucketing of continuous variables may be useful if there are multiple distinct peaks in the feature histogram. A plot of longitude versus home prices shows no discernible relationship, but plotting the histogram reveals multiple peaks, corresponding to cities that lie within longitudinal 'bands'; these bands can be used as thresholds for binning the continuous feature into categories.
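
With pd.cut(), this amounts to choosing bin edges that sit between the histogram peaks (the edges and labels below are made up for illustration):

# bin longitude into bands read off the histogram
bins = [-125, -121, -118, -114]
homes['lon_band'] = pd.cut(homes['longitude'], bins=bins,
                           labels=['band_1', 'band_2', 'band_3'])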

2.5 Categorical

Categorical interaction features are useful for Linear and k-NN models, but not tree-based models. They are formed by string-concatenating two categorical features to form a new composite, e.g. concatenating ‘sex’ ∈ {M, F} and ‘education-level’ ∈ {1, 2, 3, 4} to give ‘sex-edc’ ∈ {M1, F1, M2, …}. Good candidates are features that are frequently referenced on adjacent decision-tree nodes.
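
In pandas this is a one-liner (column names are hypothetical):

# composite categorical feature from two existing ones
df['sex_edu'] = df['sex'].astype(str) + df['education_level'].astype(str)
# e.g. 'M1', 'F3', ...; then encode the composite like any other categorical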

Categorical Target Encoding (also known as Likelihood or Mean Encoding) is very popular on Kaggle competitions. It is the encoding of each categorical class as some function of the target class values. Popular functions include:

  • Mean: P(Y=1 | X_i)
  • Weight of Evidence (WOE): log( P(Y=1 | X_i) / P(Y=0 | X_i) )
  • Count: #(Y=1 | X_i)
  • Difference: #(Y=1 | X_i) − #(Y=0 | X_i)

e.g. the Mean can be calculated using df.groupby('feature')['y'].mean() and then mapped back onto the feature column

Decision-trees have difficulty with high categorical cardinality (many unique values) due to max-depth limitations (you can only go one of two ways at each node) — target encoding results in lower loss for the same tree-depth.

The main problem with target encoding is the risk of severe overfitting or data leakage: each row's encoding is partly derived from its own target value, and rare categories can end up memorizing the target outright. The problem is compounded when the data distribution shifts between the training and validation/test sets (e.g. the target mean changes). Out-of-fold (K-fold) encoding and additive smoothing are two ways to mitigate this risk; a sketch combining both follows.
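
A minimal sketch of out-of-fold mean encoding with additive smoothing (the function and column names are hypothetical, not from the article):

import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def kfold_mean_encode(df, cat_col, target_col, n_splits=5, alpha=10.0):
    # encode each row using target statistics computed on the *other* folds,
    # shrinking rare categories toward the global mean (additive smoothing)
    global_mean = df[target_col].mean()
    encoded = pd.Series(np.nan, index=df.index, dtype=float)
    for train_idx, valid_idx in KFold(n_splits, shuffle=True, random_state=0).split(df):
        train = df.iloc[train_idx]
        stats = train.groupby(cat_col)[target_col].agg(['sum', 'count'])
        smoothed = (stats['sum'] + alpha * global_mean) / (stats['count'] + alpha)
        encoded.iloc[valid_idx] = df.iloc[valid_idx][cat_col].map(smoothed).fillna(global_mean).values
    return encoded

# usage: df['shop_id_enc'] = kfold_mean_encode(df, 'shop_id', 'y')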

2.6 Statistics Aggregation

This is one of the most powerful ways to create new features from existing ones. Three choices have to be made: i. what to group by, ii. how to aggregate each group, and finally, iii. on what key to join the aggregated statistics back onto the main table. Here are some examples:

  • Calculate the Min/Max/Std-Dev (aggregation) of the cost of all ads shown to a user (join-on) on the same page (group-by). Can be useful for predicting Click-Through Rate on a specific ad by providing context, i.e. more costly ads on that page will be more prominently featured and will therefore negatively impact cheaper ads placed on the same page
  • Number of pages the user has already viewed during a session, i.e. recommendation/ad fatigue. Number of symptoms a patient has, i.e. any one symptom may not be significant, but several together may be
  • For each customer over the last year, the median time between two consecutive transactions, mean number of transactions per month, mean and std-dev of charges per spending category, all can help predict fraud by informing how frequently the card is typically used, on what, for how much and the pattern of usage (ad-hoc vs regular automatic payments)

Pandas’ .groupby() is indispensable for calculating these statistics, but groups may also need to be synthesized on-the-fly from numeric features, e.g. the average price/sq-ft of homes within a set radius.
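
As a sketch of the first bullet above (a hypothetical ads dataframe with 'page_id' and 'ad_cost' columns):

# per-page ad-cost statistics, joined back onto every impression on that page
page_stats = (ads.groupby('page_id')['ad_cost']
                 .agg(page_cost_min='min', page_cost_max='max', page_cost_std='std')
                 .reset_index())
ads = ads.merge(page_stats, on='page_id', how='left')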

Sometimes data is spread over multiple tables, in which case, supplement the main (e.g. transaction) table with insights gleaned from other tables, e.g. adding a flag in the transaction table if the merchant is associated with a recent spike in disputed transactions.

2.7 Text

In Part I, we preprocessed our raw text and tokenized it into small chunks. Now we need to convert these sequences of tokens into numbers, which is the job of the vectorizer.

A Bag-of-Words (BOW, also known as Count) vectorizer counts the number of occurrences of each unique token in a document. It's common to have options to ignore words that appear too infrequently or too frequently. A Term Frequency (TF) vectorizer normalizes the raw counts to show what proportion of the document each term constitutes (i.e. vector elements sum to one). A TF-IDF vectorizer further multiplies the TF by the log of the inverse of how frequently each term appears across the document corpus.
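
With scikit-learn, for example:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog ate my homework"]
# min_df / max_df drop terms that appear too rarely / too often
vectorizer = TfidfVectorizer(min_df=1, max_df=1.0)
X = vectorizer.fit_transform(docs)             # sparse (n_docs, n_terms) matrix
print(vectorizer.get_feature_names_out())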

Word and sentence embeddings (e.g. Word2vec, GloVe, fastText) provide the next level of sophistication in text feature generation. Note that due to the way these are trained, i.e. unsupervised training on co-occurrence statistics, the mathematical 'closeness' of the resulting word embeddings reflects not semantic similarity but rather co-occurrence of words; e.g. "Love" and "Hate" frequently occur together in sentences and will thus be mapped close together in the embedding space even though they are antonyms.

Also note that these simple embeddings are static and for each word, there is but one, and only one, corresponding embedding. As such, they are not capable of semantic disambiguation, i.e. the ‘bear’ in black bear, bear market and bear arms will all be assigned the same vector embedding.

When it comes to embeddings (such as Word2vec), you have a choice between using pretrained embeddings or having the model learn its own embeddings on your data via back-propagation. Sentences can also be embedded, if you wish to create a single feature that summarizes a sentence: by averaging the individual word vectors, by using the embedding of a transformer's [CLS] token, or by using something like Doc2vec. For phrases (e.g. "frying pan"), try sense2vec.
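
A simple baseline for the averaging approach, sketched with gensim's pretrained GloVe vectors (the model name and whitespace tokenization are illustrative; the [CLS] route requires a transformer such as BERT):

import numpy as np
import gensim.downloader as api

kv = api.load("glove-wiki-gigaword-100")      # pretrained 100-d GloVe vectors

def sentence_embedding(tokens):
    # average the vectors of in-vocabulary tokens; zero vector if none are known
    vecs = [kv[t] for t in tokens if t in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)

feature = sentence_embedding("the quick brown fox".split())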

One final trick to create additional text features is to perform double-translation on each piece of text (e.g. English → Spanish → English), and then predict on both corpora using an ensemble.

2.8 Image

Image by Jan Tik marked CC BY 2.0.

Pretrained models, typically CNNs, trained on large datasets learn to represent whole images as latent representations in their final layers. By truncating the ‘head’, but keeping the ‘body’, these models can be used as feature extractors. Feed an image in and out comes an image embedding, with common output dimensions of 768 or 2048.
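
A sketch with Keras (the file name is a placeholder; ResNet50 with include_top=False and global average pooling yields a 2048-d embedding):

import numpy as np
import tensorflow as tf

# keep the 'body' only: include_top=False drops the classifier head and
# pooling='avg' collapses the final feature map to a single 2048-d vector
extractor = tf.keras.applications.ResNet50(weights='imagenet',
                                           include_top=False, pooling='avg')
img = tf.keras.utils.load_img('photo.jpg', target_size=(224, 224))
x = np.expand_dims(tf.keras.utils.img_to_array(img), axis=0)
x = tf.keras.applications.resnet50.preprocess_input(x)
embedding = extractor.predict(x)               # shape: (1, 2048)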

Not only can you pick from different model architectures (e.g. VGG16, ResNet50, MobileNet), you can also pick weights based on training the model on a domain-specific subset of images. This works because of the No Free Lunch theorem: there is no one-size-fits-all, universally optimal solution. TensorFlow Hub offers a selection of ResNet50 models with identical architectures but trained on different subsets of ImageNet categories. Using one trained on the flower subtree, I was able to obtain significantly higher accuracy on the Oxford102 dataset than with weights trained on the entire ImageNet-1K (i.e. relevance beats abundance).

For the state-of-the-art, just as Transformers revolutionized NLP through the application of attention, one can also use a pretrained Vision Transformer (ViT) for extracting (768 to 1024 dimension) vector representations of an image.

2.9 Time-Series

For time-series data, in addition to the techniques already mentioned in [Sect 2.1], one can generate new features by adding one or more lag features: simply shift the data back by a set amount of time, e.g. with pd.DataFrame.shift(). Use autocorrelation, e.g. statsmodels.tsa.stattools.acf(), to determine the most predictive lag period(s).
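
For example (a sketch; the sales dataframe and 'units_sold' column are hypothetical, and the chosen lags would come from inspecting the ACF):

from statsmodels.tsa.stattools import acf

# inspect autocorrelation to pick lags, then add the lag columns
acf_values = acf(sales['units_sold'], nlags=30)
for lag in (1, 7, 14):
    sales[f'units_sold_lag_{lag}'] = sales['units_sold'].shift(lag)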

Another technique, which has the added benefit of also detrending a time-series, is to encode the series as a difference or delta from some period back. For time-series that have a strong trend, e.g. stock prices over long timeframes, it’s better to use ratiometric or percentage change rather than absolute change, as that better reflects how prices change. Both of these transformations (difference or ratiometric) can also be applied to the target variable (have the model predict the following day’s change rather than the absolute value).

Moving-window aggregations (also known as trailing indicators) can also add context. For example, the moving average of prices can be used to detect market trends. Again, using different aggregation time-spans can be helpful, e.g. traders often use 50 and 100-day SMAs. Expanding-windows can also be used, but are less common.
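
In pandas, differencing, percentage change and moving windows are each one-liners (a sketch with a hypothetical prices dataframe):

# detrended and moving-window features for a price series
prices['diff_1d'] = prices['close'].diff()              # absolute day-over-day change
prices['ret_1d'] = prices['close'].pct_change()         # percentage change
prices['sma_50'] = prices['close'].rolling(50).mean()   # 50-day simple moving average
prices['vol_20'] = prices['ret_1d'].rolling(20).std()   # 20-day rolling volatility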

Then there are domain-specific features, such as adding a feature to indicate how much of a gap-up/down there is from the previous day’s close. The MACD indicator is computed using the difference between two Exponential Moving Averages (EMAs) and is used to detect when trends are accelerating. For weather-related time-series, calculate the dew point (a function of temperature and pressure).
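
The MACD piece of that paragraph, sketched with pandas' ewm() on the same hypothetical prices dataframe (12/26/9 are the conventional spans):

# difference of a fast and a slow exponential moving average, plus a signal line
ema_fast = prices['close'].ewm(span=12, adjust=False).mean()
ema_slow = prices['close'].ewm(span=26, adjust=False).mean()
prices['macd'] = ema_fast - ema_slow
prices['macd_signal'] = prices['macd'].ewm(span=9, adjust=False).mean()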

Time-domain data can be transformed into the frequency-domain. Performing a Fourier Transform alone falls more in the category of feature extraction than feature generation, since the transformed data is no longer anchored in time, and so cannot be used alongside time-domain features, such as those described above. However, time can be incorporated back in, for example in the case of spectrograms, we can track power spectral density changes as a function of time.
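
A minimal spectrogram sketch with SciPy (the signal and sampling rate are stand-ins for real data):

import numpy as np
from scipy.signal import spectrogram

fs = 250                                   # sampling rate in Hz (placeholder)
signal = np.random.randn(10 * fs)          # stand-in for a real sensor trace
freqs, times, Sxx = spectrogram(signal, fs=fs, nperseg=fs)
# Sxx has shape (n_freqs, n_time_bins): power per frequency band per time window,
# so the frequency-domain features remain aligned with the time axis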

2.10 Representation Learning

Representation Learning (RL) refers to automatically learning latent representations of the raw data, rather than hand-crafting features with explicit statistical recipes. Some examples have already been described, such as word embeddings and CNN/ViT-based image feature extractors.

RL has been very successfully applied in computer vision. SimCLR, for instance, learns visual representations from unlabeled data using self-supervised contrastive learning. Then, by introducing just a small subset of labels, the learned representations can be leveraged to classify the rest of the dataset with remarkable accuracy.

Feature extractors are particularly useful for performing similarity searches. This is because once trained (or if using a pretrained one), the extracted feature vector is invariant for a given input. In similarity search, you compare one sample against the entirety of your database based on a similarity metric such as cosine similarity. By pre-computing all the feature vectors corresponding to your raw inputs just once, you can store them and subsequently perform nearest-neighbor search very efficiently using a specialized similarity-search library such as FAISS.
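
A sketch of cosine-similarity search with FAISS (the random vectors stand in for precomputed embeddings):

import numpy as np
import faiss

d = 2048                                            # embedding dimension
db = np.random.rand(10_000, d).astype('float32')    # stand-in for stored embeddings
faiss.normalize_L2(db)                              # cosine similarity = inner product on unit vectors
index = faiss.IndexFlatIP(d)
index.add(db)

query = np.random.rand(1, d).astype('float32')
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)                # top-5 nearest neighbours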

This concludes Part II of this series on feature engineering. We have now amassed a profusion of features, so in Part III we’ll turn our attention to Feature Selection, where we look at ways to separate the wheat from the chaff to ensure we’re not overfitting on our data.

--

Deep-Learning Engineer. I’ve worked on TD-fNIRS neuro-imaging at Kernel, LiDAR sensors at Quanergy, Stadia gaming at Google, and presently de novo drug design.