
Feature Engineering for Large Datasets

A Few Tips

Antoine Ghilissen
5 min read · Jan 12, 2021


It is estimated that 175 trillion gigabytes of data will be created in 2025 (and that is for the year 2025 alone! source). On top of this, a growing number of companies and institutions are making an effort to increase the availability of data to the public; they develop APIs or create open data portals, such as those of London, UK, or Google Mobility Data, to name a few.

Accessing data has never been easier, and it will only get better!

That is the theory, at least! The question you are trying to answer can be crystal clear, and you might have the perfect metric to measure it, but finding the right dataset can still be a mission. There are many reasons why finding a suitable dataset is hard, and more often than not, some creativity is necessary.

This article aims to address a “problem” data scientists are more and more likely to face:

The Curse Of Dimensionality

As mentioned above, data is becoming both more abundant and more available. This means the essential information tends to be buried in a lot of “noise”: a dataset with millions of observations and thousands of variables is a lot of work to sift through.

In a previous article, I have already talked about one of the methods to overcome hardware limitations, and there are plenty of other ways, such as increasing the swap size, changing the data type of some variables, or working on a sample of the dataset. But once the dataset is comfortably tucked into RAM, where do you start?
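As a rough illustration of the data type trick mentioned above, here is a minimal sketch that downcasts numerical columns and converts low-cardinality text columns to the category dtype. The shrink_dataframe helper and its 50% uniqueness threshold are my own choices, not part of the original workflow:

import pandas as pd

def shrink_dataframe(df):
    """Downcast numerical columns and convert low-cardinality text
    columns to 'category' to reduce the memory footprint."""
    out = df.copy()
    for col in out.select_dtypes(include='integer').columns:
        out[col] = pd.to_numeric(out[col], downcast='integer')
    for col in out.select_dtypes(include='float').columns:
        out[col] = pd.to_numeric(out[col], downcast='float')
    for col in out.select_dtypes(include='object').columns:
        # heuristic: treat columns with relatively few unique values as categorical
        if out[col].nunique(dropna=True) < 0.5 * len(out):
            out[col] = out[col].astype('category')
    return out

print(df.memory_usage(deep=True).sum())  # before
df = shrink_dataframe(df)
print(df.memory_usage(deep=True).sum())  # after

On wide datasets with many repeated strings, the category conversion alone can cut memory usage considerably.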

Feature Selection

Metadata

There might be metadata that is not relevant to the case being studied; it can usually be removed straight away.

# drop metadata columns that carry no information for the analysis
df.drop(columns=['url', 'id', 'source file'], inplace=True)

Note
It is worth looking at these fields once you obtain some insight. You might identify a faulty batch or some issues in the data collection process.

Missing Values

A lot of variables can mean there is a significant proportion of data missing. It is always a good idea to start an Exploratory Data Analysis by analysing the missing data. It makes you think about the data collection process, and you will have that at the back of your mind for the rest of the analysis.

If a visual rendering is possible, something like the following will be very useful:

# sheet_name is defined elsewhere (the name of the sheet/dataset being inspected)
sns.heatmap(df.isnull(),
            cmap=sns.color_palette('Set2', 2),
            cbar=False,
            xticklabels=df.columns,
            yticklabels=False
            ).set_title(f'Missing values for {sheet_name}');

If not, a dictionary of the variables with missing values, along with how bad the damage is, will have to do:

# percentage of missing values per column (column names as keys avoids collisions)
{col: round(df[col].isnull().mean() * 100, 2) for col in df.columns if df[col].isnull().any()}

With this information, you can already assess whether a variable is worth keeping and whether completing it is worth the effort. You can then start thinking about which filling method is suitable, and you will also get a better understanding of the types of features that constitute the dataset.
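To make the “filling method” step concrete, here is a minimal sketch; the 60% threshold and the median/mode strategies are illustrative assumptions rather than recommendations from the article:

# drop columns that are mostly empty (the threshold is an arbitrary choice)
too_sparse = [col for col in df.columns if df[col].isnull().mean() > 0.6]
df = df.drop(columns=too_sparse)

# simple imputation for the remaining gaps
for col in df.columns:
    if df[col].isnull().any():
        if df[col].dtype.kind in 'if':   # numerical: fill with the median
            df[col] = df[col].fillna(df[col].median())
        else:                            # categorical/text: fill with the mode
            df[col] = df[col].fillna(df[col].mode()[0])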

Related Features

Chances are, amongst all these variables, a few are composites of other variables. You might find rates / percentages / age-adjusted fields, or even statistical information such as confidence intervals. Although interesting for understanding the context, these fields are usually not necessary for modelling purposes.

Once these variables have been assessed, there can still be variables with a high correlation/covariance. df.corr() will give you the pairwise correlation between numerical columns (the closer to 1 or -1, the more related the two variables are).

For a visual representation of this matrix, you can use the following code:

sns.heatmap(df.drop(columns=['Target']).corr(), annot=True);

Note
It is always a good idea to re-run the correlation matrix after dropping variables. The correlation factor is scaled between [-1, 1], so removing the most correlated pairs will highlight the next most correlated ones (which might have been hidden until then).
This is especially true for the target variable: you want to identify correlations between independent variables, so removing the dependent variable is the first step of the process.
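With hundreds of columns the heatmap becomes hard to read, and listing the most correlated pairs programmatically can be handier. A minimal sketch (the 0.9 threshold is an arbitrary choice of mine):

corr = df.drop(columns=['Target']).corr().abs()
# keep only the upper triangle to avoid duplicate pairs and the diagonal
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().sort_values(ascending=False)
print(pairs[pairs > 0.9])   # candidate variables to drop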

Feature Importance

If you are trying to solve a classification problem, a Random Forest Classifier can be useful to identify which features are the most influential. It will take some time to compute the results (see below for a way to achieve a faster feature importance ranking). However, after this, you will be able to pinpoint the variables that drive the classification. You will also identify the variables that have virtually no impact (and remove them if necessary).
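A minimal sketch of this approach with scikit-learn; the hyperparameters and the 'Target' column name are assumptions on my part:

from sklearn.ensemble import RandomForestClassifier

X = df.drop(columns=['Target'])
y = df['Target']

# limiting the number and depth of trees keeps the run time reasonable on a large dataset
rf = RandomForestClassifier(n_estimators=100, max_depth=10, n_jobs=-1, random_state=42)
rf.fit(X, y)

importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances.head(20))                # most influential features
print(importances[importances < 0.001])    # candidates for removal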

A hierarchical tree can also be helpful.

Principal Component Analysis

As a last resort, a PCA can help you identify the most important features. It is not the ultimate solution as, sometimes, a lot of Principal Components are necessary to explain an acceptable percentage of the variance observed in the dataset. These PCs might also be hard to translate into actionable insights or KPIs, for example. But it will give you an indication.

It is worth noting you might be able to get some decent visualisations if the dataset is reduced to two or three PC/dimensions.
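A minimal sketch of the PCA route, assuming the remaining columns are numerical; scaling beforehand and the 90% variance target are my own choices:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# PCA is scale-sensitive, so standardise the features first
X_scaled = StandardScaler().fit_transform(df.drop(columns=['Target']))

pca = PCA(n_components=0.9)   # keep enough components to explain 90% of the variance
X_pca = pca.fit_transform(X_scaled)

print(pca.n_components_)                       # how many PCs were needed
print(pca.explained_variance_ratio_.cumsum())  # cumulative explained variance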

Intuition

Some call it business understanding, experience, or even bias… It is not exactly the most robust way to solve this kind of problem, but sometimes trusting your gut can be the fastest way to resolve an issue. It will depend a lot on how familiar you are with the case.

Optimisation

Categorising

If there are natural breaks in a population or if some categories make sense in a particular context, you can speed things up by transforming a continuous variable into a discrete one. For example, modifying the age column to reflect commonly used age groups might be a sensible thing to do.
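A minimal sketch of such a binning with pd.cut, assuming a hypothetical age column (the exact age groups are an arbitrary choice):

bins = [0, 18, 30, 45, 60, 75, 120]
labels = ['<18', '18-29', '30-44', '45-59', '60-74', '75+']
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels, right=False)

# the original continuous column can then be dropped
df = df.drop(columns=['age'])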


Batch Sizes

If you encounter issues during the modelling steps, modifying the batch size might help. You can alter your code to make sure each batch fits within the available memory.
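One way to keep each batch within memory is to stream the data in chunks. A minimal sketch, assuming a hypothetical large_dataset.csv and an arbitrary chunk size:

chunks = pd.read_csv('large_dataset.csv', chunksize=100_000)
for chunk in chunks:
    # each chunk is a regular DataFrame that fits in memory;
    # preprocess it, or feed it to a model that supports incremental fitting
    process(chunk)   # `process` is a placeholder for your own logic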

The worst-case scenario would be to change the modelling algorithm and use stochastic gradient descent. In practice, SGD means each batch is composed of a few randomly selected samples. The classifier/regressor will have to iterate many times to cover all the observations, but every iteration will be very fast.

A positive point about these SGD estimators is that they have hyperparameters to tune and an early_stopping criterion, which makes them very versatile.
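A minimal sketch with scikit-learn's SGDClassifier; the hyperparameter values are illustrative, and X and y are as in the Random Forest sketch above:

from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# SGD is sensitive to feature scale, hence the scaler in the pipeline
model = make_pipeline(
    StandardScaler(),
    SGDClassifier(alpha=1e-4,              # regularisation strength to tune
                  early_stopping=True,     # stop once the validation score stops improving
                  validation_fraction=0.1,
                  n_iter_no_change=5,
                  random_state=42)
)
model.fit(X, y)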

Final Thoughts

The speed at which data is now generated is frightening. Traditional methods of storage, collection and handling are easily outpaced, especially when companies do not have a data strategy in place.

Fortunately, there are more and more solutions on the market to address some of the big data issues but, as an analyst, it is important to realise that the way to approach this kind of data can be very different from the way to approach a more concise/lean dataset.

If anything, I hope this article has given you a few ideas on how to improve your feature selection process. With enough rigour, patience and creativity, you will (hopefully) find the needle in this big data haystack!

For any code snippets you will find in this article:

import numpy as np
import pandas as pd
import seaborn as sns
