How Our Auto ML Platform Handles Feature Selection

Ceren Güzelgün
Insider Engineering
5 min read · Aug 1, 2022

To manage all of our machine learning workloads, we at Insider have developed our very own auto ML platform: Delphi. The name originates from the famous sanctuary in ancient Greece, which was believed to be the center of the world. Likewise, Delphi is the center of our ML-based products.

Delphi handles every step of the machine learning pipeline. It maintains multiple feature stores containing thousands of features built for different kinds of business problems. It populates these stores with historical and recent data on a daily basis. It trains unique models for each business problem per partner, totaling more than 2500 models per week.

Insider collaborates with a large number of partners from many countries and industry sectors. Users of an airline website in Asia behave very differently from users of an e-commerce website in the US. Delphi accounts for these differences and enables us to provide equally precise predictions for all the brands we work with, regardless of their region and business vertical. Feature selection lies at the heart of what we achieve with Delphi.

Insider’s in-house auto ML platform Delphi’s infrastructure flow.

A single process handles the feature selection step for more than ten algorithms that provide predictions for various business problems. The algorithms' designs are significantly diverse: they feed from multiple feature sources and have labels that may be numerical or categorical. Some of them are regression algorithms, and others are classification algorithms.

The process starts with statistical selection, where tests are run on feature-label pairs. The next step is variant selection, where we choose among a feature's different variations. Then comes feature elimination, where only one of any pair of strongly correlated features survives. The final phase is recursive selection, where the features selected so far are fed into a tree-based model in an iterative process. Each stage is explained in detail below.

The selection job requires an algorithm parameter to know which feature stores are used by the algorithm. There are multiple feature stores for different sorts of information. If we want to predict whether a user will uninstall an application from their phone, we refer to the Mobile Feature Store, where we keep information about their device and their activities on the mobile platform. If we want to predict whether a user will open an email, the information we need lives in the Email Feature Store, where we keep features related only to users' email habits.

With the algorithm parameter, the selection job knows which feature stores are suitable for that problem. The parameter also gives insight into the nature of the algorithm. Is it a regression or a classification problem? What is the minimum number of features we can accept at the end of this process? Which importance scores do we consider too high or too low? Once the job knows this, it creates a data frame of all the existing features from the suitable stores. The stores are built for general purposes, so not every feature will be relevant to every algorithm. We therefore give them all to the feature selection job and let it decide.
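To make this concrete, here is a minimal sketch of what such a per-algorithm configuration might look like. The keys, store names, and threshold values below are hypothetical illustrations, not Delphi's actual schema:

```python
# Hypothetical per-algorithm configuration; keys and values are illustrative only.
ALGORITHM_CONFIG = {
    "uninstall_prediction": {
        "feature_stores": ["mobile_feature_store"],  # which stores feed the data frame
        "problem_type": "classification",            # regression or classification
        "min_features": 12,                          # smallest acceptable final feature set
        "importance_bounds": (0.001, 0.6),           # importance scores considered too low / too high
    },
    "email_open_prediction": {
        "feature_stores": ["email_feature_store"],
        "problem_type": "classification",
        "min_features": 10,
        "importance_bounds": (0.001, 0.6),
    },
}

# The selection job would read this entry to decide which stores to pull
# candidate features from and how aggressively it is allowed to prune them.
config = ALGORITHM_CONFIG["uninstall_prediction"]
```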

During statistical selection, the job sorts available features and creates groups of feature-label pairs of various types.

Tests applied to different types of feature-label pairs.

When the label is categorical, an ANOVA test is applied for numerical features and a chi-square test for categorical features. When the label and features are both numerical, we apply the Pearson correlation test. Statistical selection is the lengthiest step of the entire process, so we have added an option to skip directly to variant selection when needed. This option is often used in the staging environment where we conduct experiments. In production, features are always statistically selected first.
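As an illustration of this stage, the sketch below pairs each feature with the label and picks the appropriate test, assuming pandas data frames and SciPy's implementations of the tests; the significance threshold is an assumed value, not necessarily the one Delphi uses:

```python
# Sketch of statistical selection with assumed thresholds; not Delphi's exact code.
import pandas as pd
from scipy.stats import chi2_contingency, f_oneway, pearsonr

P_VALUE_THRESHOLD = 0.05  # assumed significance cut-off


def statistical_selection(features: pd.DataFrame, label: pd.Series) -> list[str]:
    """Keep features whose test against the label is statistically significant."""
    label_is_numeric = pd.api.types.is_numeric_dtype(label)
    selected = []
    for column in features.columns:
        feature = features[column]
        if pd.api.types.is_numeric_dtype(feature) and label_is_numeric:
            # numerical feature vs. numerical label -> Pearson correlation test
            _, p_value = pearsonr(feature, label)
        elif pd.api.types.is_numeric_dtype(feature):
            # numerical feature vs. categorical label -> one-way ANOVA across label classes
            groups = [feature[label == cls] for cls in label.unique()]
            _, p_value = f_oneway(*groups)
        elif not label_is_numeric:
            # categorical feature vs. categorical label -> chi-square on the contingency table
            _, p_value, _, _ = chi2_contingency(pd.crosstab(feature, label))
        else:
            continue  # categorical feature with a numerical label is not covered by these tests
        if p_value < P_VALUE_THRESHOLD:
            selected.append(column)
    return selected
```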

Then comes variant selection, because our features have variations. A single feature is computed for four data spans. If the feature is the number of purchases a user makes, for example, we aggregate that information for the past week, month, 3 months, and 6 months. We also store the square root, square, and natural logarithm of each of these values. This totals 16 variants for a single feature, but we only want the best variant to go into the final model. So we get the feature importances from a tree-based model and continue with the variants that have the highest importance.
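A rough sketch of how the best variant per feature could be picked follows. It assumes variant columns share a base name (e.g. purchase_count__1w_sqrt), that the features are numerical, and that a random forest is the tree-based model; these are assumptions for illustration, and a regressor would be used for regression problems:

```python
# Sketch of variant selection; the column naming scheme and model are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def variant_selection(features: pd.DataFrame, label: pd.Series) -> list[str]:
    """Keep only the most important variant of each base feature."""
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(features, label)
    importances = pd.Series(model.feature_importances_, index=features.columns)

    # "purchase_count__1w_sqrt" -> base feature "purchase_count"
    best_per_base = importances.groupby(lambda col: col.split("__")[0]).idxmax()
    return best_per_base.tolist()
```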

Four stages of feature selection.

Before the last step, there is a feature elimination process based on feature-feature correlations. If two features are highly correlated with each other, only one of them makes it into the final model. To decide which features to eliminate, two correlation matrices are created: one for categorical and one for numerical features. If the correlation between two features is higher than the determined elimination threshold, the feature that has the higher correlation with the label stays.
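The sketch below shows the idea for the numerical features, assuming Pearson correlations and an illustrative elimination threshold; the categorical features would go through an analogous matrix with a suitable association measure, which is omitted here:

```python
# Sketch of correlation-based elimination for numerical features; threshold is assumed.
import pandas as pd

ELIMINATION_THRESHOLD = 0.9  # assumed feature-feature correlation cut-off


def correlation_elimination(features: pd.DataFrame, label: pd.Series) -> list[str]:
    """Of every highly correlated pair, keep the feature more correlated with the label."""
    feature_corr = features.corr().abs()
    label_corr = features.corrwith(label).abs()

    dropped: set[str] = set()
    columns = list(features.columns)
    for i, first in enumerate(columns):
        for second in columns[i + 1:]:
            if first in dropped or second in dropped:
                continue
            if feature_corr.loc[first, second] > ELIMINATION_THRESHOLD:
                # keep whichever of the pair correlates more strongly with the label
                dropped.add(second if label_corr[first] >= label_corr[second] else first)
    return [column for column in columns if column not in dropped]
```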

Let’s do a recap before moving on to the final stage. First, all available features were subjected to statistical tests against the label. The surviving features may come in different variations of data spans or orders, which is why the best variant of every feature is chosen in the variant selection step. The features resulting from this stage may still be highly correlated with each other, so the feature elimination step takes care of that and keeps the better feature of each highly correlated pair.

The final stage is recursive selection. Here, we train a tree-based model fed with the selected features and then evaluate it. Features that fall below a certain importance score are removed, another model is trained with the remainder, and the process is repeated for as many iterations as we choose. In the end, the scores from each iteration are compared, and the model with the best evaluation score determines the final feature set.
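A simplified sketch of this loop is below, with assumed choices for the model (a random forest), the importance threshold, the iteration count, and the evaluation metric (cross-validated accuracy); Delphi's actual model, metric, and thresholds may differ:

```python
# Sketch of recursive selection; model, thresholds, and metric are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

IMPORTANCE_THRESHOLD = 0.01  # features below this importance are dropped each round
N_ITERATIONS = 5             # how many elimination rounds to run
MIN_FEATURES = 12            # never shrink below this many features


def recursive_selection(features: pd.DataFrame, label: pd.Series) -> list[str]:
    """Iteratively drop low-importance features and keep the best-scoring feature set."""
    current = list(features.columns)
    best_score, best_features = float("-inf"), current
    for _ in range(N_ITERATIONS):
        model = RandomForestClassifier(n_estimators=100, random_state=42)
        score = cross_val_score(model, features[current], label, cv=3).mean()
        if score > best_score:
            best_score, best_features = score, current

        model.fit(features[current], label)
        importances = pd.Series(model.feature_importances_, index=current)
        survivors = importances[importances >= IMPORTANCE_THRESHOLD].index.tolist()
        if len(survivors) < MIN_FEATURES or len(survivors) == len(current):
            break
        current = survivors
    return best_features
```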

Thousands of features are calculated by our auto ML platform every day. Not all of them are relevant for every algorithm we have created to solve business problems. We also train a unique model for each of our customers, whose users behave differently depending on region and industry. This makes selecting the relevant features for each model a challenge, because creating features relevant to the problem, and selecting them well, are both crucial for training a high-quality machine learning model. Our feature selection job starts with over 2000 potentially relevant features and can whittle them down until only a dozen are left.

We will keep researching and conducting experiments on our pipeline to make it the best it can be.
