A Deep Dive into Spark’s Univariate Feature Selector

Ceren Güzelgün
Insider Engineering
Jan 24, 2022

When building a predictive model, selecting proper features is critical to the model’s success. Using too many features, especially ones not related to the prediction target, can increase the model’s complexity and cause overfitting. So the number of input variables should be optimized to reduce computing costs and produce a high-performing model.

There are various methods of feature selection, some of which are supported by the analytics engine Apache Spark. Since our very own AutoML platform Delphi at Insider is powered by Spark, let’s have a look at its univariate feature selector.

Univariate feature selection is the process of evaluating each feature individually against the response variable to determine the relationship between them. Numerous statistical approaches can be used to test the strength of this relationship. ChiSqSelector has existed as a feature transformer since the 1.6.0 release, and with the 3.1.0 release, Spark introduced several univariate feature selectors named after the underlying test, such as ANOVASelector.

With releases 3.1.1 and 3.2.0, Spark deprecated these individual selector classes and gathered them under a single class named UnivariateFeatureSelector. The class accepts the selection mode and criterion from the user, and applies the suitable test under the hood.

UnivariateFeatureSelector operates on categorical/continuous labels with categorical/continuous features. Given the featureType and labelType information, Spark then picks the score function based on those specific types.

Spark selects the suitable score function based on the given feature and label types.
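As a rough sketch (plain Python, not Spark code), the type-to-test mapping documented for UnivariateFeatureSelector can be thought of as a simple lookup; the function name here is made up for illustration:

```python
# Illustrative sketch of the (featureType, labelType) -> score function
# mapping documented for UnivariateFeatureSelector. Plain Python
# stand-in; Spark resolves this internally.
SCORE_FUNCTION = {
    ("categorical", "categorical"): "chi-squared test",
    ("continuous", "categorical"): "ANOVA F-test",
    ("continuous", "continuous"): "F-value (regression)",
}

def pick_score_function(feature_type, label_type):
    # Hypothetical helper name; raises on combinations Spark does not list.
    try:
        return SCORE_FUNCTION[(feature_type, label_type)]
    except KeyError:
        raise ValueError(f"unsupported combination: {feature_type}/{label_type}")

print(pick_score_function("continuous", "categorical"))  # ANOVA F-test
```

This mirrors the selector initialized below, which declares continuous features and a categorical label and therefore ends up with the ANOVA F-test.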

An example of initializing the selector:

val selector = new UnivariateFeatureSelector()
.setFeatureType("continuous")
.setLabelType("categorical")
.setSelectionMode("numTopFeatures")
.setSelectionThreshold(1)
.setFeaturesCol("features")
.setLabelCol("label")
.setOutputCol("selectedFeatures")

Currently, the selection can be done with one of five supported modes.

  • numTopFeatures : Selects the top N features, where N is given by the setSelectionThreshold parameter (default 50). This was not the best mode for our use case, because it would require optimizing this parameter for each ML model trained for a different customer and business problem.
  • percentile : Quite similar to numTopFeatures, but instead of a fixed number of features, a fraction of them is selected via a given percentage (by default, the top 10%). Since we have a varying number of input parameters for our models for each customer, this method is more dynamic than numTopFeatures. Still, we’d need to optimize this threshold as well.
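To make the difference between these two modes concrete, here is a small sketch (illustrative Python, not Spark API; the function name is hypothetical) of how each mode turns a selectionThreshold into a feature count:

```python
# Hypothetical helper showing how numTopFeatures and percentile
# interpret the selectionThreshold value.
def features_kept(num_features, mode, threshold):
    if mode == "numTopFeatures":
        # Threshold is an absolute count of features to keep.
        return min(int(threshold), num_features)
    if mode == "percentile":
        # Threshold is a fraction; keep the top slice of features.
        return int(num_features * threshold)
    raise ValueError(f"unknown mode: {mode}")

# With 200 candidate features:
print(features_kept(200, "numTopFeatures", 50))  # 50
print(features_kept(200, "percentile", 0.1))     # 20
```

With a varying feature count per customer, the percentile mode scales automatically, while numTopFeatures always keeps the same fixed number.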

Before discussing the following modes, it will be useful to look at the concept of the p-value. There are many formal explanations of the p-value, most of which are argued among statisticians down to every single word. [an example here: “P < 0.05” Might Not Mean What You Think: American Statistical Association Clarifies P Values ]

The simplest explanation in informal terms would be that

The p-value tells us how likely it is to get a result like the one observed, if our null hypothesis is true.

The p-value is used to test the null hypothesis, i.e., our initial assumption.

  • A small p-value indicates a significant result. The smaller the p-value, the more evidence we have that the null hypothesis is probably wrong.
  • If the p-value is large, there is not enough evidence to reject the null hypothesis; the result is not significant.
  • To put meaning into the relative terms small and large, we use a significance level, or threshold.

So when the p-value is smaller than the significance level, there is strong evidence that the null hypothesis is wrong. When the p-value is larger, there is not enough statistical evidence to believe so.

In feature selection, the null hypothesis is the general statement that there is no relationship between the two measured phenomena. So rejecting it tells us there is a relationship.
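The decision rule this implies for a single feature is simple enough to write down as a sketch (illustrative Python; the function name is made up, not part of Spark):

```python
# Minimal sketch of the p-value decision rule behind feature selection:
# reject the null hypothesis ("no relationship") when p < alpha.
def keep_feature(p_value, alpha=0.05):
    # A small p-value is strong evidence against "no relationship",
    # so we keep the feature.
    return p_value < alpha

print(keep_feature(0.01))  # True: likely related to the label
print(keep_feature(0.20))  # False: not enough evidence of a relationship
```

The remaining three selection modes are essentially variations on how this threshold is applied across many features at once.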

Now let’s move on with the rest of the selection modes.

  • fpr: False Positive Rate. This mode chooses all features whose p-values are below a threshold, thus controlling the false positive rate of selection. The default threshold for this mode is 0.05. By choosing p < 0.05, there is strong evidence to rule out the null hypothesis; we select a feature when it is likely that a relationship exists.
  • fdr: False Discovery Rate. This mode follows a similar approach to fpr, but uses the Benjamini–Hochberg procedure to adjust the p-value cutoffs before selecting the features that make the cut. The reason for this adjustment is to control the expected proportion of false discoveries among the selected features.
  • fwe: Familywise Error Rate. When running multiple hypothesis tests as we do, there is always a chance that we reject the null hypothesis when we shouldn’t have; statistically speaking, at a 0.05 significance level we make the wrong decision in about 5 out of 100 tests. This mode tackles that challenge: instead of collecting the features with p-value < threshold, it collects the ones with p-value < threshold / numOfFeatures, a Bonferroni-style correction.
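The three p-value-based modes can be sketched in plain Python. This is an illustration of the underlying statistics under my reading of the procedures, not Spark’s implementation, and the p-values are invented:

```python
def select_fpr(p_values, alpha=0.05):
    # fpr: keep every feature whose raw p-value clears the threshold.
    return [i for i, p in enumerate(p_values) if p < alpha]

def select_fwe(p_values, alpha=0.05):
    # fwe: Bonferroni-style correction; divide the threshold by the
    # number of tests to keep the familywise error rate near alpha.
    n = len(p_values)
    return [i for i, p in enumerate(p_values) if p < alpha / n]

def select_fdr(p_values, alpha=0.05):
    # fdr: Benjamini-Hochberg procedure; find the largest rank k with
    # p_(k) <= (k / n) * alpha, then keep all features up to that rank.
    n = len(p_values)
    ranked = sorted(range(n), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(ranked, start=1):
        if p_values[i] <= rank / n * alpha:
            cutoff = rank
    return sorted(ranked[:cutoff])

pvals = [0.001, 0.009, 0.02, 0.04, 0.30]
print(select_fpr(pvals))  # [0, 1, 2, 3]
print(select_fwe(pvals))  # [0, 1]
print(select_fdr(pvals))  # [0, 1, 2, 3]
```

Note how fwe is the strictest: with five tests its effective threshold drops to 0.01, while fdr recovers the borderline features that fpr would also accept, because the Benjamini–Hochberg cutoffs grow with rank.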

It is important to understand which of these selection modes best suits your feature set when building the selector. In our pipeline, we have observed that the final set of selected features varies greatly depending on the chosen mode and threshold value. Obviously, the performance of the trained model depends heavily on the input features. The goal is to feed the best subset of features to our models, getting optimum results while keeping computation costs in check.

With each new release, Spark’s feature selection capabilities improve, and we’re excited to see what else they have in store.
