Using Kydavra MixerSelector for feature selection

Vladimir Stojoc
softplus-publication
2 min read · Jan 11, 2021

Usually, when we try to select the best features of a dataset, we cast different spells (algorithms), compare the results, and keep the best one. So why not mix them all together and combine their power in a single spell? That is where the MixerSelector comes in.

Using Kydavra MixerSelector

To install Kydavra, we just have to type the following in the command line:

pip install kydavra

Next, we need to import the model, create the selector, and apply it to our data:

from kydavra import PointBiserialCorrSelector
from kydavra import ANOVASelector
from kydavra import MixerSelector
from kydavra import MUSESelector

mixer = MixerSelector(
    [PointBiserialCorrSelector(), ANOVASelector(), MUSESelector(24)],
    strategy='intersection'
)
cols = mixer.select(df, 'diagnosis')

MixerSelector() accepts two parameters:

  • selectors: list, a list of initialized selectors.
  • strategy: str. If set to ‘union’, the selector returns the union of the columns selected by the individual selectors; if set to ‘intersection’, it returns their intersection.

And the select() function takes as parameters the pandas DataFrame and the name of the target column.
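Under the hood, the two strategies boil down to plain set union and set intersection over the column lists that each selector returns. Here is a minimal sketch of that logic (an illustration only, not Kydavra's actual implementation; the function name mix_columns is hypothetical):

```python
def mix_columns(selected_lists, strategy='union'):
    # Each selector returns a list of column names; treat them as sets.
    sets = [set(cols) for cols in selected_lists]
    if strategy == 'union':
        # Keep every column that at least one selector picked.
        result = set.union(*sets)
    elif strategy == 'intersection':
        # Keep only the columns that every selector agreed on.
        result = set.intersection(*sets)
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return sorted(result)

a = ['radius_mean', 'area_mean', 'texture_mean']
b = ['radius_mean', 'area_mean', 'smoothness_mean']

print(mix_columns([a, b], strategy='intersection'))  # ['area_mean', 'radius_mean']
print(mix_columns([a, b], strategy='union'))
```

This also makes the trade-off visible: ‘intersection’ can only shrink the feature set, while ‘union’ can only grow it.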

Let’s see an example:

I chose the Breast Cancer Wisconsin dataset, where we have to predict the ‘diagnosis’ column. The dataset has the following features:

['id', 'diagnosis', 'radius_mean', 'texture_mean', 'perimeter_mean', 'area_mean', 'smoothness_mean', 'compactness_mean','concavity_mean',
'concave points_mean', 'symmetry_mean', 'fractal_dimension_mean',
'radius_se','texture_se','perimeter_se', 'area_se', 'smoothness_se',
'compactness_se', 'concavity_se', 'concave points_se','symmetry_se',
'fractal_dimension_se', 'radius_worst', 'texture_worst',
'perimeter_worst', 'area_worst', 'smoothness_worst', 'compactness_worst', 'concavity_worst', 'concave points_worst',
'symmetry_worst', 'fractal_dimension_worst']

A plain LogisticRegression gives a mean cross_val_score of 0.952569476.
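That baseline can be reproduced approximately with scikit-learn's bundled copy of the same Breast Cancer Wisconsin data. This is a sketch under that assumption, not the author's exact setup (the article works from a CSV with a ‘diagnosis’ column):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# scikit-learn ships the Breast Cancer Wisconsin dataset (30 features).
X, y = load_breast_cancer(return_X_y=True)

# Plain LogisticRegression; max_iter is raised so the lbfgs solver
# converges on the unscaled features.
model = LogisticRegression(max_iter=10000)

# Default 5-fold cross-validation, scored by accuracy.
score = cross_val_score(model, X, y).mean()
print(round(score, 4))  # roughly 0.95
```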

When we mix 2 different spells with the ‘intersection’ strategy, we get:

['area_mean', 'perimeter_worst', 'concavity_mean', 'radius_worst',
'concave points_mean', 'area_worst', 'perimeter_mean','radius_mean']

The mean accuracy dropped slightly to 0.9455364073901567, but the number of features was reduced from 30 to 8, which is amazing.

The next strategy is ‘union’; with the same data and spells, we get:

['symmetry_mean','concavity_se','symmetry_worst','perimeter_mean',
'radius_mean','radius_worst','concave points_mean', 'compactness_se','radius_se','concave points_se','concavity_mean', 'perimeter_se','texture_worst','compactness_worst',
'concavity_worst','texture_mean','concave points_worst', 'fractal_dimension_worst','smoothness_worst','area_se','area_mean',
'perimeter_worst','smoothness_mean','compactness_mean','area_worst']

So now we keep 25 features out of 30, with an accuracy of 0.952569476, which is the same accuracy LogisticRegression achieved on all 30 features. A great selection.

Conclusions

Generally, it all depends on the purpose of your model: if you need better accuracy, it is better to set the strategy parameter to ‘union’; if you need a faster model with fewer features, then ‘intersection’ is what you need.

Thanks for reading.

If you tried Kydavra we invite you to share your impression by filling out this form.

Made with ❤ by Sigmoid.
