Feature Engineering Experiment: Weighted KNN
Using KNN to learn feature weights
One thing that I have learned the hard way about machine learning problems is that most of them require a large share of the hours to be dedicated not to data wrangling but to EDA and feature exploration.
This crucial step is often underplayed in most courses and discussions, primarily because it is as much an art as a science. And when I say art, it is primarily because the possible approaches to feature engineering, like hyperparameter tuning, are limited only by computational resources and time.
These methods range from wrapper approaches for subset selection to feature extraction and feature construction, leveraging anything from KNNs to Random Forests to variance thresholds.
This article discusses using KNNs to learn weights for the available features and then selecting the features with the largest weights.
Intuitively, the approach works as follows:
1. For every observation in the dataset, calculate the similarity between that observation and all other observations. Where this differs from plain vanilla KNN is that the similarity is weighted.
In vanilla KNN with Euclidean distance, the distance between two observations p and q is given by

d(p, q) = sqrt( Σᵢ (pᵢ - qᵢ)² )

But in the weighted form we will have something like

d_w(p, q) = sqrt( Σᵢ wᵢ (pᵢ - qᵢ)² )

where for each i-th feature there is a corresponding weight wᵢ.
2. This similarity/distance measure is then used to select the k nearest neighbors of a given observation.
3. Then, as in standard KNN, the neighbors are used to predict the label of the observation.
4. Based on the predictions we can calculate performance metrics such as accuracy, AUC or log loss.
This metric in turn acts as the loss function for our overall feature-learning algorithm: we keep updating the weights with an optimization algorithm such as gradient descent until the loss function is minimized.
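To make step 1 concrete, here is a minimal sketch of a per-feature weighted distance. I use the weighted absolute (L1-style) variant, since that is what the code later in the article computes; the function name and example values are mine:

```python
import numpy as np

def weighted_distance(p, q, w):
    """Weighted absolute distance between two observations p and q."""
    return np.sum(np.abs(w) * np.abs(p - q))

p = np.array([1.0, 2.0, 3.0])
q = np.array([2.0, 0.0, 3.0])
w = np.array([0.5, 1.0, 0.0])  # a zero weight effectively switches a feature off
weighted_distance(p, q, w)     # 0.5*1 + 1.0*2 + 0.0*0 = 2.5
```

Note how a feature with weight zero contributes nothing to the distance: driving noise features toward zero weight is exactly what the optimizer will try to do.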
Let’s run some experiments in Python. Code is here #github
Import the required Libraries
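A plausible set of imports for this experiment (my guess at the originals, based on the libraries used later in the article):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from scipy.optimize import minimize
```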
Then, using sklearn’s make_classification, we create a dataset of 300 observations in which only 5 features are informative.
Then, using np.random.normal, we add 10 features that are essentially just noise.
Then we initialize all the weights to zero and set k for KNN to 5.
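Sketching that setup in code (the random_state value is my assumption, added for reproducibility):

```python
import numpy as np
from sklearn.datasets import make_classification

# 300 observations where all 5 generated features are informative
X_informative, y = make_classification(
    n_samples=300, n_features=5, n_informative=5,
    n_redundant=0, n_repeated=0, random_state=42,
)

# 10 extra features that are pure standard-normal noise
noise = np.random.normal(size=(300, 10))
X = np.hstack([X_informative, noise])

# All 15 feature weights start at zero; k for KNN is 5
weights = np.zeros(X.shape[1])
k = 5
```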
Then we wrap the KNN algorithm in a function. Let’s go over the overall approach:
1. We iterate over all pairs of observations,
2. calculating the weighted absolute distance between them.
3. Then we pick the k nearest neighbors and use them to predict, for each observation, the correct label.
4. Then we use the predicted probabilities to calculate the accuracy score.
Below is the complete code for knnWeights function
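A sketch of what such a knnWeights loss function can look like, assuming a leave-one-out majority vote over binary labels and weighted absolute distances (the exact original implementation may differ):

```python
import numpy as np
from sklearn.metrics import accuracy_score

def knnWeights(weights, X, y, k=5):
    """Return 1 - KNN accuracy when distances are scaled by `weights`.

    Leave-one-out style: each observation is classified by its k nearest
    neighbours under the weighted absolute distance, excluding itself.
    """
    n = X.shape[0]
    preds = np.empty(n, dtype=int)
    for i in range(n):
        # weighted absolute distance from observation i to every row
        d = np.sum(np.abs(weights) * np.abs(X - X[i]), axis=1)
        d[i] = np.inf                        # never pick yourself as a neighbour
        nbrs = np.argsort(d)[:k]             # indices of the k nearest neighbours
        preds[i] = y[nbrs].mean() >= 0.5     # majority vote over binary labels
    # return a loss (1 - accuracy) so an optimizer can drive it down
    return 1.0 - accuracy_score(y, preds)
```

Returning `1 - accuracy` rather than accuracy itself is deliberate: scipy’s optimizers minimize, so lower loss means better neighbourhood structure under the current weights.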
Then we use scipy’s minimize to learn the weights
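A self-contained sketch of that optimization step. The dataset sizes here are shrunk so it runs quickly, and I use the gradient-free Powell method because the loss is piecewise constant (the method choice is my assumption, not necessarily what the original code used):

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification

# Small illustrative dataset: 5 informative features plus 5 noise features
X_inf, y = make_classification(n_samples=60, n_features=5, n_informative=5,
                               n_redundant=0, random_state=0)
X = np.hstack([X_inf, np.random.default_rng(0).normal(size=(60, 5))])

def knn_loss(w, X, y, k=5):
    # compact version of the knnWeights idea: 1 - leave-one-out KNN accuracy
    correct = 0
    for i in range(len(y)):
        d = np.sum(np.abs(w) * np.abs(X - X[i]), axis=1)
        d[i] = np.inf
        correct += int((y[np.argsort(d)[:k]].mean() >= 0.5) == y[i])
    return 1.0 - correct / len(y)

w0 = np.zeros(X.shape[1])          # start from all-zero weights, as above
res = minimize(knn_loss, w0, args=(X, y, 5), method="Powell")
learned_weights = res.x
```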
To get a sense of how good the learned weights are, we can run KNN using all features and compare it with KNN run using only the 5 features with the highest weights (remember, while creating the dataset we had set n_informative to 5). We also compare against a randomly selected subset of features.
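A sketch of that comparison. Here `learned_weights` is a random placeholder standing in for the output of minimize, and the train/test split, seeds, and helper name are my choices:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

rng = np.random.default_rng(0)
X_inf, y = make_classification(n_samples=300, n_features=5, n_informative=5,
                               n_redundant=0, random_state=0)
X = np.hstack([X_inf, rng.normal(size=(300, 10))])

# Placeholder: in the experiment these come from scipy's minimize
learned_weights = rng.random(X.shape[1])

top5 = np.argsort(np.abs(learned_weights))[-5:]        # highest-weighted features
rand5 = rng.choice(X.shape[1], size=5, replace=False)  # random subset

def evaluate(cols):
    """Fit a plain KNN on the given feature columns; return (accuracy, AUC)."""
    Xtr, Xte, ytr, yte = train_test_split(X[:, cols], y, random_state=0)
    knn = KNeighborsClassifier(n_neighbors=5).fit(Xtr, ytr)
    proba = knn.predict_proba(Xte)[:, 1]
    return accuracy_score(yte, knn.predict(Xte)), roc_auc_score(yte, proba)

all_acc, all_auc = evaluate(np.arange(X.shape[1]))  # all features
sel_acc, sel_auc = evaluate(top5)                   # highest-weighted features
rnd_acc, rnd_auc = evaluate(rand5)                  # random subset
```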
- Results using the KNN classifier with all features
- Results using the KNN classifier with the highest-weighted features
- Results using a random subset of features
We can combine these metrics (accuracy and AUC) per experiment into a dataframe and run the experiment 10 times to get an overall sense of how the three approaches compare.
From 10 experiments the results were
where:
- rfa = random feature subset accuracy
- rfauc = random feature subset AUC
- woa = no feature selection accuracy
- woauc = no feature selection AUC
- wa = feature selection accuracy
- wauc = feature selection AUC
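The collection-and-aggregation mechanics can be sketched with pandas like this. The numbers below are random placeholders only, standing in for one run’s metrics each; the real values come from the per-run evaluations:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
cols = ["rfa", "rfauc", "woa", "woauc", "wa", "wauc"]

# Placeholder values only, to show the dataframe mechanics:
# in the experiment each row holds the metrics from one of the 10 runs
results = pd.DataFrame(rng.uniform(0.5, 1.0, size=(10, len(cols))),
                       columns=cols)

# Aggregate across runs to compare the three approaches
summary = results.agg(["mean", "std"])
```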
To get a better sense we can aggregate the results
What we see is that random feature-subset selection has the worst mean performance, the features selected using KNN weights have the highest, and no feature selection falls in between the two approaches.