Using Python to evaluate feature sets

Stephen Godfrey
Stephen Godfrey’s blog
5 min read · Mar 26, 2019


Overview

The scikit-learn machine learning library for Python provides useful functionality for evaluating models and selecting an optimal set of variables or features (scikit-learn.org). In fact, the library provides a number of classes specifically designed to help model builders select a set of explanatory independent variables, or features. For example, the feature_selection module offers tools for eliminating features with low variance, selecting features based on univariate statistical tests, performing recursive feature elimination, and selecting features from a fitted model. All of these approaches can provide useful and powerful ways to select a core set of explanatory variables.

In this blog post, I approach the feature selection question from a somewhat different perspective. This approach is based on the idea that it can be useful to examine the performance of a wide range of models employing various feature sets before making a final determination on which variables to include and which to exclude. If feasible, such an examination can provide the model builder with helpful insights into both the explanatory usefulness of individual variables and the benefit or costs of grouping variables into sets.

Approach

The approach employs scikit-learn’s modeling classes to build a large number of models and then pandas DataFrames to examine them. In this example, it is demonstrated using linear regression models applied to the Boston house-prices dataset accessible through scikit-learn.

The two-step process, detailed in the code at the post’s end, is

  1. Build a model-evaluator function that
  • Instantiates a linear regression model using scikit-learn’s LinearRegression class and declares a model_library list
  • Establishes a for loop using itertools to generate all possible combinations of features between a minimum and maximum size and, for each combination, fits and scores a linear regression model with that feature set, X, and the target variable, y
  • Appends the output to a list of dictionaries and returns a DataFrame

  2. Use the DataFrame to examine the models by sorting, extracting values and plotting
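The two steps above can be sketched as follows. This is a minimal illustration, not the post’s exact code: the function name model_evaluator and the column names (features, n_features, coefficients, R2, CV_R2) are assumptions chosen to match the narrative.

```python
from itertools import combinations

import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score


def model_evaluator(X, y, min_features=1, max_features=None):
    """Fit a linear regression on every feature combination and
    return the results as one DataFrame row per model."""
    if max_features is None:
        max_features = X.shape[1]
    model_library = []
    for n in range(min_features, max_features + 1):
        for feature_set in combinations(X.columns, n):
            features = list(feature_set)
            model = LinearRegression()
            model.fit(X[features], y)
            # Mean cross-validation R^2 on held-out folds
            cv_r2 = cross_val_score(model, X[features], y, cv=5).mean()
            model_library.append({
                "features": features,
                "n_features": n,
                "coefficients": model.coef_,
                "R2": model.score(X[features], y),
                "CV_R2": cv_r2,
            })
    return pd.DataFrame(model_library)
```

Calling model_evaluator(X, y) on a DataFrame of features and a target series then yields the model_df examined below.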

Examining models and feature sets

Now that the model variables, coefficients and scores are stored in a DataFrame (model_df), we can use pandas functionality to explore the results. In this case, we evaluated 8,191 feature-combination sets, which is the count of the 13 one-feature models plus the 78 two-feature models and so on through the single 13-feature model. The number of models can be found by examining the shape of the resulting DataFrame or by doing the combinatoric math.
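The combinatoric math can be checked directly: the model count is the number of non-empty subsets of 13 features.

```python
from math import comb

# C(13, 1) + C(13, 2) + ... + C(13, 13), i.e. every non-empty
# subset of 13 features, which equals 2**13 - 1.
n_models = sum(comb(13, k) for k in range(1, 14))
print(n_models)  # 8191
```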

DataFrame (model_df) of model_evaluator output sorted by CV_R2 score

The first step in examining models might be to sort this DataFrame (model_df) by the cross-validation score, CV_R2. If we do, we see that the best-performing linear regression model has 10 features out of a possible 13 and achieves CV_R2 = 0.505. We might also look at the worst-performing model, which in this case turned out to be a single-independent-variable model with CHAS, the Charles River dummy variable, as the sole feature.
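Pulling out the best and worst models is a one-line sort in pandas. The toy table below stands in for the full model_df; its scores are hypothetical, for illustration only.

```python
import pandas as pd

# Toy results table in the same shape as model_df in the post
# (hypothetical values, not the actual Boston results).
model_df = pd.DataFrame({
    "features": [["LSTAT"], ["CHAS"], ["LSTAT", "PTRATIO"]],
    "n_features": [1, 1, 2],
    "CV_R2": [0.41, 0.02, 0.43],
})

# Best- and worst-scoring models by cross-validation R^2.
best = model_df.sort_values("CV_R2", ascending=False).iloc[0]
worst = model_df.sort_values("CV_R2", ascending=True).iloc[0]
print(best["features"], worst["features"])
```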

Plot showing model feature sets and performance

It is also interesting to plot the models by number of features (y-axis) against their cross-validation scores (x-axis). We can make some useful observations from this chart. First, there are a large number of models with negative cross-validation scores. While R² cannot be negative when a model is scored on its own training data, negative values can occur when the data are split and the models are evaluated on held-out test data. Second, we can gain some intuition about the benefit of adding variables by comparing model scores at each step on the y-axis. For example, we can see how the 13-feature model compares to the distribution of scores among 12-feature models. Third, there is some consistency among the best scores for models with at least 8 features. This can be seen from the vertical blue line drawn through the best-scoring model (10 features). The best models with 8 features approach this value, with the top model’s CV_R2 at 0.492. Fourth, the best two-feature model does almost as well, with CV_R2 = 0.430; in this case the features are PTRATIO, the pupil-teacher ratio by town, and LSTAT, the percentage of the population with lower status. Finally, we can use the chart to understand the worst-performing combinations. In the two-variable case, these are ZN, the proportion of residential land zoned for lots over 25,000 square feet, and DIS, the weighted distance to five Boston employment centres.
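A chart of this kind can be sketched with matplotlib. The data below are hypothetical placeholders for the full model_df; only the plotting pattern (scores on x, feature counts on y, a vertical line through the best score) mirrors the post.

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical scores standing in for the full exhaustive run.
model_df = pd.DataFrame({
    "n_features": [1, 1, 2, 2, 3],
    "CV_R2": [0.02, 0.41, -0.10, 0.43, 0.45],
})

fig, ax = plt.subplots()
# Scores on the x-axis, feature counts on the y-axis.
ax.scatter(model_df["CV_R2"], model_df["n_features"], alpha=0.5)
# Vertical blue line through the best-scoring model.
best_score = model_df["CV_R2"].max()
ax.axvline(best_score, color="blue")
ax.set_xlabel("CV_R2")
ax.set_ylabel("Number of features")
fig.savefig("model_scores.png")
```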

DataFrame (model_df) examining top 2-feature models

Considerations

This approach to feature selection does present some challenges. First, it produces many models that must be evaluated and ultimately selected or rejected. If the goal is to select a single feature set, some criterion must be applied to decide which model or models to accept. Simple decision rules (for example, accept the highest-scoring model and reject the others) or arbitrary heuristics (such as selecting the best 8-feature model) may need to be enforced to make the output operational. In this case, we produced over 8,000 models and examined only a few. Still, the process did surface select feature sets for further investigation.
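One such decision rule, keeping only the best-scoring model at each feature count, reduces to a pandas groupby. The table below is a hypothetical stand-in for the full 8,191-model run.

```python
import pandas as pd

# Hypothetical results table (illustrative values only).
model_df = pd.DataFrame({
    "features": [["LSTAT"], ["CHAS"], ["LSTAT", "PTRATIO"], ["ZN", "DIS"]],
    "n_features": [1, 1, 2, 2],
    "CV_R2": [0.41, 0.02, 0.43, -0.05],
})

# Decision rule: keep the best-scoring model at each feature count.
best_per_count = model_df.loc[model_df.groupby("n_features")["CV_R2"].idxmax()]
print(best_per_count[["n_features", "CV_R2"]])
```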

Second, this process can be computationally intensive, especially if the feature set or dataset is large. This example took my computer 113 seconds to run on the Boston house-prices data. However, the computational needs grow rapidly as the number of features increases, and the approach may become impractical at very large sizes.
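The growth is exponential: with n features there are 2ⁿ − 1 non-empty feature subsets, so each added feature roughly doubles the work.

```python
# Candidate model count for a few feature-set sizes:
# every added feature roughly doubles the exhaustive search.
for n in (13, 20, 30):
    print(f"{n} features -> {2**n - 1:,} models")
```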

Conclusion

While a number of tools are available to select features, it can be beneficial to select features based on their performance in various model combinations. One way to do this is to build functions that, through brute force, consider a wide range of models and then to employ an approach for evaluating the associated feature sets. Here, we demonstrated that the Python libraries of scikit-learn and pandas have useful classes for such a task.


