Introducing sklearndf: Enhancing Your Data Science Workflow with Data Frame Support and Feature Traceability

Malo Grisard
GAMMA — Part of BCG X
7 min read · Apr 15, 2021


Authors: Malo Grisard & Jan Ittner

The software library scikit-learn [1] is an industry standard for machine learning (ML) and is used extensively by companies to productionise data science. At BCG GAMMA, we use scikit-learn on hundreds of projects every year. One scikit-learn challenge we often encounter in our work is that data transformations and pipelines do not preserve feature names and, instead, produce unlabelled numerical arrays. But in order to enable the interpretation of ML models and discuss the model in a meaningful way, data scientists must be able to relate the variables in the resulting models back to the original features.

For complex ML pipelines, data scientists will thus spend considerable time writing and maintaining additional custom code to reconstruct feature names for transformed data sets — time that could be better spent on additional analysis and insights. In addition, this type of custom coding is a substantial contributor to technical debt. Technical debt describes how incremental enhancements of program code incur an increasing cost of maintenance over time. ML systems are one of the best-known examples in which technical debt occurs often and compounds quickly [2]. In the case of ML pipelines, having to add custom code to enrich the output with meaningful feature names will increase complexity and interdependencies in the code, cause a high future cost of maintenance, and increase the risk of errors. It could also raise the effort of future model enhancements to such a degree that it may actually discourage analytic exploration.

To improve workflow and reduce technical debt for companies productionising data science with scikit-learn, we created sklearndf, an open-source Python library from BCG GAMMA. This new library augments scikit-learn estimators with native support for data frames and enhanced feature traceability, while keeping the original API intact. It is also an essential enabler for GAMMA FACET, our Python toolset for human-explainable AI.

sklearndf provides the following benefits:

Data frame support

sklearndf returns pandas [3] data frames as outputs of all transformers and pipelines, preserving feature names as the column index.

Feature traceability

Three new attributes, feature_names_in_, feature_names_out_, and feature_names_original_, enable feature traceability back to original inputs, even across complex pipelines.

Ease of use

sklearndf is designed to minimise mental load for new users. It fully preserves scikit-learn’s original API, except for appending DF to the name of each class augmented for data frames. For example, SimpleImputer becomes SimpleImputerDF in sklearndf.

Object orientation

sklearndf embraces the object-oriented paradigm more consistently than scikit-learn, establishing a systematic class hierarchy that distinguishes TransformerDF and LearnerDF as subclasses of EstimatorDF, and RegressorDF and ClassifierDF as subclasses of LearnerDF. Anyone who makes heavy use of object-oriented Python and/or type annotations will appreciate how this approach produces more expressive and maintainable code.
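As a brief sketch of how the hierarchy can be exercised (the exact import paths here are our assumptions and may differ slightly from the installed package):

# Sketch of the sklearndf class hierarchy; import paths are assumptions.
from sklearndf import ClassifierDF, EstimatorDF, LearnerDF, TransformerDF
from sklearndf.classification import RandomForestClassifierDF
from sklearndf.transformation import SimpleImputerDF

clf = RandomForestClassifierDF()
imputer = SimpleImputerDF()

# classifiers are learners, and learners are estimators
assert isinstance(clf, ClassifierDF)
assert isinstance(clf, LearnerDF) and isinstance(clf, EstimatorDF)

# transformers form a parallel branch below EstimatorDF
assert isinstance(imputer, TransformerDF) and isinstance(imputer, EstimatorDF)

# the hierarchy pays off in type annotations, e.g. a helper that accepts
# any supervised learner but not a plain transformer:
def learner_name(learner: LearnerDF) -> str:
    return type(learner).__name__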

At this point we need to stress that we would never want the native scikit-learn library to adopt the same approach, and that we don’t see ourselves as trying to “fix” scikit-learn: it is a foundational and broad toolset, designed to take a more minimalist approach to its dependencies on third-party packages such as pandas. Having said that, we also note that scikit-learn’s engineers are already improving support for feature traceability. SLEP15 [4] provides good insight into current considerations for extending scikit-learn’s ability to propagate feature names along pipelines. We look forward to integrating these results into our implementation of sklearndf.

Case Study: Predicting Customer Churn

To illustrate the benefits that sklearndf offers to users of scikit-learn, we will use the example of a typical pipeline for a classification task applied to predicting customer churn. Customer churn refers to consumers who switch to a competitor for services (e.g., changing internet providers) or who end their services (e.g., cancelling a Netflix subscription). As it is more costly to acquire new customers than to retain existing ones, churn prevention is a key task for many businesses. Identifying customers with a high churn likelihood allows business teams to implement actions, such as discounts, that reduce churn.

We will use the well-known Telco Customer Churn dataset from Kaggle for our example. The dataset contains one row for each customer and includes information on those who left within the last month (i.e. churned), along with services signed up for, account information, and demographics. The full code is reproducible with this example notebook.

Improving a Typical Preprocessing Pipeline

Every ML pipeline begins with some form of preprocessing. Common tasks include imputing missing values, encoding categorical features (e.g., one-hot encoding), and transforming continuous features (e.g., log transforms).

Figure 1 shows the minimal changes required to convert a typical scikit-learn preprocessing pipeline into a sklearndf enhanced version. The pipelines are almost identical, except for the addition of DF to the end of every scikit-learn estimator class. For example, SimpleImputer is now SimpleImputerDF.
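The gist of the conversion looks like the following simplified sketch (the step names and imputation strategy are illustrative assumptions, not the exact notebook code):

# Native scikit-learn preprocessing pipeline
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

preprocessing = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="most_frequent")),
        # dense output; suits scikit-learn versions of this article's era
        ("encode", OneHotEncoder(sparse=False)),
    ]
)

# sklearndf version: identical structure, DF-suffixed classes
from sklearndf.pipeline import PipelineDF
from sklearndf.transformation import OneHotEncoderDF, SimpleImputerDF

preprocessing_df = PipelineDF(
    steps=[
        ("impute", SimpleImputerDF(strategy="most_frequent")),
        ("encode", OneHotEncoderDF(sparse=False)),
    ]
)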

Fitting the preprocessing pipeline to our data, we can also see that the output is now a pandas DataFrame instead of the usual numpy array that native scikit-learn would return.
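As a quick sketch, assuming X holds the Telco feature matrix as a pandas data frame:

# Assuming X is the Telco feature matrix loaded as a pandas DataFrame:
transformed = preprocessing_df.fit_transform(X)

type(transformed)    # pandas.core.frame.DataFrame, not numpy.ndarray
transformed.columns  # feature names preserved as the column index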

Having a data frame as output gives users reliable visibility of features. sklearndf also adds properties to estimators to keep track of feature names: each estimator records the features it was fitted with via feature_names_in_, and each transformer reports its output feature names via feature_names_out_.

sklearndf also allows us to trace features across the pipeline: we can use feature_names_original_ on any transformer or pipeline to get a mapping of each output feature onto its input feature (Figure 2).
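In a brief sketch, the three attributes can be inspected directly on the fitted pipeline:

# Traceability attributes on the fitted preprocessing pipeline (a sketch):
preprocessing_df.feature_names_in_        # features seen at fit time
preprocessing_df.feature_names_out_       # features after all transformations
# maps each output feature back to its original input feature, e.g.
# one-hot encoded columns back to their source categorical column:
preprocessing_df.feature_names_original_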

Figure 2. With feature_names_original_ it is straightforward to trace features from post-processing back to the input feature matrix or to identify feature derivatives.

Improving Your ML Pipeline

Pipelines are frequently a series of transformations followed by a final learner step. To simplify the ML workflow for these common cases, sklearndf introduces two new pipeline classes, RegressorPipelineDF and ClassifierPipelineDF. These two classes provide additional guarantees about the structure of the pipeline, thus helping to make typical data science code more streamlined and legible. In both cases, the pipeline takes two arguments: (1) an optional preprocessing step and (2) a learner, i.e., a regressor or a classifier.
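As a sketch of how such a pipeline is assembled (import paths are our assumptions):

# Specialised learner pipeline; module paths are assumptions.
from sklearndf.classification import RandomForestClassifierDF
from sklearndf.pipeline import ClassifierPipelineDF

pipeline_df = ClassifierPipelineDF(
    preprocessing=preprocessing_df,         # optional preprocessing step
    classifier=RandomForestClassifierDF(),  # the learner
)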

With the preprocessing as shown above, let’s fit and score a random forest classifier to predict customer churn. First, we create a pipeline for the learner, including preprocessing, using sklearndf, as sketched above. Then we specify the train-test split. Finally, we fit and score as we normally would using scikit-learn (Figure 3). We use a simple train/test split here for demonstration purposes, knowing that in a “real” data science project you would want to use a more robust cross-validation approach.
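A minimal sketch of these steps, reusing pipeline_df from above (the split parameters are illustrative choices):

# Assuming X (features) and y (churn labels) are loaded from the dataset;
# test_size and random_state are illustrative assumptions.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

pipeline_df.fit(X_train, y_train)
pipeline_df.score(X_test, y_test)  # mean accuracy, as in scikit-learn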

Figure 3. Model training using sklearndf augmented pipelines.

The sklearndf learner pipelines are fully compatible with regular scikit-learn pipelines and support diagnostic functions such as cross_val_score. With the fitted model, we can also generate the usual model-performance summaries (Figure 4).
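For instance, a sketch of scoring the pipeline with cross_val_score (cv=5 is an illustrative choice):

# sklearndf pipelines plug directly into scikit-learn diagnostics
from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipeline_df, X_train, y_train, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")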

Figure 4. Typical model-performance summary for a classifier trained with scikit-learn.

sklearndf learner pipelines are also compatible with typical scikit-learn cross-validation workflows using GridSearchCV and RandomizedSearchCV (Figure 5).
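A sketch of such a search (the parameter grid, and the nested “classifier__…” parameter names following scikit-learn’s convention, are illustrative assumptions):

# Hyperparameter search over the embedded classifier; the nested
# parameter names and the grid itself are illustrative assumptions.
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    estimator=pipeline_df,
    param_grid={
        "classifier__n_estimators": [50, 100, 200],
        "classifier__max_depth": [4, 8, None],
    },
    cv=5,
)
grid.fit(X_train, y_train)
best_pipeline = grid.best_estimator_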

Figure 5. Cross-validation using sklearndf augmented pipelines.

From the grid search results we can access the best learner (best_estimator_): in this example, the random forest classifier with the hyperparameters that achieved the best mean CV performance, re-fitted on all available training data.

Finally, sklearndf provides a selection of useful third-party estimators that follow the scikit-learn idiom. Currently, these include LGBMRegressor, LGBMClassifier, and Boruta, which are provided via the .extra modules (for example, sklearndf.transformation.extra). See here for simple examples.
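As a sketch, the wrappers can be imported as follows (exact module paths are assumptions based on the .extra convention described above):

# Third-party wrappers; module paths are assumptions based on the
# ".extra" convention mentioned above.
from sklearndf.classification.extra import LGBMClassifierDF
from sklearndf.regression.extra import LGBMRegressorDF
from sklearndf.transformation.extra import BorutaDF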

Summary

We hope that the above explanation sparks your interest in sklearndf. If you, like us, use scikit-learn in conjunction with pandas, sklearndf will help you streamline your scikit-learn ML workflow while producing richer and more consistent output.

sklearndf is available from Anaconda:

conda install sklearndf -c bcg_gamma -c conda-forge

or PyPI:

pip install sklearndf

Check out the GitHub repository and the documentation for worked examples and tutorials.

Complete your ML workflow with sklearndf and GAMMA FACET

Check out GAMMA FACET, our open-source library for human-explainable AI. It sits on top of sklearndf and provides a comprehensive and unique toolset for ML model interpretation. See this GAMMAscope article, the repository, and our tutorials.

References

[1] scikit-learn: Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12(Oct), 2825–2830.

[2] technical debt: Sculley, D., et al. (2015). Hidden Technical Debt in Machine Learning Systems. NIPS 2015. https://papers.nips.cc/paper/2015/file/86df7dcfd896fcaf2674f757a2463eba-Paper.pdf

[3] pandas: McKinney, W. (2010). Data Structures for Statistical Computing in Python. Proceedings of the 9th Python in Science Conference, 50–61. https://pandas.pydata.org/

[4] scikit-learn feature names improvements: https://github.com/scikit-learn/enhancement_proposals/pull/48
