How to keep feature names in sklearn Pipeline

Track the feature names in a Pipeline with Transformers

Anderson Rici Amorim
5 min read · Oct 20, 2022
Image source: https://pt.wikipedia.org/wiki/Scikit-learn#/media/Ficheiro:Scikit_learn_logo_small.svg

Using a scikit-learn Pipeline is a great way to avoid data leakage in the modeling process. Basically, data leakage is the use of information during model training that would not be available at prediction time, causing the predictive scores (metrics) to overestimate the model’s utility when it runs in a production environment. You can learn more about it here.
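To make that concrete, here is a minimal sketch of the kind of leakage a Pipeline prevents; it uses the small iris toy dataset purely for illustration:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# leaky: fitting the scaler on ALL rows lets test-set statistics
# influence the training data
# X_scaled = StandardScaler().fit_transform(X)

# leak-free: inside a Pipeline, the scaler is fit on the training split only
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
pipe = Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())])
pipe.fit(X_train, y_train)
print(pipe.score(X_test, y_test))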

With the sklearn Pipeline class, we may add as many data preprocessing steps to our model pipeline as we need, such as imputing missing data, scaling, transforming categorical features, and so on. As a final step, the predictor to be trained is placed at the end of the pipeline, so that the input data it receives has already been preprocessed.

However, one of the main drawbacks of using the sklearn Pipeline class is that sklearn Transformers return NumPy arrays. Thus, even though we input a Pandas DataFrame with the names of the features, we lose them during the pipeline execution. This is a problem when we need to explain our model using, for example, feature importances, because we are not able to track feature names natively.

Well, but now we can! The scikit-learn community just released a new API named set_output, which makes it possible to track feature names in sklearn pipelines by setting the transformers to return, for example, Pandas DataFrames.
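Here is a minimal sketch of the idea before the full example (the toy DataFrame is just an illustration): by default a transformer returns a bare NumPy array, but after calling set_output it returns a labeled DataFrame. Note that set_output can also be called on a single estimator, instead of globally:

import pandas as pd
from sklearn.preprocessing import StandardScaler

toy = pd.DataFrame({'age': [22.0, 38.0, 26.0], 'fare': [7.25, 71.28, 7.92]})

# default behavior: a plain NumPy array, column names are lost
print(type(StandardScaler().fit_transform(toy)))  # <class 'numpy.ndarray'>

# with set_output on a single estimator: a labeled DataFrame
scaler = StandardScaler().set_output(transform="pandas")
print(scaler.fit_transform(toy).columns.tolist())  # ['age', 'fare']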

So let’s get into it!

Hands on!

First of all, at the time I’m writing this article, the API is under development. So, to use it, we need to install the nightly build of sklearn. You may find out how to do it here.

Well, to illustrate how to use the set_output API, we are going to create a classifier with a sklearn Pipeline, using Transformers to preprocess the data, with the goal of predicting Titanic survivors.

So, let’s import the packages we need:

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn import set_config
import matplotlib.pyplot as plt
%matplotlib inline

Now, let’s get the dataset and split the data into train/test:

X, y = fetch_openml(
"titanic", version=1, as_frame=True, return_X_y=True, parser="pandas"
)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
X_train.head()

As we can see, the dataset has some categorical features, so we need to preprocess and transform the data. Moreover, there are some features I don’t want to feed into the model training, so I’m just dropping them (ColumnTransformer drops any column not listed, since its remainder parameter defaults to 'drop'). The idea here is to use sklearn Transformers to preprocess the data:

num_features = ['age', 'fare']
cat_features = ['embarked', 'sex', 'pclass']

# here we call the new API set_config to tell sklearn we want to output a pandas DF
set_config(transform_output="pandas")

# creating the numerical pipeline
num_pipe = Pipeline([
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler())
])

# creating the transformer to preprocess the data
transformer = ColumnTransformer(
    [
        ('numerical', num_pipe, num_features),
        ('categorical', OneHotEncoder(sparse_output=False,
                                      drop="if_binary",
                                      handle_unknown="ignore"),
         cat_features)
    ],
    verbose_feature_names_out=False,
)

The trick here is the set_config(transform_output="pandas") line, where we call the new API and tell sklearn that we no longer want Transformers to output NumPy arrays, but Pandas DataFrames instead. This way, we are able to keep the feature names! With the code above, I’ve created a pipeline to preprocess the numeric data and an encoder for the categorical data, then gathered both of them into a ColumnTransformer.
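As a quick sanity check (a sketch assuming the cells above have been run), we can confirm the preprocessed output is now a DataFrame carrying the generated feature names:

# sanity check: the preprocessed output is a DataFrame with named columns
X_prep = transformer.fit_transform(X_train)
print(type(X_prep))             # <class 'pandas.core.frame.DataFrame'>
print(X_prep.columns.tolist())  # e.g. ['age', 'fare', 'embarked_C', ...]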

Now, we just have to put the preprocessing transformer into a new Pipeline object together with the predictor to train the model. Here, we are just using a Random Forest Classifier with no tuning stage, since the idea is only to show the use of the new set_output API.

# creating the classifier pipeline with a data preprocessing step and RF classifier
rf_pipeline = Pipeline([
    ('dataprep', transformer),
    ('rf_clf', RandomForestClassifier(n_estimators=100,
                                      max_depth=10,
                                      class_weight='balanced',
                                      random_state=123,
                                      verbose=0))
])

# training the model
rf_pipeline.fit(X_train, y_train)
Our trained pipeline.
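Before inspecting the feature names, we can quickly sanity-check the model with the classification_report we imported earlier; this is just a rough evaluation sketch, and the exact scores will depend on the random split:

# evaluating the trained pipeline on the held-out test set
y_pred = rf_pipeline.predict(X_test)
print(classification_report(y_test, y_pred))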

As we can see, the predictor is the last step in our Pipeline object. So we need to retrieve it and check whether we were able to keep the feature names with the new API.

# retrieving the RF Classifier from the model pipeline
clf = rf_pipeline[-1]
# building a Pandas DataFrame with the feature names and their importances
data = list(zip(clf.feature_names_in_, clf.feature_importances_))
df_importances = pd.DataFrame(data, columns=['Feature', 'Importance'])
df_importances = df_importances.sort_values(by='Importance', ascending=False)
df_importances
Pandas DataFrame with feature names and their respective importance scores.

The feature_names_in_ and feature_importances_ attributes store the names of the features seen during fit and the impurity-based feature importances, respectively.
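As an aside, the same names can also be recovered from the preprocessing side of the pipeline; a short sketch relying on Pipeline slicing and get_feature_names_out, both available in recent sklearn versions:

# equivalent route: ask the preprocessing steps for the generated names
feature_names = rf_pipeline[:-1].get_feature_names_out()
print(list(feature_names))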

And as we can see, we were able to keep the feature names throughout the whole pipeline! We can also plot them for better visualization:

df_importances.plot.barh(x='Feature', y='Importance')
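A small optional refinement (just a cosmetic sketch using the matplotlib handle we imported earlier): sorting ascending puts the most important feature at the top of the horizontal bar chart.

# optional: ascending sort puts the most important feature at the top
ax = df_importances.sort_values(by='Importance').plot.barh(
    x='Feature', y='Importance', legend=False)
ax.set_xlabel('Impurity-based importance')
plt.tight_layout()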

Conclusion

With the new set_config API, we are now able to keep the feature names in sklearn Pipelines. This is important because it allows us to better explain the model, which is an important step toward its adoption from a business point of view.

As we noticed, with just one line of code we called the API and it works just fine, doing exactly what it is supposed to do! Cheers to the sklearn community (:

I hope this article may help you! You may find the whole code in my GitHub repo here. See ya!

References

scikit-learn documentation, “Introducing the set_output API”: https://scikit-learn.org/dev/auto_examples/miscellaneous/plot_set_output.html#sphx-glr-auto-examples-miscellaneous-plot-set-output-py

