How to Get Feature Importances from Any Sklearn Pipeline
Pipelines can be hard to navigate; here’s some code that works in general.
Introduction
Pipelines are amazing! I use them in basically every data science project I work on. But easily getting the feature importances out of a pipeline is more difficult than it needs to be. In this tutorial, I’ll walk through how to access individual feature names and their coefficients from a Pipeline. After that, I’ll show a generalized solution for getting feature importance for just about any pipeline.
Pipelines
Let’s start with a super simple pipeline that applies a single featurization step followed by a classifier.
from datasets import load_dataset
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
# Load the IMDB dataset from the Hugging Face hub
imdb_data = load_dataset('imdb')
classifier = svm.LinearSVC(C=1.0, class_weight="balanced")
model = Pipeline(
    [
        ("vectorizer", TfidfVectorizer()),
        ("classifier", classifier),
    ]
)
x_train = [x["text"]for x in imdb_data["train"]]
y_train = [x["label"]for x in…