A brief view of machine learning pipelines in Python

Hannah Yan Han
1 min read · Jun 6, 2017


As I step out of R’s comfort zone and venture into Python land, I find scikit-learn’s pipeline useful to understand before moving on to more advanced or automated algorithms.

A pipeline chains several steps together once the initial exploration is done. Some steps transform features: they normalise numerical columns, turn text into vectors, or fill in missing data; these are transformers. Other steps predict a target variable by fitting an algorithm such as a random forest or a support vector machine; these are estimators. A Pipeline chains all of these together so they can be applied to the training data en bloc.
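To make the distinction concrete, here is a minimal sketch of the two roles, using StandardScaler as the transformer and a tiny made-up dataset (the data values are placeholders for illustration only):

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])  # toy features
y_train = np.array([0, 1, 1])  # toy labels

# A transformer exposes fit/transform: it reshapes features, it does not predict
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# An estimator exposes fit/predict: it learns to predict the target
tree = DecisionTreeClassifier()
tree.fit(X_scaled, y_train)
print(tree.predict(X_scaled))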

Here is an example of a pipeline that imputes missing data with the most frequent value of each column and then fits a decision tree classifier.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.tree import DecisionTreeClassifier

# chain imputation and classification into one estimator
steps = [('imputation', Imputer(missing_values='NaN', strategy='most_frequent', axis=0)),
         ('clf', DecisionTreeClassifier())]
pipeline = Pipeline(steps)
clf = pipeline.fit(X_train, y_train)
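Once fitted, the whole pipeline acts as a single model: calling predict first applies the same imputation and then the tree. A one-line sketch, assuming a held-out X_test exists:

y_pred = pipeline.predict(X_test)  # imputes X_test, then predicts with the tree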

Instead of fitting a single model, the same pipeline can be looped over several classifiers to find the best one.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

classifiers = [
    KNeighborsClassifier(5),
    RandomForestClassifier(),
    GradientBoostingClassifier()]

for clf in classifiers:
    steps = [('imputation', Imputer(missing_values='NaN', strategy='most_frequent', axis=0)),
             ('clf', clf)]
    pipeline = Pipeline(steps)
    pipeline.fit(X_train, y_train)  # compare the fitted candidates with cross-validation below

I also learnt that the pipeline itself can be used as an estimator and passed to cross-validation or grid search.

from sklearn.model_selection import KFold, cross_val_score

seed = 42  # any fixed value; keeps the folds reproducible
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
results = cross_val_score(pipeline, X_train, y_train, cv=kfold)
print(results.mean())
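For the grid-search side, here is a minimal sketch over the decision-tree pipeline from above; a pipeline step’s parameters are addressed as <step name>__<parameter>, and the max_depth values are arbitrary choices for illustration:

from sklearn.model_selection import GridSearchCV

# the step name 'clf' plus a double underscore reaches into the classifier
param_grid = {'clf__max_depth': [3, 5, 10]}
search = GridSearchCV(pipeline, param_grid, cv=kfold)
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)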

This is #day35 of my #100dayprojects on data science and visual storytelling. Thanks for reading; feedback is welcome.
