The new ColumnTransformer will change workflows from Pandas to Scikit-Learn

From Pandas to Scikit-Learn — A new exciting workflow

Ted Petrou
Sep 3, 2018 · 21 min read


Scikit-Learn’s new integration with Pandas

Summary and goals of this article

A note before we get started

Continuing…

Upgrading to version 0.20

conda update scikit-learn
pip install -U scikit-learn
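
You can confirm the upgrade from the interpreter; the version string should read 0.20 or later:

>>> import sklearn
>>> sklearn.__version__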

Introducing ColumnTransformer and the upgraded OneHotEncoder

Kaggle Housing Dataset

Inspect the data

>>> import pandas as pd
>>> import numpy as np
>>> train = pd.read_csv('data/housing/train.csv')
>>> train.head()
>>> train.shape
(1460, 81)

Remove the target variable from the training set

>>> y = train.pop('SalePrice').values
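
Note that pop removes SalePrice from the training DataFrame in place and returns it as a Series. If you prefer not to mutate train, the same result could have been written non-destructively:

>>> y = train['SalePrice'].values
>>> train = train.drop(columns='SalePrice')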

Encoding a single string column

>>> vc = train['HouseStyle'].value_counts()
>>> vc
1Story    726
2Story    445
1.5Fin    154
SLvl       65
SFoyer     37
1.5Unf     14
2.5Unf     11
2.5Fin      8
Name: HouseStyle, dtype: int64

Scikit-Learn Gotcha — Must have 2D data

>>> hs_train = train[['HouseStyle']].copy()
>>> hs_train.ndim
2
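
For contrast, selecting the column with single brackets returns a one-dimensional Series, which Scikit-Learn estimators reject with a message like "Expected 2D array, got 1D array instead". A quick check with a throwaway variable:

>>> hs_train_1d = train['HouseStyle']
>>> hs_train_1d.ndim
1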

Import, Instantiate, Fit — The three-step process for each estimator

>>> from sklearn.preprocessing import OneHotEncoder
>>> ohe = OneHotEncoder(sparse=False)
>>> hs_train_transformed = ohe.fit_transform(hs_train)
>>> hs_train_transformed
array([[0., 0., 0., ..., 1., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.],
[0., 0., 0., ..., 1., 0., 0.],
...,
[0., 0., 0., ..., 1., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.]])
>>> hs_train_transformed.shape
(1460, 8)

We have a NumPy array. Where are the column names?

>>> feature_names = ohe.get_feature_names()
>>> feature_names
array(['x0_1.5Fin', 'x0_1.5Unf', 'x0_1Story', 'x0_2.5Fin',
'x0_2.5Unf', 'x0_2Story', 'x0_SFoyer', 'x0_SLvl'], dtype=object)

Verifying our first row of data is correct

>>> row0 = hs_train_transformed[0]
>>> row0
array([0., 0., 0., 0., 0., 1., 0., 0.])
>>> feature_names[row0 == 1]
array(['x0_2Story'], dtype=object)
>>> hs_train.values[0]
array(['2Story'], dtype=object)

Use inverse_transform to automate this

>>> ohe.inverse_transform([row0])
array([['2Story']], dtype=object)
>>> hs_inv = ohe.inverse_transform(hs_train_transformed)
>>> hs_inv
array([['2Story'],
['1Story'],
['2Story'],
...,
['2Story'],
['1Story'],
['1Story']], dtype=object)
>>> np.array_equal(hs_inv, hs_train.values)
True

Applying a transformation to the test set

>>> test = pd.read_csv('data/housing/test.csv')
>>> hs_test = test[['HouseStyle']].copy()
>>> hs_test_transformed = ohe.transform(hs_test)
>>> hs_test_transformed
array([[0., 0., 1., ..., 0., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.],
[0., 0., 0., ..., 1., 0., 0.],
...,
[0., 0., 1., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 1., 0.],
[0., 0., 0., ..., 1., 0., 0.]])
>>> hs_test_transformed.shape
(1459, 8)

Trouble area #1 — Categories unique to the test set

>>> hs_test = test[['HouseStyle']].copy()
>>> hs_test.iloc[0, 0] = '3Story'
>>> hs_test.head(3)
  HouseStyle
0     3Story
1     1Story
2     2Story
>>> ohe.transform(hs_test)
ValueError: Found unknown categories ['3Story'] in column 0 during transform


>>> ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
>>> ohe.fit(hs_train)
>>> hs_test_transformed = ohe.transform(hs_test)
>>> hs_test_transformed
array([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.],
[0., 0., 0., ..., 1., 0., 0.],
...,
[0., 0., 1., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 1., 0.],
[0., 0., 0., ..., 1., 0., 0.]])
>>> hs_test_transformed[0]
array([0., 0., 0., 0., 0., 0., 0., 0.])

Trouble area #2 — Missing Values in test set

>>> hs_test = test[['HouseStyle']].copy()
>>> hs_test.iloc[0, 0] = np.nan
>>> hs_test.iloc[1, 0] = None
>>> hs_test.head(4)
  HouseStyle
0        NaN
1       None
2     2Story
3     2Story
>>> hs_test_transformed = ohe.transform(hs_test)
>>> hs_test_transformed[:4]
array([[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1., 0., 0.],
[0., 0., 0., 0., 0., 1., 0., 0.]])

Trouble area #3 — Missing Values in training set

>>> hs_train = train[['HouseStyle']].copy()
>>> hs_train.iloc[0, 0] = np.nan
>>> ohe = OneHotEncoder(sparse=False, handle_unknown='ignore')
>>> ohe.fit_transform(hs_train)
TypeError: '<' not supported between instances of 'str' and 'float'

Must impute missing values

>>> hs_train = train[['HouseStyle']].copy()
>>> hs_train.iloc[0, 0] = np.nan
>>> from sklearn.impute import SimpleImputer
>>> si = SimpleImputer(strategy='constant', fill_value='MISSING')
>>> hs_train_imputed = si.fit_transform(hs_train)
>>> hs_train_imputed
array([['MISSING'],
['1Story'],
['2Story'],
...,
['2Story'],
['1Story'],
['1Story']], dtype=object)
>>> hs_train_transformed = ohe.fit_transform(hs_train_imputed)
>>> hs_train_transformed
array([[0., 0., 0., ..., 1., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.]])
>>> hs_train_transformed.shape
(1460, 9)
>>> ohe.get_feature_names()
array(['x0_1.5Fin', 'x0_1.5Unf', 'x0_1Story', 'x0_2.5Fin',
       'x0_2.5Unf', 'x0_2Story', 'x0_MISSING', 'x0_SFoyer',
       'x0_SLvl'], dtype=object)

More on fit_transform
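
fit_transform is simply the two estimator steps rolled into one call: fit learns whatever the estimator needs from the data (the categories, the fill value) and transform applies it. The two-step version below returns the same array as the single fit_transform call used above (a sketch using the SimpleImputer from the previous section):

>>> si.fit(hs_train)
>>> hs_train_imputed = si.transform(hs_train)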

Apply both transformations to the test set

>>> hs_test = test[['HouseStyle']].copy()
>>> hs_test.iloc[0, 0] = 'unique value to test set'
>>> hs_test.iloc[1, 0] = np.nan
>>> hs_test_imputed = si.transform(hs_test)
>>> hs_test_transformed = ohe.transform(hs_test_imputed)
>>> hs_test_transformed.shape
(1459, 9)
>>> ohe.get_feature_names()
array(['x0_1.5Fin', 'x0_1.5Unf', 'x0_1Story', 'x0_2.5Fin',
       'x0_2.5Unf', 'x0_2Story', 'x0_MISSING', 'x0_SFoyer',
       'x0_SLvl'], dtype=object)

Use a Pipeline instead

>>> from sklearn.pipeline import Pipeline
>>> si_step = ('si', SimpleImputer(strategy='constant',
                                   fill_value='MISSING'))
>>> ohe_step = ('ohe', OneHotEncoder(sparse=False,
                                     handle_unknown='ignore'))
>>> steps = [si_step, ohe_step]
>>> pipe = Pipeline(steps)
>>> hs_train = train[['HouseStyle']].copy()
>>> hs_train.iloc[0, 0] = np.nan
>>> hs_transformed = pipe.fit_transform(hs_train)
>>> hs_transformed.shape
(1460, 9)
>>> hs_test = test[['HouseStyle']].copy()
>>> hs_test_transformed = pipe.transform(hs_test)
>>> hs_test_transformed.shape
(1459, 9)

Why just the transform method for the test set?
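
The fitted pipeline stores everything it learned from the training set: the imputation fill value and, crucially, the exact categories the OneHotEncoder found. Calling transform on the test set reuses those learned categories, so the output always has the same columns in the same order as the training output. Re-fitting on the test set would learn a potentially different set of categories and produce incompatible columns:

>>> pipe.transform(hs_test)       # correct: reuses what was learned in training
>>> pipe.fit_transform(hs_test)   # wrong: re-learns categories from test data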

Transforming Multiple String Columns

>>> string_cols = ['RoofMatl', 'HouseStyle']
>>> string_train = train[string_cols]
>>> string_train.head(3)
  RoofMatl HouseStyle
0  CompShg     2Story
1  CompShg     1Story
2  CompShg     2Story
>>> string_train_transformed = pipe.fit_transform(string_train)
>>> string_train_transformed.shape
(1460, 16)

Get individual pieces of the pipeline

>>> ohe = pipe.named_steps['ohe']
>>> ohe.get_feature_names()
array(['x0_ClyTile', 'x0_CompShg', 'x0_Membran', 'x0_Metal',
'x0_Roll', 'x0_Tar&Grv', 'x0_WdShake', 'x0_WdShngl',
'x1_1.5Fin', 'x1_1.5Unf', 'x1_1Story', 'x1_2.5Fin',
'x1_2.5Unf', 'x1_2Story', 'x1_SFoyer', 'x1_SLvl'],
dtype=object)

Use the new ColumnTransformer to choose columns

('name', SomeTransformer(parameters), columns)

Pass a Pipeline to the ColumnTransformer

>>> from sklearn.compose import ColumnTransformer
>>> cat_si_step = ('si', SimpleImputer(strategy='constant',
                                       fill_value='MISSING'))
>>> cat_ohe_step = ('ohe', OneHotEncoder(sparse=False,
                                         handle_unknown='ignore'))
>>> cat_steps = [cat_si_step, cat_ohe_step]
>>> cat_pipe = Pipeline(cat_steps)
>>> cat_cols = ['RoofMatl', 'HouseStyle']
>>> cat_transformers = [('cat', cat_pipe, cat_cols)]
>>> ct = ColumnTransformer(transformers=cat_transformers)

Pass the entire DataFrame to the ColumnTransformer

>>> X_cat_transformed = ct.fit_transform(train)
>>> X_cat_transformed.shape
(1460, 16)
>>> X_cat_transformed_test = ct.transform(test)
>>> X_cat_transformed_test.shape
(1459, 16)
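
One thing to be aware of: any column not named in a transformer triplet is dropped by default. The remainder parameter controls this; a sketch (the variable name ct_keep is just for illustration) that passes the untouched columns through instead:

>>> ct_keep = ColumnTransformer(transformers=cat_transformers,
                                remainder='passthrough')

Passed-through columns are appended to the right of the transformed ones.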

Retrieving the feature names

>>> pl = ct.named_transformers_['cat']
>>> ohe = pl.named_steps['ohe']
>>> ohe.get_feature_names()
array(['x0_ClyTile', 'x0_CompShg', 'x0_Membran', 'x0_Metal',
'x0_Roll','x0_Tar&Grv', 'x0_WdShake', 'x0_WdShngl',
'x1_1.5Fin', 'x1_1.5Unf', 'x1_1Story', 'x1_2.5Fin',
'x1_2.5Unf', 'x1_2Story', 'x1_SFoyer', 'x1_SLvl'],
dtype=object)

Transforming the numeric columns

Using all the numeric columns

>>> train.dtypes.head()
Id               int64
MSSubClass       int64
MSZoning        object
LotFrontage    float64
LotArea          int64
dtype: object
>>> kinds = np.array([dt.kind for dt in train.dtypes])
>>> kinds[:5]
array(['i', 'i', 'O', 'f', 'i'], dtype='<U1')
>>> all_columns = train.columns.values
>>> is_num = kinds != 'O'
>>> num_cols = all_columns[is_num]
>>> num_cols[:5]
array(['Id', 'MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual'],
dtype=object)
>>> cat_cols = all_columns[~is_num]
>>> cat_cols[:5]
array(['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour'],
dtype=object)
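
As an aside, Pandas can do this dtype-based split for us; select_dtypes yields the same two groups without inspecting each dtype's kind character (the _alt variable names are just for illustration):

>>> num_cols_alt = train.select_dtypes(exclude='object').columns.values
>>> cat_cols_alt = train.select_dtypes(include='object').columns.values
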
>>> from sklearn.preprocessing import StandardScaler
>>> num_si_step = ('si', SimpleImputer(strategy='median'))
>>> num_ss_step = ('ss', StandardScaler())
>>> num_steps = [num_si_step, num_ss_step]
>>> num_pipe = Pipeline(num_steps)
>>> num_transformers = [('num', num_pipe, num_cols)]
>>> ct = ColumnTransformer(transformers=num_transformers)
>>> X_num_transformed = ct.fit_transform(train)
>>> X_num_transformed.shape
(1460, 37)

Combining both categorical and numerical column transformations

>>> transformers = [('cat', cat_pipe, cat_cols),
('num', num_pipe, num_cols)]
>>> ct = ColumnTransformer(transformers=transformers)
>>> X = ct.fit_transform(train)
>>> X.shape
(1460, 305)

Machine Learning

>>> from sklearn.linear_model import Ridge
>>> ml_pipe = Pipeline([('transform', ct), ('ridge', Ridge())])
>>> ml_pipe.fit(train, y)
>>> ml_pipe.score(train, y)
0.92205

Cross-Validation

>>> from sklearn.model_selection import KFold, cross_val_score
>>> kf = KFold(n_splits=5, shuffle=True, random_state=123)
>>> cross_val_score(ml_pipe, train, y, cv=kf).mean()
0.813
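
cross_val_score returns one score per fold (R², the default scorer for regressors), and we collapsed them with the mean. Keeping the full array lets you see how stable the model is across folds:

>>> scores = cross_val_score(ml_pipe, train, y, cv=kf)
>>> scores.min(), scores.max()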

Selecting parameters when Grid Searching

>>> from sklearn.model_selection import GridSearchCV
>>> param_grid = {
'transform__num__si__strategy': ['mean', 'median'],
'ridge__alpha': [.001, 0.1, 1.0, 5, 10, 50, 100, 1000],
}
>>> gs = GridSearchCV(ml_pipe, param_grid, cv=kf)
>>> gs.fit(train, y)
>>> gs.best_params_
{'ridge__alpha': 10, 'transform__num__si__strategy': 'median'}
>>> gs.best_score_
0.819

Getting all the grid search results in a Pandas DataFrame

>>> pd.DataFrame(gs.cv_results_)
Lots of data from each combination of the parameter grid
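
Since cv_results_ is just a dictionary of arrays, the resulting DataFrame has one row per parameter combination. Sorting by the built-in rank column floats the best combinations to the top:

>>> results = pd.DataFrame(gs.cv_results_)
>>> results.sort_values('rank_test_score').head()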

Building a custom transformer that does all the basics

Low-frequency strings

Writing your own estimator class

from sklearn.base import BaseEstimator

class BasicTransformer(BaseEstimator):

    def __init__(self, cat_threshold=None, num_strategy='median',
                 return_df=False):
        # store parameters as public attributes
        self.cat_threshold = cat_threshold

        if num_strategy not in ['mean', 'median']:
            raise ValueError('num_strategy must be either "mean" or '
                             '"median"')
        self.num_strategy = num_strategy
        self.return_df = return_df

    def fit(self, X, y=None):
        # Assumes X is a DataFrame
        self._columns = X.columns.values

        # Split data into categorical and numeric
        self._dtypes = X.dtypes.values
        self._kinds = np.array([dt.kind for dt in X.dtypes])
        self._column_dtypes = {}
        is_cat = self._kinds == 'O'
        self._column_dtypes['cat'] = self._columns[is_cat]
        self._column_dtypes['num'] = self._columns[~is_cat]
        self._feature_names = self._column_dtypes['num']

        # Create a dictionary mapping categorical column to unique
        # values above threshold
        self._cat_cols = {}
        for col in self._column_dtypes['cat']:
            vc = X[col].value_counts()
            if self.cat_threshold is not None:
                vc = vc[vc > self.cat_threshold]
            vals = vc.index.values
            self._cat_cols[col] = vals
            self._feature_names = np.append(self._feature_names,
                                            col + '_' + vals)

        # get total number of new categorical columns
        self._total_cat_cols = sum([len(v) for col, v in
                                    self._cat_cols.items()])

        # get mean or median
        num_cols = self._column_dtypes['num']
        self._num_fill = X[num_cols].agg(self.num_strategy)
        return self

    def transform(self, X):
        # check that we have a DataFrame with same column names as
        # the one we fit
        if set(self._columns) != set(X.columns):
            raise ValueError('Passed DataFrame has different columns '
                             'than fit DataFrame')
        elif len(self._columns) != len(X.columns):
            raise ValueError('Passed DataFrame has different number '
                             'of columns than fit DataFrame')

        # fill missing values
        num_cols = self._column_dtypes['num']
        X_num = X[num_cols].fillna(self._num_fill)

        # Standardize numerics
        std = X_num.std()
        X_num = (X_num - X_num.mean()) / std
        zero_std = np.where(std == 0)[0]

        # If there is 0 standard deviation, then all values are the
        # same. Set them to 0.
        if len(zero_std) > 0:
            X_num.iloc[:, zero_std] = 0
        X_num = X_num.values

        # create separate array for new encoded categoricals
        X_cat = np.empty((len(X), self._total_cat_cols), dtype='int')
        i = 0
        for col in self._column_dtypes['cat']:
            vals = self._cat_cols[col]
            for val in vals:
                X_cat[:, i] = X[col] == val
                i += 1

        # concatenate transformed numeric and categorical arrays
        data = np.column_stack((X_num, X_cat))

        # return either a DataFrame or an array
        if self.return_df:
            return pd.DataFrame(data=data, columns=self._feature_names)
        else:
            return data

    def fit_transform(self, X, y=None):
        return self.fit(X).transform(X)

    def get_feature_names(self):
        return self._feature_names

Using our BasicTransformer

>>> bt = BasicTransformer(cat_threshold=3, return_df=True)
>>> train_transformed = bt.fit_transform(train)
>>> train_transformed.head(3)
Columns of the DataFrame where the numerical and categorical columns meet

Using our transformer in a pipeline

>>> basic_pipe = Pipeline([('bt', bt), ('ridge', Ridge())])
>>> basic_pipe.fit(train, y)
>>> basic_pipe.score(train, y)
0.904
>>> cross_val_score(basic_pipe, train, y, cv=kf).mean()
0.816
>>> param_grid = {
'bt__cat_threshold': [0, 1, 2, 3, 5],
'ridge__alpha': [.1, 1, 10, 100]
}
>>> gs = GridSearchCV(basic_pipe, param_grid, cv=kf)
>>> gs.fit(train, y)
>>> gs.best_params_
{'bt__cat_threshold': 0, 'ridge__alpha': 10}
>>> gs.best_score_
0.830

Binning and encoding numeric columns with the new KBinsDiscretizer

>>> from sklearn.preprocessing import KBinsDiscretizer
>>> kbd = KBinsDiscretizer(encode='onehot-dense')
>>> year_built_transformed = kbd.fit_transform(train[['YearBuilt']])
>>> year_built_transformed
array([[0., 0., 0., 0., 1.],
[0., 0., 1., 0., 0.],
[0., 0., 0., 1., 0.],
...,
[1., 0., 0., 0., 0.],
[0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0.]])
>>> year_built_transformed.sum(axis=0)
array([292., 274., 307., 266., 321.])
>>> kbd.bin_edges_
array([array([1872. , 1947.8, 1965. , 1984. , 2003. , 2010. ])],
      dtype=object)
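
The uneven bin widths come from the default strategy, 'quantile', which places roughly the same number of observations in each bin; the nearly equal column sums above confirm it. Equal-width bins are available with strategy='uniform' (and 'kmeans' for cluster-based edges):

>>> kbd_uniform = KBinsDiscretizer(n_bins=5, encode='onehot-dense',
                                   strategy='uniform')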

Processing all the year columns separately with ColumnTransformer

>>> year_cols = ['YearBuilt', 'YearRemodAdd', 'GarageYrBlt', 
'YrSold']
>>> not_year = ~np.isin(num_cols, year_cols + ['Id'])
>>> num_cols2 = num_cols[not_year]
>>> year_si_step = ('si', SimpleImputer(strategy='median'))
>>> year_kbd_step = ('kbd', KBinsDiscretizer(n_bins=5,
                                             encode='onehot-dense'))
>>> year_steps = [year_si_step, year_kbd_step]
>>> year_pipe = Pipeline(year_steps)
>>> transformers = [('cat', cat_pipe, cat_cols),
('num', num_pipe, num_cols2),
('year', year_pipe, year_cols)]
>>> ct = ColumnTransformer(transformers=transformers)
>>> X = ct.fit_transform(train)
>>> X.shape
(1460, 320)
>>> ml_pipe = Pipeline([('transform', ct), ('ridge', Ridge())])
>>> cross_val_score(ml_pipe, train, y, cv=kf).mean()
0.813

More goodies in Scikit-Learn 0.20
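
Version 0.20 ships more than what was covered here, including PowerTransformer, TransformedTargetRegressor, and an OrdinalEncoder that complements the upgraded OneHotEncoder.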

Conclusion
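
With ColumnTransformer, the upgraded OneHotEncoder, SimpleImputer, and KBinsDiscretizer, Scikit-Learn 0.20 finally makes it practical to feed a Pandas DataFrame straight into a single Pipeline that imputes, encodes, scales, and models in one shot, with every step's parameters available to a grid search.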
