In the real world, datasets often contain a mix of feature columns: unstructured text, categorical, and numerical. After performing feature selection, you need to pass these different column types together to train a model. In this article, I walk through exactly that scenario: after performing feature selection on a dataset, I had to work with categorical columns, numerical columns, and an unstructured text column.

The first challenge was how to feed these different feature types together. For that, I used the ColumnTransformer class, introduced in scikit-learn version 0.20.
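As a minimal sketch of the idea (using a toy two-row frame, not the article's dataset), ColumnTransformer routes each column to its own transformer and horizontally stacks the results:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# A toy frame with one numeric and one categorical column (illustrative only).
toy = pd.DataFrame({'Age': [25.0, 40.0], 'Dept': ['Tops', 'Dresses']})

ct = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['Age']),   # scale the numeric column
    ('cat', OneHotEncoder(), ['Dept']),   # one-hot encode the categorical column
])
out = ct.fit_transform(toy)
assert out.shape == (2, 3)  # 1 scaled column + 2 one-hot columns
```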

Let us take a look at the code.

Here I am using a Kaggle dataset as a sample for this article. (The approach below simply demonstrates ColumnTransformer on this dataset; it does not attempt to solve the Kaggle problem.)
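The snippets in this article assume roughly the following imports (a minimal set inferred from the code shown later; adjust to your environment):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# The stop-word list used later comes from NLTK and needs a one-time
# download; uncomment if you haven't fetched it yet:
# import nltk; nltk.download('stopwords')
# from nltk.corpus import stopwords
```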

df = pd.read_csv('/Womens Clothing E-Commerce Reviews.csv')

Let us perform common feature engineering steps.

1. Removing the unnamed index column (column 0).

2. Removing rows with nulls in ‘Class Name’, ‘Review Text’, and ‘Title’.

df.drop(df.columns[0], axis=1, inplace=True)
df = df[~df['Class Name'].isnull()]
df = df[~df['Review Text'].isnull()]
df = df[~df['Title'].isnull()]

Let’s see how df looks now.

Let’s find out the data type of each column in df.
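A quick way to see the mix of data types is df.dtypes. Sketched here on a tiny frame mirroring the dataset's mix (the column names are illustrative):

```python
import pandas as pd

# A toy frame with the same mix of types as df.
sample = pd.DataFrame({
    'Age': [34, 50],                            # numeric
    'Division Name': ['General', 'Intimates'],  # categorical
    'Review Text': ['love it', 'runs small'],   # unstructured text
})
print(sample.dtypes)  # 'Age' is int64; the other two show as object
```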

Clearly, df has heterogeneous data types. Now, suppose I want to use all of these features in the model. Before feeding them to the model, the features need to be transformed into a refined form: numeric features need to be scaled with StandardScaler, categorical ones one-hot encoded, and the text columns vectorized with CountVectorizer. We can achieve all of this with the ColumnTransformer class, and feed the results into the model together.

The ColumnTransformer class can take up to six parameters, but in this article I will use only two of them — transformers and remainder.

Let’s talk first about transformers.

‘transformers’ is a list of tuples specifying the transformer objects to be applied to subsets of the data. Each tuple takes one of the formats below.

Multiple columns → (Name, Transformer, [columns])

Single text column → (Name, Transformer, column)

where

Name: type str

Used for setting params using set_params and helps in searching in a grid search.

Transformer: {‘drop’, ‘passthrough’} or an estimator

The estimator must support fit and transform. I will be using StandardScaler for the numeric columns, CountVectorizer for the unstructured text columns, and OneHotEncoder for the categorical columns.

[columns]

List of data frame columns you want to perform transformations on.
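The distinction between the two tuple formats matters for text: a plain column-name string selects a 1-D Series of documents, which is what CountVectorizer expects, while a one-element list selects a 2-D frame and fails. A small sketch (toy data, not the article's dataset):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

frame = pd.DataFrame({'Title': ['great fit', 'runs small', 'great color']})

# 'Title' (a string, not ['Title']) hands CountVectorizer a 1-D Series.
ct = ColumnTransformer(transformers=[('text', CountVectorizer(), 'Title')])
bow = ct.fit_transform(frame)
assert bow.shape == (3, 5)  # vocabulary: color, fit, great, runs, small
```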

You can read more about this at scikit-learn.org: https://scikit-learn.org/stable/modules/generated/sklearn.compose.ColumnTransformer.html#sklearn.compose.ColumnTransformer
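The Name in each tuple is more than a label: it becomes the prefix for nested parameter names, which is how set_params and a GridSearchCV parameter grid address the inner estimator. A minimal sketch:

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

ct = ColumnTransformer(transformers=[('num', StandardScaler(), ['Age'])])

# The tuple name 'num' prefixes the scaler's own parameters, so a grid
# search could tune e.g. {'num__with_mean': [True, False]}.
ct.set_params(num__with_mean=False)
assert ct.get_params()['num__with_mean'] is False
```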

For my dataset, I need to pass the transformers below in the transformers parameter of ColumnTransformer.

  1. SimpleImputer with constant imputing, followed by OneHotEncoder, for the categorical columns [‘Division Name’, ‘Department Name’].
  2. CountVectorizer with STOPWORDS on ‘Review Text’ and a plain CountVectorizer on the ‘Title’ column.
  3. SimpleImputer with median imputing, followed by StandardScaler, for the numerical columns [‘Clothing ID’, ‘Age’, ‘Rating’, ‘Recommended IND’, ‘Positive Feedback Count’].
STOPWORDS = set(stopwords.words('english'))

catTransformer = Pipeline(steps=[
    ('cat_imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('cat_ohe', OneHotEncoder(handle_unknown='ignore'))])

textTransformer_0 = Pipeline(steps=[
    ('text_bow', CountVectorizer(lowercase=True,
                                 token_pattern=r"(?u)\b\w+\b",
                                 stop_words=STOPWORDS))])

textTransformer_1 = Pipeline(steps=[('text_bow1', CountVectorizer())])

numeric_features = ['Clothing ID', 'Age', 'Rating', 'Recommended IND', 'Positive Feedback Count']

numTransformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

ct = ColumnTransformer(
    transformers=[
        ('cat', catTransformer, ['Division Name', 'Department Name']),
        ('num', numTransformer, numeric_features),
        ('text1', textTransformer_0, 'Review Text'),
        ('text2', textTransformer_1, 'Title')
    ])

Splitting the dataset and fitting the ColumnTransformer created above inside a Pipeline.

X = df.drop('Class Name', axis='columns')
y = df['Class Name']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

pipeline = Pipeline(steps=[
    ('feature_engineer', ct),
    ('RF', RandomForestClassifier(n_jobs=-1, class_weight='balanced'))])

pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)

print('accuracy %s' % accuracy_score(preds, y_test))
print(confusion_matrix(y_test, preds))
print(classification_report(y_test, preds))

I have now successfully created a pipeline and evaluated the model’s accuracy by passing heterogeneous features along with their respective transformations.

Now let’s discuss the second parameter, ‘remainder’. It controls what happens to the remaining columns of the dataset for which no transformations are defined. Its default value is ‘drop’, which means that if I don’t specify a value, the remaining columns are dropped from the dataset; the other value is ‘passthrough’, which keeps them as-is. The choice also affects the accuracy of the model: in this case, I got better accuracy using ‘passthrough’, i.e. Case 1.

Suppose I do not want to transform certain columns of the dataset and instead want to feed them in as they are. I will use the ‘Rating’ column to demonstrate both values of the parameter. To make ‘Rating’ eligible for the ‘remainder’ parameter, I will not apply the numeric transformations to it.

Case 1

With remainder = ‘passthrough’.

STOPWORDS = set(stopwords.words('english'))

catTransformer = Pipeline(steps=[
    ('cat_imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('cat_ohe', OneHotEncoder(handle_unknown='ignore'))])

textTransformer_0 = Pipeline(steps=[
    ('text_bow', CountVectorizer(lowercase=True,
                                 token_pattern=r"(?u)\b\w+\b",
                                 stop_words=STOPWORDS))])

textTransformer_1 = Pipeline(steps=[('text_bow1', CountVectorizer())])

numeric_features = ['Clothing ID', 'Age', 'Recommended IND', 'Positive Feedback Count']

numTransformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

ct = ColumnTransformer(
    transformers=[
        ('cat', catTransformer, ['Division Name', 'Department Name']),
        ('num', numTransformer, numeric_features),
        ('text1', textTransformer_0, 'Review Text'),
        ('text2', textTransformer_1, 'Title')],
    remainder='passthrough')

Case 2

Not setting the ‘remainder’ parameter means its default value ‘drop’ is used by the ColumnTransformer class, so the ‘Rating’ column gets dropped from the input dataset.

STOPWORDS = set(stopwords.words('english'))

catTransformer = Pipeline(steps=[
    ('cat_imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('cat_ohe', OneHotEncoder(handle_unknown='ignore'))])

textTransformer_0 = Pipeline(steps=[
    ('text_bow', CountVectorizer(lowercase=True,
                                 token_pattern=r"(?u)\b\w+\b",
                                 stop_words=STOPWORDS))])

textTransformer_1 = Pipeline(steps=[('text_bow1', CountVectorizer())])

numeric_features = ['Clothing ID', 'Age', 'Recommended IND', 'Positive Feedback Count']

numTransformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

ct = ColumnTransformer(
    transformers=[
        ('cat', catTransformer, ['Division Name', 'Department Name']),
        ('num', numTransformer, numeric_features),
        ('text1', textTransformer_0, 'Review Text'),
        ('text2', textTransformer_1, 'Title')])
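The effect of the two remainder values can be sketched on a toy frame where, as with ‘Rating’ above, one column has no transformer assigned (the frame and column names here are illustrative):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the dataset: 'Rating' has no transformer assigned.
toy = pd.DataFrame({'Age': [25.0, 40.0, 33.0], 'Rating': [5, 3, 4]})

transformers = [('num', StandardScaler(), ['Age'])]
ct_drop = ColumnTransformer(transformers=transformers)  # remainder defaults to 'drop'
ct_pass = ColumnTransformer(transformers=transformers, remainder='passthrough')

assert ct_drop.fit_transform(toy).shape[1] == 1  # 'Rating' dropped
assert ct_pass.fit_transform(toy).shape[1] == 2  # 'Rating' kept untouched
```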

Summary

In this article, we learned how to pass columns of heterogeneous data types, each with its own transformations, using ColumnTransformer.

Demonstrated:

  • Handling numeric, categorical and unstructured text columns
  • remainder parameter with ‘passthrough’ and ‘drop’ values

I hope you enjoyed reading this article!!
