Useful Scikit-learn tips -1

Column Transformer and Column Selector

Bala Priya C
Oct 15 · 3 min read
Photo by Tim Mossholder on Unsplash

This series of blog posts is inspired by the videos on cool scikit-learn tips by Kevin Markham, founder of Data School.

More often than not, we come across datasets whose columns are of different types: some are categorical, others numerical. Such columns need different pre-processing strategies. Categorical variables have to be encoded, missing values in numerical columns have to be imputed, and some columns can be retained as they are because they already hold data in the form we need.

Here’s a cool tip to apply different pre-processing techniques to different columns😎. You’d need Scikit-learn version 0.20 or later to use this feature.

Let’s import the dataset that we need (The famous Titanic dataset from Kaggle😊 )

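import pandas as pd
# Load the Kaggle Titanic training data (the file name 'train.csv' is an assumption)
df = pd.read_csv('train.csv')
# Read in a subset of the DataFrame containing columns of interest
cols = ['Fare', 'Embarked', 'Sex', 'Age']
X = df[cols]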

Our Data Frame X looks like this

The Data Frame X

Let’s inspect the DataFrame X. There are two columns with categorical variables (‘Embarked’ and ‘Sex’) which should be one-hot encoded, an ‘Age’ column with missing values which have to be imputed, and a ‘Fare’ column with no missing values.
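The inspection described above can also be done in code rather than by eye. The following is a minimal sketch, assuming a small made-up stand-in for the Titanic subset; the name X_demo and its values are hypothetical and are not taken from the Kaggle data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Titanic subset (values are made up for illustration)
X_demo = pd.DataFrame({
    'Fare': [7.25, 71.28, 7.92, 53.10],
    'Embarked': ['S', 'C', 'S', 'Q'],
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, np.nan, 35.0],
})

# Non-numerical columns are the candidates for one-hot encoding
categorical_cols = X_demo.select_dtypes(exclude='number').columns.tolist()

# Count missing values per column to find the columns that need imputing
missing_counts = X_demo.isna().sum()
cols_with_missing = missing_counts[missing_counts > 0].index.tolist()

print(categorical_cols)
print(cols_with_missing)
```

On the real data, the same two checks would be run on X instead of X_demo.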

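from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
# Instantiate the One-Hot Encoder and Imputer (SimpleImputer imputes the mean by default)
ohe = OneHotEncoder()
imp = SimpleImputer()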

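We can now use the make_column_transformer function to apply the necessary pre-processing to the necessary columns:

from sklearn.compose import make_column_transformer
ct = make_column_transformer((ohe, ['Embarked', 'Sex']), (imp, ['Age']), remainder='passthrough')
# column order: Embarked (3 columns), Sex (2 columns), Age (1 column), Fare (1 column)
ct.fit_transform(X)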
Output after pre-processing

We now see that the columns ‘Embarked’ and ‘Sex’ have been one-hot encoded and the missing value in the ‘Age’ column has been replaced with the mean of the other values (mean imputation). Setting the argument remainder to 'passthrough' ensures that the other columns (here, ‘Fare’) are passed through unchanged, as they do not require any encoding or imputing strategy.
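To make the column order and the mean imputation concrete, here is a self-contained sketch of the same idea on a small made-up DataFrame. X_demo, ct_demo and the values are hypothetical stand-ins rather than the Titanic data, so the numbers will differ from the output shown in the post.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the Titanic subset (values are made up for illustration)
X_demo = pd.DataFrame({
    'Fare': [7.25, 71.28, 7.92, 53.10],
    'Embarked': ['S', 'C', 'S', 'Q'],
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, np.nan, 35.0],
})

# One-hot encode the categorical columns, impute Age, pass Fare through untouched
ct_demo = make_column_transformer(
    (OneHotEncoder(), ['Embarked', 'Sex']),
    (SimpleImputer(), ['Age']),
    remainder='passthrough',
)
out = ct_demo.fit_transform(X_demo)

# The combined result may be sparse or dense; convert to a dense array either way
out = out.toarray() if hasattr(out, 'toarray') else np.asarray(out)

print(out.shape)   # rows x (3 Embarked + 2 Sex + 1 Age + 1 Fare) columns
print(out[2, 5])   # the imputed Age: the mean of the three known ages
```

The transformed columns come first, in the order the transformers were listed, and the passed-through columns are appended at the end.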

The code used above can be found in this GitHub repo and the video can be found on YouTube

In the above example, we’ve selected columns by name, but there are several other ways to do it too. Let’s look at them now in the following code snippet.

# Choose by Column Names
ct = make_column_transformer((ohe, ['Embarked', 'Sex']))
# Choose by integer positions
ct = make_column_transformer((ohe, [1, 2]))
# Alternatively, we could use slicing
ct = make_column_transformer((ohe, slice(1, 3)))
# Use Boolean mask to choose columns
# True -> Include Column
# False -> Exclude Column
ct = make_column_transformer((ohe, [False, True, True, False]))
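As a quick sanity check, the four ways of selecting columns listed above should all produce the same encoded output. The sketch below assumes a small made-up DataFrame (X_demo is hypothetical) with the same column order as X.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in with the same column order as X: Fare, Embarked, Sex, Age
X_demo = pd.DataFrame({
    'Fare': [7.25, 71.28, 7.92, 53.10],
    'Embarked': ['S', 'C', 'S', 'Q'],
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0],
})

# Four equivalent ways of pointing at the Embarked and Sex columns
selectors = {
    'names': ['Embarked', 'Sex'],
    'positions': [1, 2],
    'slice': slice(1, 3),
    'mask': [False, True, True, False],
}

results = {}
for label, columns in selectors.items():
    ct_demo = make_column_transformer((OneHotEncoder(), columns))
    encoded = ct_demo.fit_transform(X_demo)
    # Convert to a dense array so the results can be compared directly
    results[label] = encoded.toarray() if hasattr(encoded, 'toarray') else np.asarray(encoded)

all_equal = all(np.array_equal(results['names'], r) for r in results.values())
print(all_equal)
```

Selecting by name is usually the most robust choice, since positions, slices and masks silently point at different columns if the column order changes.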

In Scikit-learn version 0.22 and later, there’s another function, make_column_selector, that we can use to choose the columns to which we would like to apply a particular encoding strategy, as illustrated in the following code snippet.

from sklearn.compose import make_column_selector
# apply to all object type columns
ct = make_column_transformer((ohe, make_column_selector(dtype_include=object)))
# apply to all non-numerical columns
ct = make_column_transformer((ohe, make_column_selector(dtype_exclude='number')))
# one-hot encode Embarked and Sex (and drop all other columns)
ct.fit_transform(X)
Output after transformation

Note that the remainder argument takes 'drop' as its default value; hence, when it is not explicitly specified, the remaining columns are dropped.
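The effect of the default can be seen by comparing output widths with and without remainder='passthrough'. This is a minimal sketch on a made-up DataFrame (X_demo and the variable names are hypothetical); it also shows that a column selector is simply a callable that returns the matching column names.

```python
import pandas as pd
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the Titanic subset (values are made up for illustration)
X_demo = pd.DataFrame({
    'Fare': [7.25, 71.28, 7.92, 53.10],
    'Embarked': ['S', 'C', 'S', 'Q'],
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, 26.0, 35.0],
})

# Calling the selector on a DataFrame returns the names of the matching columns
selector = make_column_selector(dtype_exclude='number')
selected = selector(X_demo)

# Default remainder ('drop') versus an explicit 'passthrough'
ct_drop = make_column_transformer((OneHotEncoder(), selector))
ct_keep = make_column_transformer((OneHotEncoder(), selector), remainder='passthrough')

n_cols_drop = ct_drop.fit_transform(X_demo).shape[1]
n_cols_keep = ct_keep.fit_transform(X_demo).shape[1]

print(selected)
print(ct_drop.remainder, n_cols_drop, n_cols_keep)
```

The difference between the two widths is the number of columns (here Fare and Age) that the default silently drops.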

The above code can be found in this GitHub repo and the video is on YouTube.

Happy Learning✨! Until next time 😊

Nerd For Tech

From Confusion to Clarification

Bala Priya C

Written by

Math,Signal Processing & Machine Learning Enthusiast| Passionate about Women in Tech and Diversity & Inclusion

Nerd For Tech

We are tech nerds because we believe in reinventing the world with the power of Technology. Our articles talk about some of the most disruptive ideas, technology, and innovation.
