Categorical Encoding with Pandas: get_dummies

Samuel Kehinde Ayo
Published in Analytics Vidhya
4 min read · Sep 17, 2021

In this article we will look at a simple yet powerful categorical encoding technique in Pandas.

Simply put, Pandas is a Python library for manipulating and analysing tabular data in data frames.

In the data wrangling stage of the data processing pipeline, encoding categorical variables is crucial. Machine learning models, including neural networks, are mathematical models whose algorithms work with numerical data types.

This is why we need encoding methods to convert non-numerical data into meaningful numerical data. For this, we look at the Pandas get_dummies function.

get_dummies is one of the easiest ways to implement one-hot encoding, and it has several useful parameters, of which we will cover the most important ones.

You can perform one-hot encoding in just one line with get_dummies.

We will be using a salary dataset for this demo, which you can download here.

The objective of this data science process is to predict the salary of individuals based on other features. We will use Linear Regression for this data, but the data is not ready for the machine learning model. How do we determine this? We’ll use the pandas info() method to get a descriptive look at the data.

Pandas uses the object data type for text columns, which is how categorical variables usually show up. The data contains categorical (non-numerical) columns, so we need to transform them.
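As a minimal sketch of this check, the snippet below runs info() on a small stand-in frame and lists the non-numeric columns. The column names and values here are made up for illustration and are not the real salary dataset.

```python
import pandas as pd

# Toy stand-in for the salary dataset; names and values are assumptions.
train_data = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "country": ["britain", "france", "germany", "spain"],
    "education": ["bachelor", "master", "phd", "bachelor"],
    "salary": [30000, 45000, 80000, 62000],
})

# info() prints every column with its dtype and non-null count.
train_data.info()

# Anything that is not numeric still needs encoding before modelling.
categorical_cols = train_data.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)
```

Selecting by "not numeric" rather than by a specific dtype name keeps the check independent of how a given pandas version labels text columns.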

For this, we will implement get_dummies.

get_dummies creates a one-hot encoded matrix for every target column we specify. But what about label encoding, and when should we use which?

We apply one-hot encoding (OHE) when:

The values that are close to each other in the label encoding correspond to target values that aren’t close (non-linear data).

The categorical feature is not ordinal (dog, cat, mouse).

We apply label encoding (LE) when:

The categorical feature is ordinal (Jr. kg, Sr. kg, Primary school, High school, etc.).
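The two cases can be sketched side by side. This example uses an ordered pandas Categorical for the label encoding, which is one possible way to do it and not necessarily the one used elsewhere; the sample values are taken from the lists above.

```python
import pandas as pd

# Nominal feature: no natural order, so one-hot encode it.
pets = pd.Series(["dog", "cat", "mouse", "cat"])
pets_ohe = pd.get_dummies(pets, prefix="pet", dtype=int)

# Ordinal feature: state the order explicitly, then use the integer codes.
school = pd.Series(["Primary school", "Jr. kg", "High school", "Sr. kg"])
order = ["Jr. kg", "Sr. kg", "Primary school", "High school"]
school_codes = pd.Categorical(school, categories=order, ordered=True).codes

print(pets_ohe)
print(list(school_codes))
```

Spelling out the order matters: an encoder that sorts labels alphabetically would assign integers that do not follow the real ranking.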

Let’s get on!

Since we have two categorical columns, we will perform encoding for each of them separately.

Let’s begin with the country column:

First we make a copy of our data, then we check for unique values in the country column.

The “country” column has 4 unique values, which means we will get 4 columns after applying get_dummies().
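A short sketch of these two steps follows. The frame is a toy stand-in, and every country name other than britain is an assumption made for illustration.

```python
import pandas as pd

# Toy stand-in; only "britain" is a value named in this article.
train_data = pd.DataFrame({
    "country": ["britain", "france", "germany", "spain", "france", "britain"],
    "salary": [30000, 45000, 80000, 62000, 41000, 52000],
})

# Work on a copy so the original frame stays untouched.
data = train_data.copy()

# unique() lists the distinct values, nunique() counts them.
print(data["country"].unique())
n_categories = data["country"].nunique()
print(n_categories)
```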

With this syntax we can apply get_dummies to a column of a dataframe:

static = pd.get_dummies(train_data['country'], prefix_sep='_', prefix='country')

static = a variable name to hold our new dataframe

train_data['country'] = the target categorical column from our dataset

prefix_sep = the separator placed between the prefix and the category value in each new column name

prefix = the string prepended to each new column name

As you can see, we got a 4-column dataframe with 120 rows after get_dummies().

The column names are built from the values of the train_data country column.

Each unique value became its own column, and every row holds a 1 in the column that matches its original country and a 0 in all the others.
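The call and its result can be reproduced on a small frame. This is a sketch on made-up data (all country names except britain are assumptions). dtype=int is passed so the output holds 1 and 0, because recent pandas versions return True/False by default.

```python
import pandas as pd

# Toy stand-in; country names other than "britain" are assumptions.
train_data = pd.DataFrame({
    "country": ["britain", "france", "germany", "spain", "france", "britain"],
    "salary": [30000, 45000, 80000, 62000, 41000, 52000],
})

# One new column per unique value, named <prefix><prefix_sep><value>.
static = pd.get_dummies(
    train_data["country"], prefix="country", prefix_sep="_", dtype=int
)

print(static.columns.tolist())
print(static)
```

Every row of the result sums to exactly 1, since each person belongs to exactly one country.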

Dummies with drop_first=True parameter

The drop_first=True parameter can be used to drop the first dummy column. This leaves us with 3 columns. The default value of this parameter is False, so we set it to True explicitly. Let’s see how it works.

static = pd.get_dummies(train_data['country'], prefix_sep='_', prefix='country', drop_first=True)

It removes the first column of the get_dummies() dataframe. The first column of the “static” dataframe is country_britain. If the country is britain, all the remaining columns are 0, so when all columns are 0 the model knows the country is britain.

Check out the example below.
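A minimal sketch of drop_first on toy data is shown here. The country names other than britain are assumptions, and dtype=int is added only to get 1/0 output on recent pandas versions.

```python
import pandas as pd

# Toy stand-in; country names other than "britain" are assumptions.
train_data = pd.DataFrame({
    "country": ["britain", "france", "germany", "spain", "france", "britain"],
})

# drop_first=True removes the first category's column (country_britain here).
static = pd.get_dummies(
    train_data["country"], prefix="country", prefix_sep="_",
    drop_first=True, dtype=int,
)

# Rows that were "britain" are now all zeros across the remaining columns.
britain_rows = static[train_data["country"] == "britain"]
print(static.columns.tolist())
print(britain_rows)
```

Dropping one column loses no information and avoids perfectly correlated inputs, which is helpful for linear models such as the Linear Regression planned here.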

get_dummies returns a new dataframe, so our original data frame, train_data, keeps its shape. We must merge these dataframes.

We can merge them in two ways: using join or concat.

Concat:

We will concatenate static and train_data into one data frame using the concat method and then drop the “country” column. We don’t need it anymore because it’s not numerical.

We concat on columns by stating axis=1, which places our dataframes side by side.

This avoids the mistake of stacking them vertically (the default, axis=0), which would increase the number of rows and fill the non-matching columns with missing values.
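The difference between the two axes can be sketched on toy data. The frame below is made up for illustration, and the variable names encoded and stacked are hypothetical.

```python
import pandas as pd

# Toy stand-in; country names other than "britain" are assumptions.
train_data = pd.DataFrame({
    "country": ["britain", "france", "germany", "spain"],
    "salary": [30000, 45000, 80000, 62000],
})
static = pd.get_dummies(train_data["country"], prefix="country", dtype=int)

# axis=1 places the frames side by side; then drop the original text column.
encoded = pd.concat([train_data, static], axis=1).drop(columns="country")

# For contrast: the default axis=0 stacks the rows and leaves NaN gaps.
stacked = pd.concat([train_data, static])

print(encoded.shape)
print(stacked.shape, stacked.isna().any().any())
```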

Join:

We can also merge the dataframes using the join method, but we should drop the “country” column just as we did with concat, because it’s not numerical.

join aligns the dataframes on their index and places them side by side, so it takes no axis argument. An axis=1 argument belongs only to the drop call that removes the “country” column.
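The same merge with join might look like the sketch below, again on made-up data with a hypothetical variable name encoded.

```python
import pandas as pd

# Toy stand-in; country names other than "britain" are assumptions.
train_data = pd.DataFrame({
    "country": ["britain", "france", "germany", "spain"],
    "salary": [30000, 45000, 80000, 62000],
})
static = pd.get_dummies(train_data["country"], prefix="country", dtype=int)

# Drop the text column first, then join the dummy columns on the index.
encoded = train_data.drop(columns="country").join(static)

print(encoded)
```

Both frames share the same index because static was built from train_data, so the rows line up without any key column.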

Once every categorical column has been encoded this way, all columns are numeric. Our data is now ready for the model.
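A quick way to confirm this is to check the dtypes of the final frame. The sketch below encodes both categorical columns of a toy frame in one call through the columns parameter of get_dummies, which is a shortcut rather than the column-by-column route taken above. The education column and all values are assumptions.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

# Toy stand-in; the "education" column and all values are assumptions.
train_data = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "country": ["britain", "france", "germany", "spain"],
    "education": ["bachelor", "master", "phd", "bachelor"],
    "salary": [30000, 45000, 80000, 62000],
})

# columns= encodes the listed columns and drops the originals in one step.
encoded = pd.get_dummies(
    train_data, columns=["country", "education"], drop_first=True, dtype=int
)

# True only if every remaining column has a numeric dtype.
all_numeric = all(is_numeric_dtype(dtype) for dtype in encoded.dtypes)
print(encoded.dtypes)
print(all_numeric)
```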

Conclusion

Pandas get_dummies is a convenient one-hot encoding technique for preparing data before modeling. It is arguably the easiest way to do it, and its parameters help keep the encoded columns readable.

Samuel Kehinde Ayo, Data Scientist | AI engineer | Senior Software engineer