Encoding Categorical Data in Machine Learning

Most Machine Learning Algorithms accept only Numerical data as input. For example, the K-Nearest Neighbors Algorithm calculates the Euclidean distance between two observations of a feature, and to calculate that distance the input passed must be of Numerical type. So Categorical data must be transformed or encoded into Numerical type before feeding the data to an Algorithm, which in turn yields better results.

Akhil Reddy Mallidi
#ByCodeGarage
14 min readJun 23, 2019


Categorical data is of two types. Categorical data that has an intrinsic ordering among its labels is called Ordinal type. Categorical data that doesn't have any intrinsic ordering among its labels is called Nominal type.

Some examples of Nominal Categorical data are: the make, fuel-type or body-style of a car, where no label is greater or smaller than any other.

Some examples of Ordinal Categorical data are: the num-of-doors of a car (four > two) or an education level (graduate > high school), where the labels carry a natural order.

In this story we will discuss various techniques to encode Categorical data, which is one of the Data preprocessing steps to be performed before feeding data to Machine Learning models.

We will be working on a cars dataset to explore various techniques of encoding Categorical data.

Download the cars dataset by clicking here

Firstly, load all the necessary libraries required for our project. We will load the NumPy and Pandas libraries initially, where the NumPy package is for performing mathematical operations and the Pandas package is for manipulating the dataset.

The next step, after importing all the basic required libraries, is to load our cars dataset into a pandas dataframe using the pandas read_csv() method.
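A minimal sketch of the loading step. The filename Automobile_data.csv is an assumption (use the path of the dataset you downloaded); a tiny inline sample is used as a fallback so the snippet runs on its own.

```python
import numpy as np   # mathematical operations
import pandas as pd  # dataset manipulation

try:
    # Assumed filename -- substitute your local path to the cars dataset.
    df = pd.read_csv('Automobile_data.csv')
except FileNotFoundError:
    # Fallback: a made-up two-row stand-in so the snippet still runs.
    df = pd.DataFrame({'make': ['toyota', 'mazda'],
                       'fuel-type': ['gas', 'diesel']})
```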

Let's perform basic Data Exploration before diving into the various techniques for encoding categorical data. Let's start by viewing a few observations of the loaded dataset using the pandas head() method.
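A sketch of this step; the sample rows below are made-up stand-ins for the cars dataset.

```python
import pandas as pd

# Made-up stand-in for the cars dataset so the snippet is self-contained.
df = pd.DataFrame({'make': ['toyota', 'mazda', 'nissan', 'audi', 'bmw', 'volvo'],
                   'fuel-type': ['gas', 'diesel', 'gas', 'gas', 'gas', 'diesel']})

# head() returns the first 5 rows by default; pass n to change that.
print(df.head())
print(df.head(3))  # first 3 rows only
```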

First few rows of car dataset that was loaded

From the first few rows, we notice that the dataset comprises both Numerical and Categorical data. Since the dataset has nearly 26 features, some of the columns are not visible in this step, so let's explore it in a more precise manner.

To know which columns the dataset comprises, use pandas dataframe.columns.
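A sketch of the columns lookup, again on a made-up stand-in for the cars dataset.

```python
import pandas as pd

df = pd.DataFrame({'make': ['toyota'], 'fuel-type': ['gas'], 'price': [13950.0]})

# .columns is an Index holding every column label in the dataframe.
print(df.columns)
print(list(df.columns))  # plain Python list, if that is easier to read
```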

Columns present in the dataset

So from the above, there are columns like symboling, normalized-losses, make, fuel-type, aspiration, num-of-doors, body-style etc., in the dataset. Let's see the datatype of every feature in the dataset using pandas dtypes.
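A sketch of the dtypes check; string (categorical) columns show up as object, numerical ones as int64/float64. The sample values are stand-ins.

```python
import pandas as pd

df = pd.DataFrame({'make': ['toyota', 'mazda'],      # Categorical -> object
                   'num-of-doors': ['two', 'four'],  # Categorical -> object
                   'price': [13950.0, 16500.0]})     # Numerical   -> float64

# .dtypes lists the datatype of every column.
print(df.dtypes)
```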

Datatypes of all features in the cars dataset

From exploring the datatype of each feature in the cars dataset, we came to know that the dataset comprises both Numerical and Categorical types.

To know more about the null values and the number of observations in the dataset, use the pandas info() method, which returns general information about all features in the dataset.
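A sketch of the info() step on stand-in data; it prints the row count plus each column's non-null count and dtype, and isnull().sum() gives the null counts directly.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'num-of-doors': ['two', 'four', np.nan],
                   'price': [13950.0, np.nan, 16500.0]})

# Prints number of rows, non-null count and dtype per column.
df.info()

# Null counts per column, computed directly.
print(df.isnull().sum())
```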

General Information regarding the dataset

There are about 205 observations in the dataset, and there are null values present in the normalized-losses, bore, stroke, horsepower, peak-rpm and price columns, which are of Numerical type, so handling all of these Numerical features is out of the scope of this story. If you're curious to know different methods to handle Missing Values in a dataset, check out my post on Handling Missing Values- A Comprehensive guide on Handling Missing Values.

The Categorical feature having null values in the dataset is num-of-doors, with 2 null values in it. We will discuss the methods to handle these null values later in this story.

Since Numerical features are out of the scope of this post, we will create a new dataframe that includes only Categorical features, using the pandas select_dtypes() method, and view the first few rows of the new dataframe.
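A sketch of the select_dtypes() step on stand-in data; including 'object' keeps only the string (categorical) columns.

```python
import pandas as pd

df = pd.DataFrame({'make': ['toyota', 'mazda'],
                   'fuel-type': ['gas', 'diesel'],
                   'price': [13950.0, 16500.0]})

# Keep only the object (string/categorical) columns.
cat_df = df.select_dtypes(include=['object']).copy()
print(cat_df.head())
```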

First few rows of new dataframe

Check for null values in the modified dataframe using the pandas isnull() method. It returns TRUE where a value is null and FALSE where it is non-null.
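A sketch of the isnull() check on stand-in data; the boolean mask is usually summed per column to count the nulls.

```python
import numpy as np
import pandas as pd

cat_df = pd.DataFrame({'make': ['toyota', 'mazda', 'nissan'],
                       'num-of-doors': ['two', np.nan, 'four']})

print(cat_df.isnull())        # boolean mask: True where a value is null
print(cat_df.isnull().sum())  # null count per column
```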

Number of Null values in the modified dataframe

There are only two null values, both in the num-of-doors feature of the modified dataframe. Null values in Categorical features can be handled in two ways: one is imputing the most frequent value, and the other is predicting the missing value by training Machine Learning models, like classification or regression, on the available data. To know more about Handling Missing Values, check out my post.

Now we impute those 2 Missing Values with the Most Frequent value in the num-of-doors column using the pandas fillna() method.
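A sketch of the imputation on stand-in data; mode()[0] is the most frequent label, and the final isnull() check confirms no nulls remain.

```python
import numpy as np
import pandas as pd

cat_df = pd.DataFrame({'num-of-doors': ['four', 'four', 'two', np.nan, np.nan]})

# mode()[0] is the most frequent value ('four' in this sample).
most_frequent = cat_df['num-of-doors'].mode()[0]
cat_df['num-of-doors'] = cat_df['num-of-doors'].fillna(most_frequent)

# Re-check: no nulls should remain.
print(cat_df['num-of-doors'].isnull().sum())
```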

After imputing all the Null values with the Most Frequent value, check again for any null values in the dataframe.

Number of null values in the num-of-doors column

Pandas get_dummies()

This is one of the approaches, and also an easy one, to encode Categorical data. The pandas get_dummies() method takes a categorical feature as an argument. It then creates a Dummy Variable for every label in the feature, such that each dummy variable holds 1 or 0: 1 indicates the presence of that particular label and 0 indicates its absence.

For example, if a feature contains the labels Male and Female, applying pandas get_dummies() to that feature creates a Dummy variable for each of them, i.e., dummy variables are created for every label in the feature, holding 1 in the presence of that label and 0 in its absence.

Sample data

Now we apply pandas get_dummies() to the fuel-type feature in the dataframe.
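A sketch of the get_dummies() step on a stand-in fuel-type column; drop_first=True drops the alphabetically first dummy ('diesel' here) to avoid the dummy variable trap described below.

```python
import pandas as pd

df = pd.DataFrame({'fuel-type': ['gas', 'diesel', 'gas', 'gas']})

# One dummy per label; drop_first=True removes the first one.
dummies = pd.get_dummies(df['fuel-type'], prefix='fuel-type', drop_first=True)
print(dummies)
```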

Value counts for fuel-type feature

In the above code snippet, we passed the argument drop_first=True, which drops one Dummy Variable from all the dummy variables created for a feature. It must be dropped because otherwise we fall into the Dummy Variable Trap: the dummy variables are highly correlated, so one of them can be predicted from the others. To avoid this situation, we drop one dummy variable created after encoding.

Let us view a few rows of data after encoding the fuel-type column, using the pandas sample() method.

Few rows of encoded data

So get_dummies() created a column fuel-type_gas, which holds 1 if the fuel type is gas and 0 if it is diesel. Since we dropped one column to avoid the dummy variable trap, only the single feature fuel-type_gas was created.

Find / Replace

Another simple way to encode Ordinal Categorical data is to find and replace the value for each label with a number that preserves the intrinsic ordering among them.

Let’s replace the values in the num-of-doors feature using pandas replace() method.

Value counts for num-of-doors column

So there are only two labels in the num-of-doors column, with an intrinsic ordering of four > two. So we replace four with 4 and two with 2 (4 > 2).
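A sketch of the replace() step on a stand-in num-of-doors column, mapping each label to a number that keeps the ordering.

```python
import pandas as pd

cat_df = pd.DataFrame({'num-of-doors': ['two', 'four', 'four', 'two']})

# Replace each label with a number preserving four > two.
cat_df['num-of-doors'] = cat_df['num-of-doors'].replace('four', 4).replace('two', 2)
print(cat_df['num-of-doors'].value_counts())
```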

Let’s check the num-of-doors feature after replacing the values.

Value counts for num-of-doors column

If there are more labels in the feature to replace, we can create a dictionary with the labels as keys and pass it to the pandas replace() method.
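A sketch of the dictionary form of replace(), using the same stand-in column; this scales better when a feature has many labels.

```python
import pandas as pd

cat_df = pd.DataFrame({'num-of-doors': ['two', 'four', 'four', 'two']})

# Keys are the labels, values are their ordered numeric replacements.
door_map = {'two': 2, 'four': 4}
cat_df['num-of-doors'] = cat_df['num-of-doors'].replace(door_map)
print(cat_df.head())
```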

First few rows of num-of-doors feature data

Label Encoder

Scikit Learn provides a lot of Encoders and Transformers to encode Categorical data. One of those encoders is Label Encoder, which assigns a unique number to each label in the feature. Let's start encoding the make feature using Label Encoder.
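A sketch of the Label Encoder step on a stand-in make column. Note that LabelEncoder numbers the labels alphabetically, so the exact codes depend on which makes are present.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

cat_df = pd.DataFrame({'make': ['toyota', 'mazda', 'nissan', 'mazda']})

# Each distinct label gets an integer (assigned in alphabetical order).
le = LabelEncoder()
cat_df['make_encoded'] = le.fit_transform(cat_df['make'])
print(cat_df)
print(le.classes_)  # index i of this array is the label encoded as i
```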

Few rows of make and make_encoded features

From the above results, when the make feature is passed as an argument to Label Encoder, nissan is encoded as 12, mazda as 8, mercedes-benz as 9, mitsubishi as 11 and toyota as 19.

Unlike pandas get_dummies(), Label Encoder doesn't create any dummy variables; it encodes data into a Numerical type by assigning a unique value to each label. We can use Label Encoder and One Hot Encoder together to encode data by creating dummy variables.

Label Binarizer

Label Binarizer is a SciKit Learn class that accepts Categorical data as input and returns a NumPy array. Unlike Label Encoder, it encodes the data into dummy variables indicating the presence or absence of a particular label. Let's encode the make column data using Label Binarizer.
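A sketch of the whole Label Binarizer flow on a stand-in make column: fit, inspect the classes, then wrap the NumPy output in a dataframe.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

makes = ['toyota', 'mazda', 'nissan', 'mazda']

lb = LabelBinarizer()
encoded = lb.fit_transform(makes)   # NumPy array of 0/1 dummy columns
print(lb.classes_)                  # one dummy column per class

# Convert the array to a dataframe with the classes as column names.
encoded_df = pd.DataFrame(encoded, columns=lb.classes_)
print(encoded_df)
```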

Let's see the dummy variables, or classes, that were created after encoding with Label Binarizer.

Classes that were created after encoding

Since the output from Label Binarizer is a NumPy array, we can convert it to a pandas dataframe using the pandas DataFrame() method.

Few rows of encoded make column

Now we can clearly see that every label in the make column has been turned into a dummy variable, each holding whether that particular label is present or not.

MultiLabel Binarizer

MultiLabel Binarizer works similarly to Label Binarizer, but MultiLabel Binarizer is used when a feature contains records with multiple labels. Let's work with MultiLabel Binarizer on sample data, since our loaded dataset doesn't have any multi-label records.

Sample data having Multi Labels

We have created a list having multiple labels in each record. Now we instantiate MultiLabelBinarizer() and pass the data to it.
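A sketch of the MultiLabelBinarizer flow; the multi-label records here (made-up genre tags) stand in for the post's sample data, since the exact lists aren't shown.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical multi-label records: each record holds a set of labels.
records = [['action', 'comedy'], ['comedy'], ['action', 'thriller']]

mlb = MultiLabelBinarizer()
encoded = mlb.fit_transform(records)  # one 0/1 column per distinct label
print(mlb.classes_)

encoded_df = pd.DataFrame(encoded, columns=mlb.classes_)
print(encoded_df)
```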

Classes created after encoding

Converting the numpy array into a pandas dataframe and viewing the data.

Few rows of encoded data

Ordinal Encoder

Ordinal Encoder of SciKit Learn is used to encode categorical data into ordinal integers, i.e., it transforms all categorical labels in a feature into integers. However, the intrinsic ordering present in the original data may not be preserved in all cases, as we will see in the example below. So for Ordinal Categorical data the best choice is to replace the values yourself, not to use Ordinal Encoder.

Let’s create an sample ordinal categorical data and then we will apply Ordinal Encoder to that data to encode that data.
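A sketch of this step. Note that scikit-learn's OrdinalEncoder sorts categories alphabetically by default (High=0, Low=1, Medium=2), so the exact codes may differ from the ones reported below depending on the library used; either way the codes ignore the real ordering High > Medium > Low.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

sample_df = pd.DataFrame({'level': ['Low', 'High', 'Medium', 'Low']})

oe = OrdinalEncoder()
encoded = oe.fit_transform(sample_df[['level']])  # expects a 2D input
print(oe.categories_)   # alphabetical: High, Low, Medium
print(encoded.ravel())  # codes that do NOT follow High > Medium > Low
```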

Few rows of sample dataframe
Output returned after encoding using Ordinal Encoder

From the above two results, we can observe that Low is encoded as 0, whereas High is encoded as 1 and Medium as 2, while the intrinsic ordering among the original data is High > Medium > Low. Writing the same intrinsic ordering with the encoded values gives 1 (High) > 2 (Medium) > 0 (Low), which is mathematically false. So try to avoid this encoder for ordinal data.

Factorize Method

We can achieve ordinal data encoding with the proper ordering by first declaring the intrinsic ordering among the labels using pandas Categorical() and then converting the labels to integers with the pandas factorize() method, so that the encoded data preserves the ordering.
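A sketch of this step: the ordering is declared explicitly in Categorical(), and factorize(..., sort=True) makes the codes follow that declared order rather than the order of appearance.

```python
import pandas as pd

sample_df = pd.DataFrame({'level': ['Low', 'High', 'Medium', 'Low']})

# Declare the intrinsic ordering Low < Medium < High explicitly.
ordered = pd.Categorical(sample_df['level'],
                         categories=['Low', 'Medium', 'High'], ordered=True)

# sort=True remaps the codes to follow the declared category order.
codes, uniques = pd.factorize(ordered, sort=True)
print(codes)    # Low -> 0, Medium -> 1, High -> 2
print(uniques)
```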

Ordering among the labels
Encoded data using pandas factorize method

Now we can see the original ordering retained in the encoded data: Low encoded as 0, Medium as 1 and High as 2.

DictVectorizer

DictVectorizer of the Scikit Learn library encodes categorical data by turning every label in a feature into a dummy variable, which holds whether that particular label is present or not. DictVectorizer is applicable only when the data is in the form of a list of dictionaries. It returns a NumPy array as output. Let's work on sample data to encode categorical data using DictVectorizer.

Classes in the Encoded data

Converting the NumPy array into a pandas dataframe and viewing a few records to verify that the categorical data passed as input was encoded.

Few rows of encoded data

ColumnTransformer

ColumnTransformer of SciKit Learn transforms the columns that are passed as an argument, where the transformer to apply (such as Normalizer or OneHotEncoder) is also passed in the argument. We will be using OneHotEncoder as the transformer to encode the data. Let's encode the categorical data of the drive-wheels and engine-location columns in the loaded dataframe using ColumnTransformer.
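A sketch of the ColumnTransformer step on stand-in data; drop='first' avoids the dummy variable trap, and the toarray() guard handles scikit-learn versions that return a sparse matrix here.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'drive-wheels': ['rwd', 'fwd', '4wd', 'fwd'],
                   'engine-location': ['front', 'front', 'rear', 'front'],
                   'price': [13950.0, 16500.0, 13495.0, 17450.0]})

# One-hot encode only the two categorical columns; remainder='drop'
# leaves the other columns out of the output.
ct = ColumnTransformer(
    [('ohe', OneHotEncoder(drop='first'), ['drive-wheels', 'engine-location'])],
    remainder='drop')
encoded = ct.fit_transform(df)
if hasattr(encoded, 'toarray'):   # some versions return a sparse matrix
    encoded = encoded.toarray()
print(encoded)
```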

Classes present in encoded data

Converting the numpy array into a pandas dataframe and viewing few rows of data.

First few rows of encoded dataframe

So a total of three dummy variables were created for the drive-wheels column, and two dummy variables for the engine-location column. One dummy variable from each is dropped to avoid the dummy variable trap.

If a whole dataframe is encoded, there is no issue: the output can be assigned directly back to the dataframe. If only one or a few columns are encoded, as in the above step, the dataframe obtained as output and the original dataframe must be concatenated to continue further.

Few rows of data in the new dataframe

OneHotEncoder

OneHotEncoder of SciKit Learn encodes categorical data by creating dummy variables for each label in the feature passed as an argument. In older SciKit Learn versions it accepted only Numerical data as input, so the categorical data to be encoded is first converted into a Numerical type using LabelEncoder and then passed to the OneHotEncoder object; the output is a NumPy array. It is one of the most preferred methods.

Let’s encode categorical data of aspiration column in the dataframe using OneHotEncoder.
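A sketch of the two-step LabelEncoder + OneHotEncoder flow on a stand-in aspiration column. Recent scikit-learn versions accept the string column directly, making step 1 unnecessary; the toarray() guard handles sparse output.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame({'aspiration': ['std', 'turbo', 'std', 'std']})

# Step 1 (as in the story): convert the labels to integers first.
le = LabelEncoder()
as_int = le.fit_transform(df['aspiration']).reshape(-1, 1)

# Step 2: one-hot encode the integer codes.
ohe = OneHotEncoder()
encoded = ohe.fit_transform(as_int)
if hasattr(encoded, 'toarray'):   # output is sparse by default
    encoded = encoded.toarray()
print(encoded)
```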

Classes in encoded aspiration column

The data in the aspiration column is converted into a Numerical type using LabelEncoder. Now pass this data to the OneHotEncoder object.

First few rows of encoded aspiration dataframe

We might get a DeprecationWarning, as this way of using OneHotEncoder will be removed in the future; as an alternative, we can use ColumnTransformer with OneHotEncoder as the transformer to encode categorical data.

We are now skipping the concatenation step, keeping the length of the post in mind.

We can also encode all categorical columns at once using OneHotEncoder in the following way.

Categorical columns in the dataframe

Now pass these categorical columns to LabelEncoder to convert the data present in those columns to a Numerical type.

Few rows of data after LabelEncoding

Now pass these three columns to the OneHotEncoder object to get an encoded NumPy array as output.
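A sketch of encoding several categorical columns at once. The story's exact three columns aren't shown, so the columns below are hypothetical stand-ins; every column is label-encoded with apply(), then the whole integer frame is one-hot encoded.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical subset of categorical columns from the cars dataset.
cat_df = pd.DataFrame({'fuel-type': ['gas', 'diesel', 'gas'],
                       'aspiration': ['std', 'std', 'turbo'],
                       'engine-location': ['front', 'rear', 'front']})

# Label-encode every column, then one-hot encode the whole integer frame.
int_df = cat_df.apply(LabelEncoder().fit_transform)
encoded = OneHotEncoder().fit_transform(int_df)
if hasattr(encoded, 'toarray'):   # output is sparse by default
    encoded = encoded.toarray()
print(encoded)  # 2 dummy columns per feature -> 6 columns in total
```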

Numpy array after performing OneHotEncoding

Hurrah..!! Those were the techniques for encoding Categorical data. By concatenating the encoded dataframes with the original dataframe properly, and taking proper care of the dummy variable trap, we can obtain the fully encoded dataframe. With that data we can start training Machine Learning models.

GitHub Link of the Repository- https://github.com/itzzmeakhi/Medium/tree/master/EncodingCategoricalData

Let me know if you have anything to ask. Do share the story and clap it if you liked it.

Know more about me- itzzmeakhi.me
