Encoding Categorical data in Machine Learning
Most machine learning algorithms accept only numerical data as input. For example, the K-Nearest Neighbors algorithm calculates the Euclidean distance between two observations of a feature, and that calculation requires numerical input. So categorical data must be transformed or encoded into a numerical type before being fed to an algorithm, which in turn yields better results.
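To illustrate why numerical input matters, here is a minimal sketch (using NumPy, with made-up points) of the Euclidean distance calculation that K-Nearest Neighbors relies on; it works on numbers but would fail on raw string labels:

```python
import numpy as np

# Two observations as numeric feature vectors
a = np.array([3.0, 4.0])
b = np.array([0.0, 0.0])

# Euclidean distance: square root of the sum of squared differences
distance = np.linalg.norm(a - b)
print(distance)  # 5.0 for this 3-4-5 triangle
```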
Categorical data is of two types. Categorical data that has an intrinsic ordering among its labels is called ordinal. Categorical data that has no intrinsic ordering among its labels is called nominal.
Some examples of Nominal Categorical data are:
-> New York, New Delhi, New Jersey, England.
-> Pen, Pencil, Eraser.
-> Lion, Monkey, Zebra, Peacock.
Some examples of Ordinal Categorical data are:
-> Low, Medium, High.
-> Agree, Neutral, Disagree.
-> Unhappy, Happy, Very Happy.
-> Young, Old.
In this story we will discuss various techniques to encode categorical data, which is one of the steps of data preprocessing to be performed before feeding data to machine learning models.
We will be working on a car dataset to explore various techniques of encoding categorical data.
Download the car’s dataset by clicking here
Firstly, load all the necessary libraries required for our project. We will initially load the NumPy and Pandas libraries: NumPy for performing mathematical operations, and Pandas for manipulating the dataset.
# Importing all the necessary libraries
import numpy as np
import pandas as pd
The next step after importing all the basic required libraries is to load our car dataset into a pandas dataframe using the pandas read_csv() method.
# Loading the dataset
df_car = pd.read_csv('datasets/car-data.csv')
Let’s perform some basic data exploration before diving into the various techniques for encoding categorical data. Let’s start by viewing a few observations of the loaded dataset using the pandas head() method.
# Viewing first few rows of data
df_car.head()
From the first few rows, we notice that the dataset comprises both numerical and categorical data. Since the dataset has nearly 26 features, some of the columns are not visible in this step. So let’s explore in a more precise manner.
To know what columns the dataset comprises, use pandas dataframe.columns.
# Viewing the columns present in the dataset
df_car.columns
So from the above, there are columns like symboling, normalized-losses, make, fuel-type, aspiration, num-of-doors, body-style etc. in the dataset. Let’s see the datatype of every feature in the dataset using pandas dtypes.
# Datatypes of columns present in the dataset
df_car.dtypes
From exploring the datatype of each feature in the cars dataset, we learn that the dataset comprises both numerical and categorical types.
To know more about the null values and the number of observations in the dataset, use the pandas info() method, which returns general information about all features in the dataset.
# General Information regarding the dataset
df_car.info()
There are about 205 observations in the dataset, and there are null values present in the normalized-losses, bore, stroke, horsepower, peak-rpm and price columns, which are of numerical type, so handling these numerical features is out of scope of this story. If you’re curious to know different methods to handle missing values in a dataset, check out my post on Handling Missing Values - A Comprehensive guide on Handling Missing Values.
The categorical feature having null values in the dataset is num-of-doors, which has 2 null values. We will discuss the methods to handle these null values later in this story.
Since numerical features are out of scope of this post, we will create a new dataframe which includes only the categorical features, using the pandas select_dtypes() method, and view the first few rows of the new dataframe.
# Including columns which are of object datatype in modified dataframe
df_car_mod = df_car.select_dtypes(include=['object'])

# Viewing first few rows of data
df_car_mod.head()
Check for null values in the modified dataframe using the pandas isnull() method. It returns TRUE wherever a null value is present and FALSE for a non-null value.
# Checking for any null values present in the dataset
df_car_mod.isnull().sum()
There are only two null values, both in the num-of-doors feature of the modified dataframe. Null values in categorical features can be handled in two ways. One is imputing the most frequent value; the other is predicting the missing value by training machine learning models (classification, regression etc.) on the available data. To know more about handling missing values, check out my post.
Now we impute those 2 missing values with the most frequent value in the num-of-doors column using the pandas fillna() method.
# Replacing null values with most frequent value
df_car_mod['num-of-doors'] = df_car_mod['num-of-doors'].fillna(df_car_mod['num-of-doors'].value_counts().index[0])
After imputing the null values with the most frequent value, check again for any null values in the dataframe.
# Checking for null values
df_car_mod['num-of-doors'].isnull().sum()
Pandas get_dummies()
This is one of the approaches, and also an easy one, to encode categorical data. The pandas get_dummies() method takes a categorical feature as an argument. It then creates a dummy variable for every label in the feature, such that each dummy variable holds 1 or 0: 1 indicates the presence of that particular label and 0 indicates its absence.
For example, if a feature contains the labels Male and Female, then after applying pandas get_dummies() to that feature, dummy variables for both the Male and Female labels are created, i.e., a dummy variable is created for every label in the feature. Each dummy variable holds 1 in the presence of its label and 0 in its absence.
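As a minimal sketch of this behaviour (on a made-up gender column, not the car dataset):

```python
import pandas as pd

# A toy feature with two labels
df_toy = pd.DataFrame({'gender': ['Male', 'Female', 'Female', 'Male']})

# One dummy column is created per label, named gender_Female and gender_Male;
# each holds 1 (or True) where its label is present and 0 (or False) elsewhere
dummies = pd.get_dummies(df_toy, columns=['gender'])
print(dummies.columns.tolist())  # ['gender_Female', 'gender_Male']
```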
Now we apply pandas get_dummies() to the fuel-type feature in the dataframe.
# Value counts for fuel_type
df_car_mod['fuel-type'].value_counts()

# Encoding fuel_type using get_dummies
df_car_mod = pd.get_dummies(df_car_mod, columns=['fuel-type'], drop_first=True)
In the above code snippet, we passed the argument drop_first=True, which drops one dummy variable from all the dummy variables created for a feature. It is dropped to avoid the dummy variable trap: when all dummy variables are kept, they are perfectly correlated (any one of them can be predicted from the others), which causes multicollinearity. So to avoid this situation, we drop one of the dummy variables created after encoding.
Let us view a few rows of data after encoding the fuel-type column, using the pandas sample() method.
# Few rows of encoded data
df_car_mod.sample(10)
So get_dummies() created a column fuel-type_gas, which holds 1 if the fuel type is gas and 0 if it is diesel. Since we dropped one column to avoid the dummy variable trap, only the single feature fuel-type_gas has been created.
Find / Replace
Another simple way to encode ordinal categorical data is to find and replace the value for each label, choosing replacement values that preserve the intrinsic ordering among the labels.
Let’s replace the values in the num-of-doors feature using pandas replace() method.
## Find and Replace value in the dataset
# Value counts for num-of-doors column
df_car_mod['num-of-doors'].value_counts()
So there are only two labels in the num-of-doors column, and there is an intrinsic ordering among them: four > two. So we replace four with 4 and two with 2 (4 > 2).
# Replacing values in num-of-doors column
df_car_mod['num-of-doors'] = df_car_mod['num-of-doors'].replace('four', 4)
df_car_mod['num-of-doors'] = df_car_mod['num-of-doors'].replace('two', 2)
Let’s check the num-of-doors feature after replacing the values.
# Value counts for num-of-doors column
df_car_mod['num-of-doors'].value_counts()
If there are more labels in the feature to replace, then we can create a dictionary with the labels as keys and pass it to the pandas replace() method.
# Create a dictionary to find and replace values
dic_to_replace = {"num-of-doors": {"four": 4, "two": 2}}
df_car_mod.replace(dic_to_replace, inplace=True)

# View first few rows of data
df_car_mod['num-of-doors'].head()
Label Encoder
Scikit-Learn provides a lot of encoders and transformers to encode categorical data. One of those encoders is LabelEncoder, which assigns a unique number to each label in the feature. Let’s start encoding the make feature using LabelEncoder.
# Using SciKit Learn
# Encoding make column using LabelEncoder
from sklearn.preprocessing import LabelEncoder

labelencoder = LabelEncoder()
df_car_mod['make_encoded'] = labelencoder.fit_transform(df_car_mod['make'])

# Viewing few rows of make and its encoded columns
df_car_mod[['make', 'make_encoded']].sample(20)
From the above results, when the make feature is passed as an argument to LabelEncoder, nissan is encoded as 12, mazda as 8, mercedes-benz as 9, mitsubishi as 11 and toyota as 19.
Unlike pandas get_dummies(), LabelEncoder doesn’t create any dummy variables; it encodes data into a numerical type by assigning a unique value to each label. We can use LabelEncoder and OneHotEncoder in combination to encode data by creating dummy variables.
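A minimal sketch of LabelEncoder on made-up data (note that labels receive integers in sorted order, and inverse_transform() recovers the original labels):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
animals = ['Lion', 'Monkey', 'Zebra', 'Lion']

# Classes are sorted, so Lion -> 0, Monkey -> 1, Zebra -> 2
encoded = le.fit_transform(animals)
print(encoded.tolist())  # [0, 1, 2, 0]

# The mapping is reversible
print(le.inverse_transform(encoded).tolist())  # ['Lion', 'Monkey', 'Zebra', 'Lion']
```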
Label Binarizer
LabelBinarizer is a Scikit-Learn class that accepts categorical data as input and returns a NumPy array. Unlike LabelEncoder, it encodes the data into dummy variables indicating the presence or absence of a particular label. Let’s encode the make column data using LabelBinarizer.
# Encoding make column using LabelBinarizer
from sklearn.preprocessing import LabelBinarizer

labelbinarizer = LabelBinarizer()
make_encoded_results = labelbinarizer.fit_transform(df_car_mod['make'])
Let’s see the dummy variables, or classes, that were created after encoding with LabelBinarizer.
# Classes created in make column after encoding
labelbinarizer.classes_
Since the output from LabelBinarizer is a NumPy array, we can convert the NumPy array to a pandas dataframe using the pandas DataFrame() constructor.
# Converting a numpy array into a pandas dataframe
df_make_encoded = pd.DataFrame(make_encoded_results, columns=labelbinarizer.classes_)

# Viewing few rows of data
df_make_encoded.sample(10)
Now we can clearly see that every label in the make column has become a dummy variable, holding data about the presence or absence of that particular label.
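On a made-up three-label feature, the behaviour looks like this (a sketch; classes are sorted alphabetically, and inverse_transform() recovers the labels from the dummy rows):

```python
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
items = ['Pen', 'Eraser', 'Pencil', 'Pen']

# Classes are sorted: Eraser, Pen, Pencil
binarized = lb.fit_transform(items)
print(lb.classes_.tolist())  # ['Eraser', 'Pen', 'Pencil']
print(binarized.tolist())    # the 'Pen' rows are [0, 1, 0]

# Recover the original labels from the dummy rows
print(lb.inverse_transform(binarized).tolist())
```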
MultiLabel Binarizer
MultiLabelBinarizer works similarly to LabelBinarizer, but it is used when a feature contains records having multiple labels. Let’s work with MultiLabelBinarizer on sample data, since our loaded dataset doesn’t have any records with multiple labels.
# Creating a MultiLabel Array
multilabel_feature = [("New Delhi", "New York"),
                      ("New York", "Sydney", "Hyderabad", "Bangalore"),
                      ("Hyderabad", "Sydney", "Chennai"),
                      ("Chennai", "New Delhi", "Bangalore"),
                      ("Bangalore", "Chennai")]

# Printing the MultiLabel Array
print(multilabel_feature)
We have created a list having multiple labels in each record. Now we instantiate MultiLabelBinarizer() and pass the data to it.
# Encoding MultiLabel data using MultiLabel Binarizer
from sklearn.preprocessing import MultiLabelBinarizer

multilabelbinarizer = MultiLabelBinarizer()
multilabel_encoded_results = multilabelbinarizer.fit_transform(multilabel_feature)

# Classes created in MultiLabel data after Encoding
multilabelbinarizer.classes_
Converting the numpy array into a pandas dataframe and viewing the data.
# Converting a Numpy Array into a pandas dataframe
df_multilabel_data = pd.DataFrame(multilabel_encoded_results, columns=multilabelbinarizer.classes_)

# Viewing few rows of data
df_multilabel_data.head()
Ordinal Encoder
OrdinalEncoder in Scikit-Learn is used to encode categorical data into ordinal integers: it transforms each label in a feature into an integer. However, the integers it assigns by default do not necessarily follow the intrinsic ordering of the labels, as we will see in the example below. So for ordinal categorical data, the better choice is to replace the values yourself rather than rely on OrdinalEncoder’s default mapping.
Let’s create some sample ordinal categorical data and then apply OrdinalEncoder to encode it.
# Creating a Pandas dataframe for ordinal data
data = {'Employee Id': [112, 113, 114, 115], 'Income Range': ['Low', 'High', 'Medium', 'High']}
df_ordinal = pd.DataFrame(data)

# Viewing few rows of created dataframe
df_ordinal.head()
# Encoding above ordinal data using OrdinalEncoder
from sklearn.preprocessing import OrdinalEncoder

ordinalencoder = OrdinalEncoder()
ordinalencoder.fit_transform(df_ordinal[['Income Range']])
From the above two results, we can observe that Low is encoded as 0, whereas High is encoded as 1 and Medium as 2, while the intrinsic ordering of the original data is High > Medium > Low. Trying to write the same intrinsic ordering with the encoded values gives 1 (High) > 2 (Medium) > 0 (Low), which is false mathematically. So try to avoid this for ordinal data.
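As an aside, recent scikit-learn versions also let you pass the intended order explicitly through OrdinalEncoder’s categories parameter, which sidesteps the problem above. A minimal sketch on the same made-up Low/Medium/High labels (assuming a scikit-learn version that supports this parameter):

```python
from sklearn.preprocessing import OrdinalEncoder
import numpy as np

# Passing an explicit category order makes the encoding respect
# the intrinsic ordering: Low -> 0, Medium -> 1, High -> 2
enc = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
result = enc.fit_transform(np.array([['Low'], ['High'], ['Medium'], ['High']]))
print(result.ravel().tolist())  # [0.0, 2.0, 1.0, 2.0]
```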
Factorize Method
We can achieve ordinal data encoding with proper ordering by first defining an intrinsic ordering among the labels using pandas Categorical(), and then converting the labels to integers using the pandas factorize() method, so that the encoded data preserves the intended ordering.
# Using pandas factorize method for ordinal data
categories = pd.Categorical(df_ordinal['Income Range'], categories=['Low', 'Medium', 'High'], ordered=True)

# Order of labels set for data
categories
# Factorizing the column data
labels, unique = pd.factorize(categories, sort=True)
df_ordinal['Income Range'] = labels

# Encoded Income Range Data
df_ordinal['Income Range']
Now we can see the original ordering retained in the encoded data: Low is encoded as 0, Medium as 1 and High as 2.
DictVectorizer
DictVectorizer in the Scikit-Learn library encodes categorical data by turning every label in a feature into a dummy variable that records the presence or absence of that label. DictVectorizer is applicable only when the data is in the form of a list of dictionaries, and it returns a NumPy array as output. Let’s work on sample data to encode categorical data using DictVectorizer.
# DictVectorizer for encoding data
# Creating a dictionary for sample data
data_prices = [
    {'price': 850000, 'rooms': 4, 'neighborhood': 'Queen Anne'},
    {'price': 700000, 'rooms': 3, 'neighborhood': 'Fremont'},
    {'price': 650000, 'rooms': 3, 'neighborhood': 'Wallingford'},
    {'price': 600000, 'rooms': 2, 'neighborhood': 'Fremont'}
]

# Encoding the data using DictVectorizer
from sklearn.feature_extraction import DictVectorizer

# Instantiating the DictVectorizer object
dictvectorizer = DictVectorizer(sparse=False, dtype=int)
data_prices_encoded = dictvectorizer.fit_transform(data_prices)

# Feature names of encoded data
dictvectorizer.get_feature_names()
Convert the NumPy array into a pandas dataframe and view a few records to verify that the categorical data passed as input was encoded.
# Converting encoded data into pandas dataframe
df_prices = pd.DataFrame(data_prices_encoded, columns=dictvectorizer.get_feature_names())

# Viewing few rows of data
df_prices.head()
ColumnTransformer
ColumnTransformer in Scikit-Learn transforms the columns passed as an argument, where the transformer to apply (such as Normalizer or OneHotEncoder) is also passed as an argument. We will use OneHotEncoder as the transformer to encode data. Let’s encode the categorical data of the drive-wheels and engine-location columns in the loaded dataframe using ColumnTransformer.
# Encoding drive-wheels and engine-location columns using ColumnTransformer and OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

ctransformer = ColumnTransformer([("encoded_data", OneHotEncoder(sparse=False), [4, 5]),])
ct_encoded_results = ctransformer.fit_transform(df_car_mod)

# Get Feature Names of Encoded columns
ctransformer.get_feature_names()
Converting the numpy array into a pandas dataframe and viewing few rows of data.
# Converting the numpy array into a pandas dataframe
df_ct_encoded_data = pd.DataFrame(ct_encoded_results, columns=ctransformer.get_feature_names())

# Viewing first few rows of data
df_ct_encoded_data.head()
So in total, three dummy variables are created for the drive-wheels column, and two dummy variables are created for the engine-location column. One dummy variable from each is dropped to avoid the dummy variable trap.
# Dropping dummy variables to avoid multicollinearity
df_ct_encoded_data.drop(['encoded_data__x0_4wd', 'encoded_data__x1_front'], inplace=True, axis=1)

# Viewing few rows of data after dropping dummy variables
df_ct_encoded_data.head()
If the whole dataframe is encoded, there is no issue: the result can be assigned directly back to the dataframe. If only one or more columns are encoded, as in the step above, the output dataframe and the original dataframe are concatenated to continue further.
# Concatenating the encoded dataframe with the original dataframe
df = pd.concat([df_car_mod.reset_index(drop=True), df_ct_encoded_data.reset_index(drop=True)], axis=1)

# Dropping drive-wheels, make and engine-location columns as they are encoded
df.drop(['drive-wheels', 'engine-location', 'make'], inplace=True, axis=1)

# Viewing few rows of data
df.head()
OneHotEncoder
OneHotEncoder in Scikit-Learn encodes categorical data by creating dummy variables for each label in the feature passed as an argument. It accepts only numerical data as input, so the categorical data to be encoded is first converted to a numerical type using LabelEncoder and then passed to the OneHotEncoder object; the output is a NumPy array. It is one of the most preferred methods.
Let’s encode categorical data of aspiration column in the dataframe using OneHotEncoder.
# OneHotEncoder
# Encoding aspiration using LabelEncoder
from sklearn.preprocessing import LabelEncoder

lenc = LabelEncoder()
df['aspiration'] = lenc.fit_transform(df['aspiration'])

# Classes in the encoded data
lenc.classes_
The data in the aspiration column is converted into Numerical type using LabelEncoder. Now pass this data to the OneHotEncoder object.
# Encoding using OneHotEncoder
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(categorical_features=[0], sparse=False)
ohe_results = ohe.fit_transform(df[['aspiration']])

# Converting OneHotEncoded results into a dataframe
df_ohe_results = pd.DataFrame(ohe_results, columns=lenc.classes_)

# Viewing first few rows of data
df_ohe_results.head()
We might get a DeprecationWarning, as this usage of OneHotEncoder will be removed in the future; the alternative is to use ColumnTransformer with OneHotEncoder as the transformer to encode categorical data.
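For reference, newer Scikit-Learn versions (0.20 and later) also let OneHotEncoder work on string columns directly, with no LabelEncoder step. A minimal sketch on made-up aspiration values:

```python
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# In newer scikit-learn, OneHotEncoder accepts string data directly
ohe = OneHotEncoder()
encoded = ohe.fit_transform(np.array([['std'], ['turbo'], ['std']])).toarray()

print(ohe.categories_)   # one array of discovered categories per input column
print(encoded.tolist())  # [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
```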
We now skip the concatenation step, keeping the length of the post in mind.
We can also encode all categorical columns at once using OneHotEncoder, in the following way.
# To perform OneHotEncoder for all Categorical columns
# Categorical columns present in the dataframe
categorical_cols = df.columns[df.dtypes==object].tolist()
categorical_cols
Now pass these categorical columns to LabelEncoder to convert the data present in those columns to a Numerical type.
# Performing LabelEncoding for all remaining categorical features
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df[categorical_cols] = df[categorical_cols].apply(lambda col: le.fit_transform(col))

# Viewing first few rows of data
df[categorical_cols].head(10)
Now pass these columns to the OneHotEncoder object to get an encoded NumPy array as output.
from sklearn.preprocessing import OneHotEncoder

onehotencoder = OneHotEncoder(sparse=False)
onehotencoder.fit_transform(df[categorical_cols])
Hurrah..!! Those were the techniques for encoding categorical data. By concatenating the encoded dataframes and the original dataframe properly, and taking proper care of the dummy variable trap, we can obtain a fully encoded dataframe. With that data we can start training machine learning models.
GitHub Link of the Repository- https://github.com/itzzmeakhi/Medium/tree/master/EncodingCategoricalData
Let me know if you have anything to ask. Do share the story and clap it if you liked it.