Six steps to hone your Data: Data Preprocessing, Part 4
This tutorial answers all the following questions:
- What are the categorical values?
- What happens when categorical values are passed to ML models?
- How to encode categorical values?
We have now covered almost 50% of our data preprocessing journey.
It should be quite clear by now that if we want our ML model to be accurate, we must take proper care of our data and hone it properly.
So far, we have successfully imported all the necessary libraries, imported our dataset and separated dependent and independent variables, and handled missing values.
If you have not seen these steps, I would suggest you check previous tutorials.
We came across a point during our discussion so far about how a machine learning model cannot interpret anything which is not in the language of 0’s and 1’s.
If that’s the case, how will our ML model read the dataset containing categorical values?
Categorical data refers to the qualitative information (textual) within our dataset.
Since our ML models are built upon mathematical equations, they cannot work with categorical data directly, and thus we need to encode this data into a numerical format.
And that is what our Step 4 is!
Step 4] Encoding Categorical Data
Now that we have a clear understanding of what categorical data is and why encoding it is necessary, we can move ahead and explore the methods by which we can transform categorical data into numeric data.
1. Label-encoding Method
Label Encoding replaces each categorical value in a column with a numeric value.
To implement and understand how this method functions, let us consider a Dataset.
To carry out Label Encoding, we need to use the LabelEncoder() class from the scikit-learn library. This class converts textual values into numeric labels.
Now that we have imported the library, we need to follow the same steps and separate the dataset into dependent and independent variables and then encode categorical data present in both respectively.
First, we read our dataset and separate it into X and y accordingly, then create an object of the LabelEncoder class and assign it to label_encoder.
The expression X[:, 0] selects all the rows (because of :) of the first column (index 0); calling fit_transform on it fits the LabelEncoder to those values and transforms them.
The values will then immediately be encoded to 0,1,2,3… according to the categories present.
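The steps above can be sketched as follows. Note that the tutorial's original dataset is not shown here, so the column names and values below are invented stand-ins purely for illustration:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in for the tutorial's dataset (names and values assumed)
dataset = pd.DataFrame({
    "Country": ["India", "Japan", "Germany", "France", "Brazil"],
    "Age": [34, 28, 45, 31, 39],
    "Purchased": ["yes", "no", "yes", "no", "yes"],
})
X = dataset.iloc[:, :-1].values  # independent variables (features)
y = dataset.iloc[:, -1].values   # dependent variable (target)

label_encoder = LabelEncoder()
# Select every row (:) of the first column (0) and encode it in place
X[:, 0] = label_encoder.fit_transform(X[:, 0])
print(X[:, 0])  # -> [3 4 2 1 0]
```

LabelEncoder assigns labels in alphabetical order, so with these five countries Brazil becomes 0, France 1, Germany 2, India 3, and Japan 4.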
However, Label Encoding has a disadvantage: it labels the data according to the categories present, which imposes an arbitrary numerical ordering on them.
For example, in our dataset, we have five different countries. Thus they are encoded by Label Encoding technique as 0,1,2,3 and 4.
Sometimes, our ML model tries to devise a correlation between these labels.
As you can see, Japan is encoded as the numeric value 4, and India is encoded as 3. Since 4 > 3, does that mean Japan is of higher priority than India?
Of course not!
Thus sometimes, Label Encoding ends up defeating its purpose and results in misinterpreting data.
However, binary categorical data, such as a dependent variable containing only 'yes' and 'no', can be encoded using this technique, as two categories cannot produce such a misleading ordering and it will not harm the accuracy of our model.
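For a binary dependent variable, this is as simple as one fit_transform call (the 'yes'/'no' values below are illustrative):

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical binary dependent variable
y = ["yes", "no", "yes", "no", "yes"]
y_encoded = LabelEncoder().fit_transform(y)
print(y_encoded)  # -> [1 0 1 0 1], i.e. 'no' -> 0, 'yes' -> 1
```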
2. One-hot-encoding Method
When Label Encoding fails, the One-hot-encoding method comes into play.
Let us consider the same dataset.
First, we additionally need to import the ColumnTransformer and OneHotEncoder classes from scikit-learn.
Then we import the dataset, separate it into X and y (features and target), create an object of the ColumnTransformer class, and assign it to a variable.
ColumnTransformer lets us apply One-Hot Encoding to selected columns and assemble the result, all in just one line of code.
The ColumnTransformer constructor takes quite a few arguments, but we're only interested in the two that matter for encoding.
The first argument, transformers, is a list of tuples. Each tuple contains the following elements, in order:
- Name: A name for the column transformer, which makes setting parameters and searching for the transformer easier.
- Transformer: Here we’re supposed to provide an estimator. We can also simply pass “passthrough” or “drop” instead. But since we’re encoding categorical data in this example, we’ll use the OneHotEncoder class here.
- Column(s): The list of columns that you want to be transformed. In this case, we’ll only transform the first column.
The second parameter we’re interested in is remainder. It tells the transformer what to do with the other columns in the dataset.
By default, only the columns that are transformed are returned by the transformer; all other columns are dropped. Hence we pass the keyword ‘passthrough’ so that the untransformed columns are kept.
The result of encoding is then converted into a NumPy array.
Let us take a look at the code and output of this method.
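Since the original snippet is not reproduced here, the sketch below uses an invented stand-in dataset (column names and values assumed) to show the same pattern:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the tutorial's dataset (names and values assumed)
dataset = pd.DataFrame({
    "Country": ["India", "Japan", "Germany"],
    "Age": [34, 28, 45],
    "Purchased": ["yes", "no", "yes"],
})
X = dataset.iloc[:, :-1].values  # features: Country, Age
y = dataset.iloc[:, -1].values   # target: Purchased

# One tuple per transformer: (name, estimator, columns to transform);
# remainder='passthrough' keeps the untransformed Age column
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
X = np.array(ct.fit_transform(X))
print(X)
```

Each country column is replaced by three 0/1 indicator columns (one per category, in alphabetical order), followed by the untouched Age column, so India's row becomes [0, 1, 0, 34].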
As we can see, the output is encoded into numeric value without jeopardizing the accuracy of our model.
We are now a step ahead in our Machine Learning journey. Congratulations on completing Step 4 of Data Preprocessing!