Encoding Categorial Variables

BuddhaReddy Polepelli
5 min readMay 19, 2023

--

There are several methods for encoding categorical variables, including

1. One-Hot Encoding

2. Dummy Encoding

3.Ordinal Encoding

4. Binary Encoding

5. Count Encoding

6. Target Encoding

Let’s take a closer look at each of these methods.

One-Hot Encoding:

• One-Hot Encoding is the Most Common method for encoding Categorical variables.

• a Binary Column is created for each Unique Category in the variable.

• If a category is present in a sample, the corresponding column is set to 1, and all other columns are set to 0.

• For example, if a variable has three categories ‘A’, ‘B’ and ‘C’, three columns will be created and a sample with category ‘B’ will have the value [0,1,0].

# One-Hot Encoding: 
# create a sample dataframe with a categorical variable
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red']})
# perform one-hot encoding on the 'color' column
one_hot = pd.get_dummies(df['color'])
# concatenate the one-hot encoding with the original dataframe
df1 = pd.concat([df, one_hot], axis=1)
# drop the original 'color' column
df1 = df1.drop('color', axis=1)

Dummy Encoding

• Dummy coding scheme is similar to one-hot encoding.

• This categorical data encoding method transforms the categorical variable into a set of binary variables [0/1].

• In the case of one-hot encoding, for N categories in a variable, it uses N binary variables.

• The dummy encoding is a small improvement over one-hot-encoding. Dummy encoding uses N-1 features to represent N labels/categories.

One-Hot Encoding vs Dummy Encoding:

One-Hot Encoding — N categories in a variable, N binary variables.

Dummy encoding — N categories in a variable, N-1 binary variables.

# Create a sample dataframe with categorical variable
data = {'Color': ['Red', 'Green', 'Blue', 'Red', 'Blue']}
df = pd.DataFrame(data)
# Use get_dummies() function for dummy encoding
dummy_df = pd.get_dummies(df['Color'], drop_first=True, prefix='Color')
# Concatenate the dummy dataframe with the original dataframe
df = pd.concat([df, dummy_df], axis=1)

Label Encoding:

  • Each unique category is assigned a Unique Integer value.
  • This is a simpler encoding method, but it has a Drawback in that the assigned integers may be misinterpreted by the machine learning algorithm as having an Ordered Relationship when in fact they do not.
from sklearn.preprocessing import LabelEncoder
# Create a sample dataframe with categorical data
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green']})
print(f"Before Encoding the Data:\n\n{df}\n")# Create a LabelEncoder object
le = LabelEncoder()
# Fit and transform the categorical data
df['color_label'] = le.fit_transform(df['color'])

Ordinal Encoding:

• Ordinal Encoding is used when the categories in a variable have a Natural Ordering.

• In this method, the categories are assigned a numerical value based on their order, such as 1, 2, 3, etc.

For example, if a variable has categories ‘Low’, ‘Medium’ and ‘High’, they can be assigned the values 1, 2, and 3, respectively.

# Ordinal Encoding:
# create a sample dataframe with a categorical variable
df = pd.DataFrame({'quality': ['low', 'medium', 'high', 'medium']})
print(f"Before Encoding the Data:\n\n{df}\n")
# specify the order of the categories
quality_map = {'low': 0, 'medium': 1, 'high': 2}
# perform ordinal encoding on the 'quality' column
df['quality_map'] = df['quality'].map(quality_map)

Binary Encoding:

• Binary Encoding is similar to One-Hot Encoding, but instead of creating a separate column for each category, the categories are represented as binary digits.

  • For example, if a variable has four categories ‘A’, ‘B’, ‘C’ and ‘D’, they can be represented as 0001, 0010, 0100 and 1000, respectively.
# Binary Encoding:
import pandas as pd# create a sample dataframe with a categorical variable
df = pd.DataFrame({'animal': ['cat', 'dog', 'bird', 'cat']})
print(f"Before Encoding the Data:\n\n{df}\n")
# perform binary encoding on the 'animal' column
animal_map = {'cat': 0, 'dog': 1, 'bird': 2}
df['animal'] = df['animal'].map(animal_map)
df['animal'] = df['animal'].apply(lambda x: format(x, 'b'))
# print the resulting dataframe
print(f"After Encoding the Data:\n\n{df}\n")

Count Encoding:

• Count Encoding is a method for encoding categorical variables by counting the number of times a category appears in the dataset.

  • For example, if a variable has categories ‘A’, ‘B’ and ‘C’ and category ‘A’ appears 10 times in the dataset, it will be assigned a value of 10.
# Count Encoding:
# create a sample dataframe with a categorical variable
df = pd.DataFrame({'fruit': ['apple', 'banana', 'apple', 'banana']})
print(f"Before Encoding the Data:\n\n{df}\n")
# perform count encoding on the 'fruit' column
counts = df['fruit'].value_counts()
df['fruit'] = df['fruit'].map(counts)
# print the resulting dataframe
print(f"After Encoding the Data:\n\n{df}\n")

Target Encoding:

• This is a more advanced encoding technique used for dealing with high cardinality categorical features, i.e., features with many unique categories.

• The average target value for each category is calculated and this average value is used to replace the categorical feature.

  • This has the advantage of considering the relationship between the target and the categorical feature, but it can also lead to overfitting if not used with caution.
# Create a sample dataframe with categorical data and target
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green'],
'target': [0, 1, 0, 1, 0]})
print(f"Before Encoding the Data:\n\n{df}\n")
# Calculate the mean target value for each category
target_mean = df.groupby('color')['target'].mean()
# Replace the categorical data with the mean target value
df['color_label'] = df['color'].map(target_mean)
print(f"After Encoding the Data:\n\n{df}")

Conclusion:

Data Encoding is an important step in the pre-processing of data for machine learning algorithms. The choice of encoding method depends on the type of data and the problem being solved. One-Hot Encoding is the most commonly used method, but other methods like Ordinal Encoding, Binary Encoding, and Count Encoding may also be used in certain situations.

--

--