What Is LabelEncoder? Build Your Own LabelEncoder
Data pre-processing is an essential part of any data science project: it is where you transform raw, unstructured data into a form that machine learning algorithms can work with. The majority of machine learning algorithms require numerical inputs, so LabelEncoder is one way to deal with categorical variables (like Gender, Color, etc.).
LabelEncoder transforms categorical variables into numeric form. It encodes each categorical value as an integer between 0 and the number of distinct values minus 1.
# For example, let's encode colors
# Actual values
colors = ["Red", "Blue", "Green"]
# Their encoded values (labels are numbered in alphabetical order: Blue=0, Green=1, Red=2)
encoded_values = [2, 0, 1]
Let’s try to build our own LabelEncoder.
1. Create a function that takes a list of categorical values as input, sorts the unique values, and returns a dictionary mapping each value to its index.
def fit(classes):
    unique_classes = list(set(classes))
    unique_classes.sort()
    encoded_label = {}
    for index, label in enumerate(unique_classes):
        if label not in encoded_label:
            encoded_label[label] = index
    return encoded_label
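As a quick check of the function above, the sketch below restates it in a more compact form and calls it on a small list that contains a duplicate. The sample list `colors` is an assumption made for illustration. Duplicates collapse to a single entry, and labels are numbered in alphabetical order.

```python
def fit(classes):
    """Map each distinct label to its position in sorted order."""
    return {label: index for index, label in enumerate(sorted(set(classes)))}


# Hypothetical sample data, chosen only to show duplicates being collapsed.
colors = ["Red", "Blue", "Green", "Red"]
mapping = fit(colors)
print(mapping)  # {'Blue': 0, 'Green': 1, 'Red': 2}
```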
2. Now create a function that takes a single value as input and returns its associated numeric value, using the fit() function we defined above.
def transform(row):
    classes = ["Green", "Blue", "Red"]
    encoded_label = fit(classes)
    return encoded_label[row]
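One thing worth noticing is that this version rebuilds the mapping on every call and only knows the three hard-coded colors. The sketch below is a variation, not the code above: it assumes the mapping is built once and passed in as a second argument (the `encoded_label` parameter is introduced here for illustration), and it shows that a label the encoder has never seen raises a `KeyError`.

```python
def fit(classes):
    """Map each distinct label to its position in sorted order."""
    return {label: index for index, label in enumerate(sorted(set(classes)))}


def transform(row, encoded_label):
    """Look up one label in a mapping that was built beforehand."""
    return encoded_label[row]


# Build the mapping once, then reuse it for every lookup.
encoded_label = fit(["Green", "Blue", "Red"])
encoded = [transform(color, encoded_label) for color in ["Red", "Blue", "Red"]]
print(encoded)  # [2, 0, 2]

# "White" was never passed to fit(), so the dictionary lookup fails.
try:
    transform("White", encoded_label)
    unseen_failed = False
except KeyError:
    unseen_failed = True
print(unseen_failed)  # True
```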
3. Now that we have functions to build the labels and to transform a value, it's time to refactor the code into a class.
import pandas as pd


class LabelEncoder:
    def __init__(self):
        # Maps each label to its integer code; filled in by fit()
        self.classes = None

    def fit(self, classes):
        unique_classes = list(set(classes))
        unique_classes.sort()
        encoded_label = {}
        for index, label in enumerate(unique_classes):
            if label not in encoded_label:
                encoded_label[label] = index
        self.classes = encoded_label

    def transform(self, row):
        encoded_label = self.classes
        return encoded_label[row]


if __name__ == '__main__':
    color = ['Red', 'Red', 'Blue', 'Green', 'Red', 'White']
    df = pd.DataFrame()
    df['color'] = color
    le = LabelEncoder()
    le.fit(df['color'])
    # Map over the 'color' column (a Series), not the whole DataFrame
    df['labelled_color'] = df['color'].map(lambda row: le.transform(row))
    print(df)
And here's what the output looks like:
   color  labelled_color
0    Red               2
1    Red               2
2   Blue               0
3  Green               1
4    Red               2
5  White               3
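If you want to go one step further, the same idea can be written without pandas and extended to decode the numbers back into labels. The sketch below is an assumption-laden variation rather than the class above: the name `SimpleLabelEncoder` is hypothetical, `transform` takes a whole list instead of a single value, and `inverse_transform` is borrowed as a method name from scikit-learn's LabelEncoder.

```python
class SimpleLabelEncoder:
    """Hypothetical pure-Python variant of the encoder built in this post."""

    def __init__(self):
        self.classes = None

    def fit(self, classes):
        # Number the distinct labels in sorted order.
        self.classes = {label: i for i, label in enumerate(sorted(set(classes)))}
        return self  # returning self allows SimpleLabelEncoder().fit(...)

    def transform(self, values):
        # Encode a whole list of labels at once.
        return [self.classes[value] for value in values]

    def inverse_transform(self, codes):
        # Reverse the mapping: integer code -> original label.
        decode = {i: label for label, i in self.classes.items()}
        return [decode[code] for code in codes]


colors = ['Red', 'Red', 'Blue', 'Green', 'Red', 'White']
le = SimpleLabelEncoder().fit(colors)
codes = le.transform(colors)
print(codes)  # [2, 2, 0, 1, 2, 3]
print(le.inverse_transform(codes) == colors)  # True
```

The codes match the `labelled_color` column in the output above, and decoding them returns the original list.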
I hope this helps. If you have read this far, please feel free to leave a comment or suggestion.