Data Science

Decoding Encoding

Say What?

Kartikaye Madhok

--

Data, or "digital oil" as the fancy people like to call it, is nothing like oil. It doesn't possess a high calorific value, nor can it undergo destructive distillation to provide me with Vaseline. Like oil, though, it does come in various categories and types, each with a specific use, each able to do things the others can't. Now that my crafty little introduction is done, the author is assuming that the reader knows the different data types. In case my mom is reading… data in the machine learning context exists in two major types: numeric and categorical. Numeric data deals with numbers: integers, decimals, fractions and so on. Categorical data deals with objects, strings, et al., basically letters and labels, and their various subtypes.

Machines, for as smart as they are, cannot understand categorical data; it has to be converted into numbers before they can process it. This conversion of categorical data into numeric data is a core part of feature engineering. If I were to say it in fancy words, feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.

Taking a wee bit of a stroll in the park, I'd like to talk about the machine learning workflow and the importance of feature engineering. You may choose to skip the next part.

Coming up with features is difficult, time-consuming, requires expert knowledge. ‘Applied machine learning’ is basically feature engineering.

Andrew Ng, Machine Learning and AI via Brain simulations

If feature engineering is done correctly, it increases the predictive power of machine learning algorithms by creating features from raw data that help facilitate the machine learning process. Feature Engineering is an art, and you as a data scientist are the artist.

The Workflow is as follows:

  1. Gathering data.
  2. Cleaning data.
  3. Feature Engineering.
  4. Defining model.
  5. Training, testing model and predicting the output.
Almost 80% of a data scientist's time is spent on DATA PREPARATION [Source]

And a large chunk of that cleaning and organising is encoding the data.

Just to run you through all the basics once more:

Types of Data:

  1. Categorical Data
  2. Numeric Data
  3. Time Series Data
  4. Text Data

Categorical Data

  1. Ordinal Data
  2. Nominal Data

Ordinal Data

[Image: chilies ranked by spiciness (Source)]

In ordinal data, we rank the categories according to their specifications. In the above example we can order the chilies by how spicy they are, but we cannot measure the distance between the levels of hotness: the difference between hot and hotter need not be the same as the difference between hotter and hottest, so we say the distance cannot be calculated.

other examples:

Low, Medium, High

Good, Average, Bad

High, Higher, Snoop Dogg.

Nominal Data

[Image: men and women as categories (Source)]

Nominal data cannot be ordered or measured. In the given example we cannot rank Men and Women; neither category is greater or lesser than the other.

TYPES OF ENCODINGS:

Yummy!

Dummies Encoding:

Dummies encoding creates dummy (indicator) variables for a categorical feature, with integer or float values. It creates new columns depending on how many categories are present in the given column: if there are 'n' categories in a column, it creates 'n' new columns.
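A minimal sketch with pandas (the 'gender' column is just an illustration):

import pandas as pd

data = pd.DataFrame({
'gender' : ['M', 'F', 'M', 'F', 'F']
})
# get_dummies creates one 0/1 indicator column per category ('gender_M', 'gender_F')
pd.get_dummies(data, columns=['gender'])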

One-Hot Encoding:

One-hot encoding works in much the same way as dummies. The difference is mostly in tooling: get_dummies is a pandas function, whereas one-hot encoders are provided by scikit-learn and by the category_encoders library used in the snippet below.

Here, we map each category to a vector that contains 1 and 0 denoting the presence of the feature or not. The number of vectors depends on the categories which we have in our dataset. For high cardinality features, this method produces a lot of columns that slows down the learning of the model significantly.

import pandas as pd
import category_encoders as ce

data = pd.DataFrame({
'gender' : ['M', 'F', 'M', 'F', 'F']
})
# create an object of the OneHotEncoder
ce_OHE = ce.OneHotEncoder(cols=['gender'])
# fit and transform and you will get the encoded data
ce_OHE.fit_transform(data)
Encoded Values using One Hot Encoding

There is a big drawback in using dummies and one-hot encoding: if the data has many categories, the dimensionality of the encoded data becomes huge, which leads to the curse of dimensionality. The remaining encoding techniques exist to overcome this.

Label Encoding (Ordinal Encoding):

Label encoding gives a unique integer value to each category in a column. When the data is ordinal, the values can be assigned according to the rank or order of the categories, which is why it is also called ordinal encoding.

The practical difference between label and ordinal encoding (in scikit-learn, for instance) is that label encoding works on only one feature at a time, whereas ordinal encoding can encode all the categorical features at the same time.

The major difference between one-hot encoding and label encoding is that one-hot encoding creates a separate column for each category, whereas label encoding creates different labels for the categories within the same column.
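A minimal sketch with scikit-learn (the 'size' column and its low/medium/high ordering are illustrative assumptions):

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

data = pd.DataFrame({
'size' : ['low', 'high', 'medium', 'low', 'high']
})
# LabelEncoder works on a single column at a time
data['size_label'] = LabelEncoder().fit_transform(data['size'])
# OrdinalEncoder handles several columns at once and lets you fix the order explicitly
oe = OrdinalEncoder(categories=[['low', 'medium', 'high']])
data['size_ordinal'] = oe.fit_transform(data[['size']]).ravel()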

At this point a question should arise in your mind: what if there are a large number of categories in a column of nominal data?

For instance, if there are 100 categories in a column, one-hot encoding would need to create 99 separate columns. Instead, we can use the one-hot encoding with multiple categories technique: we take only the most frequently repeated categories and perform one-hot encoding on those, so instead of creating 99 columns we create columns only for the top repeated categories.
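A rough sketch of the idea with plain pandas (the 'city' column and the choice of the top 10 categories are assumptions for illustration):

import pandas as pd

data = pd.DataFrame({'city': ['delhi', 'mumbai', 'delhi', 'pune', 'mumbai', 'goa']})
# pick the most frequent categories (top 10 here; adjust as needed)
top = data['city'].value_counts().nlargest(10).index
# create an indicator column only for each of the top categories
for cat in top:
    data['city_' + cat] = (data['city'] == cat).astype(int)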

Target/Mean Encoding:

Mean encoding is similar to label encoding, except that here the labels are correlated directly with the target: each category in the feature is replaced with the mean value of the target variable for that category, computed on the training data.

The advantages of the mean target encoding are that it does not affect the volume of the data and helps in faster learning.

Instead of the categorical values, the column now contains the mean of the target for each category.
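A minimal sketch with category_encoders (the 'color' and 'outcome' columns are illustrative; note that TargetEncoder applies some smoothing by default, so the output is not the exact raw mean):

import pandas as pd
import category_encoders as ce

data = pd.DataFrame({
'color' : ['Blue', 'Black', 'Black', 'Blue', 'Blue'],
'outcome' : [2, 1, 1, 1, 2]
})
# create an object of the TargetEncoder
ce_TE = ce.TargetEncoder(cols=['color'])
# fit and transform: each colour is replaced by a (smoothed) mean of the outcome
ce_TE.fit_transform(data['color'], data['outcome'])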

Count/Frequency Encoding:

In count/frequency encoding, each category is replaced by the count of how many times it appears in the column, or by its frequency (the count divided by the number of rows).

Instead of the categorical values, the column now contains the count (or frequency) of each category.
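A quick sketch with plain pandas (the column name is illustrative):

import pandas as pd

data = pd.DataFrame({'color': ['Blue', 'Black', 'Black', 'Blue', 'Blue']})
# replace each category with its count ...
counts = data['color'].value_counts()
data['color_count'] = data['color'].map(counts)
# ... or with its relative frequency
freqs = data['color'].value_counts(normalize=True)
data['color_freq'] = data['color'].map(freqs)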

Binary Encoding:

Binary encoding is used for categorical variables and is similar to one-hot, but it stores the categories as binary bit strings.

First, the categories are encoded as ordinal, then those integers are converted into binary code, then the digits from that binary string are split into separate columns.
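A minimal sketch with category_encoders (the 'class' column is illustrative):

import pandas as pd
import category_encoders as ce

data = pd.DataFrame({'class': ['a', 'b', 'c', 'd', 'a', 'b']})
# create an object of the BinaryEncoder
ce_bin = ce.BinaryEncoder(cols=['class'])
# fit and transform: each category becomes a few 0/1 columns, one per binary digit
ce_bin.fit_transform(data)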

BaseN Encoding:

In binary encoding, we convert the integers into binary, i.e. base 2. BaseN allows us to convert the integers using any base. So if you have something like city_name in your dataset, which could have thousands of categories, it is advisable to use BaseN, as it reduces the dimensionality even further than binary encoding.

You can set the parameter base. Here, I have encoded a sample dataset using base values 4 and 3.

# make some data
data = pd.DataFrame({
'class' : ['a', 'b', 'a', 'b', 'd', 'e', 'd', 'f', 'g', 'h', 'h', 'k', 'h', 'i', 's', 'p', 'z']})
# create an object of the BaseNEncoder
ce_baseN4 = ce.BaseNEncoder(cols=['class'],base=4)
# fit and transform and you will get the encoded data
ce_baseN4.fit_transform(data)
# create an object of the BaseNEncoder
ce_baseN3 = ce.BaseNEncoder(cols=['class'],base=3)
# fit and transform and you will get the encoded data
ce_baseN3.fit_transform(data)

Leave One Out Encoding:

This is very similar to target encoding but excludes the current row’s target when calculating the mean target for a level to reduce the effect of outliers.

data = pd.DataFrame({
'color' : ['Blue', 'Black', 'Black', 'Blue', 'Blue'],
'outcome' : [2, 1, 1, 1, 2]
})
# column to perform encoding on, and the target
X = data['color']
Y = data['outcome']
# create an object of the LeaveOneOutEncoder
ce_LOO = ce.LeaveOneOutEncoder(cols=['color'])
# fit and transform and you will get the encoded data
ce_LOO.fit_transform(X, Y)

Weight Of Evidence Encoding:

This is used for categorical data in classification problems. For each category, WOE is the natural logarithm (ln) of the probability of the target being 1 divided by the probability of the target being 0.

It is given as WOE = ln( p(Y=1) / p(Y=0) )

Probability Ratio Encoding:

It is similar to WOE, but the probability ratio doesn't take the natural log: it simply uses the probability of the target being 1 divided by the probability of the target being 0.

PRE = p(Y=1) / p(Y=0)

In both weight of evidence and probability ratio encoding, the categories are replaced by their ratio values.
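A minimal sketch, assuming a binary 0/1 target (the 'color' and 'target' columns are illustrative; note that category_encoders' WOEEncoder applies regularization by default, so its output will differ slightly from the raw formula):

import pandas as pd
import category_encoders as ce

data = pd.DataFrame({
'color' : ['Blue', 'Black', 'Black', 'Blue', 'Blue', 'Black'],
'target' : [1, 0, 1, 1, 0, 0]
})
# weight of evidence via category_encoders (requires a binary target)
ce_woe = ce.WOEEncoder(cols=['color'])
ce_woe.fit_transform(data['color'], data['target'])
# probability ratio computed by hand: p(Y=1) / p(Y=0) per category
p1 = data.groupby('color')['target'].mean()
ratio = p1 / (1 - p1)   # breaks down if a category has p(Y=1) of exactly 0 or 1
data['color_pr'] = data['color'].map(ratio)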

Rare Label Encoding:

Rare labels are those that appear only in a tiny proportion of the observations in a dataset. Rare labels may cause some issues, especially with overfitting and generalization.

The solution to that problem is to group those rare labels into a new category like other or rare — this way, the possible issues can be prevented.
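A rough sketch of the grouping with plain pandas (the 'city' column and the 20% threshold are assumptions; real datasets usually use a much smaller threshold):

import pandas as pd

data = pd.DataFrame({'city': ['delhi', 'mumbai', 'delhi', 'mumbai', 'delhi', 'goa']})
# compute the relative frequency of each category
freq = data['city'].value_counts(normalize=True)
# categories appearing in less than 20% of the rows are considered rare here
rare = freq[freq < 0.20].index
# lump all rare categories into a single 'Rare' label
data['city'] = data['city'].where(~data['city'].isin(rare), 'Rare')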

Wrapping up

These are some of the encoding techniques used most of the time. There are still some techniques not covered here, such as:

  1. Helmert Encoding
  2. Hash Encoding
  3. Effect or Sum or Deviation Encoding
  4. Backward Difference Encoding
  5. M-Estimator Encoding
  6. James-Stein Encoding
  7. Thermometer Encoding

CONCLUSION

I hope you learnt something new today. In this blog post we covered various techniques to deal with categorical variables in a dataset. You can always reach out to me through the comment section below!

Linkedin
