AIKISS
Published in

AIKISS

Pandas-Categorical and Continuous values encoding.

Encoding continuous and categorical values as dummies.

Photo by Iñigo De la Maza on Unsplash

Categorical and Continuous Values

Neural networks require their input to be a fixed number of columns. This input format is very similar to spreadsheet data. This input must be entirely numeric.

It is essential to represent the data in a way that the neural network can train from it. In class 6, we will see even more ways to preprocess data. For now, we will look at several of the most basic ways to transform data for a neural network.

Character Data (strings)

  • Nominal — Individual discrete items, no order. For example, color, zip code, shape.
  • Ordinal — Individual distinct items have an implied order. For example grade level, job title, Starbucks(tm) coffee size (tall, vente, grande)

Numeric Data

  • Interval — Numeric values, no defined start. For example, temperature. You would never say, “yesterday was twice as hot as today.”
  • Ratio — Numeric values, clearly defined start. For example, speed. You would say that “The first car is going twice as fast as the second.”

Encoding Continuous Values

One common transformation is to normalize the inputs. It is sometimes valuable to normalization numeric inputs to be put in a standard form so that the program can easily compare these two values. Consider if a friend told you that he received a 10 dollar discount. Is this a good deal? Maybe. But the cost is not normalized. If your friend purchased a car, then the discount is not that good. If your friend bought dinner, this is an excellent discount!

Encoding Categorical Values as Dummies

The traditional means of encoding categorical values is to make them dummy variables. This technique is also called one-hot-encoding. Consider the following data set.

There are four unique values in the areas column. To encode these to dummy variables, we would use four columns, each of which would represent one of the areas. For each row, one column would have a value of one, the rest zeros. For this reason, this type of encoding is sometimes called one-hot encoding. The following code shows how you might encode the values “a” through “d.” The value A becomes [1,0,0,0] and the value B becomes [0,1,0,0].

To encode the “area” column, we use the following. Note that it is necessary to merge these dummies back into the data frame.

Usually, you will remove the original column (‘area’), because it is the goal to get the data frame to be entirely numeric for the neural network.

Target Encoding for Categoricals

Target encoding can sometimes increase the predictive power of a machine learning model. However, it also dramatically increases the risk of overfitting. Because of this risk, you must take care if you are using this method. Target encoding is a popular technique for Kaggle competitions.

Generally, target encoding can only be used on a categorical feature when the output of the machine learning model is numeric (regression).

The concept of target encoding is straightforward. For each category, we calculate the average target value for that category. Then to encode, we substitute the percent that corresponds to the category that the categorical value has. Unlike dummy variables, where you have a column for each category, with target encoding, the program only needs a single column. In this way, target coding is more efficient than dummy variables

Rather than creating dummy variables for “dog” and “cat,” we would like to change it to a number. We could use 0 for cat, 1 for dog. However, we can encode more information than just that. The simple 0 or 1 would also only work for one animal. Consider what the mean target value is for cat and dog.

The danger is that we are now using the target value for training. This technique will potentially lead to overfitting. The possibility of overfitting is even greater if there are a small number of a particular category. To prevent this from happening, we use a weighting factor. The stronger the weight, the more than categories with a small number of values will tend towards the overall average of y. You can perform this calculation as follows.

Encoding Categorical Values as Ordinal

Typically categoricals will be encoded as dummy variables. However, there might be other techniques to convert categoricals to numeric. Any time there is an order to the categoricals, a number should be used. Consider if you had a categorical that described the current education level of an individual.

  • Kindergarten (0)
  • First Grade (1)
  • Second Grade (2)
  • Third Grade (3)
  • Fourth Grade (4)
  • Fifth Grade (5)
  • Sixth Grade (6)
  • Seventh Grade (7)
  • Eighth Grade (8)
  • High School Freshman (9)
  • High School Sophomore (10)
  • High School Junior (11)
  • High School Senior (12)
  • College Freshman (13)
  • College Sophomore (14)
  • College Junior (15)
  • College Senior (16)
  • Graduate Student (17)
  • PhD Candidate (18)
  • Doctorate (19)
  • Post Doctorate (20)

The above list has 21 levels. This would take 21 dummy variables. However, simply encoding this to dummies would lose the order information. Perhaps the easiest approach would be to assign simply number them and assign the category a single number that is equal to the value in parenthesis above. However, we might be able to do even better. Graduate student is likely more than a year, so you might increase more than just one value.

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store