Response Coding for Categorical Data

Dharmik Thakkar
2 min read · Jul 5, 2019


While solving a machine learning classification problem, one comes across several types of data: numerical, categorical and text. Most learning algorithms operate on numbers. Numerical data can be used more or less as is, but categorical and text data must first be converted into a numerical representation using special methods. In this article we will talk about one such technique for representing categorical data: response coding.

What is Categorical Data?

In layman’s terms, categorical data is data that can be divided into groups. For example, human blood type is a categorical variable that takes the values A, B, AB or O. Other examples of categorical data are race, gender and age group.

What is Response Coding?

It is a technique to represent categorical data while solving a machine learning classification problem. For each category, we compute the probability that a data point belongs to a particular class given that category. So for a K-class classification problem, each categorical feature yields K new features, which hold the probability of the data point belonging to each class given its category value. Mathematically speaking, we calculate:

P(class=X | category=A) = P(category=A ∩ class=X) / P(category=A)
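To make the formula concrete, here is a minimal sketch that estimates each probability from counts. The `samples` list and the `conditional_probability` helper are hypothetical, made up for illustration, and are not part of the original article.

```python
from fractions import Fraction

# Hypothetical toy data (an assumption for illustration): (category, class) pairs.
samples = [("A", 1), ("A", 0), ("A", 1), ("B", 0), ("B", 0)]

def conditional_probability(samples, category, label):
    """Estimate P(class=label | category=category) from raw counts."""
    n_total = len(samples)
    n_category = sum(1 for c, _ in samples if c == category)
    n_joint = sum(1 for c, y in samples if c == category and y == label)
    if n_category == 0:
        raise ValueError(f"category {category!r} does not occur in the data")
    p_joint = Fraction(n_joint, n_total)        # P(category ∩ class)
    p_category = Fraction(n_category, n_total)  # P(category)
    return p_joint / p_category

print(conditional_probability(samples, "A", 1))  # 2/3
```

Note that the total number of data points cancels out, so the estimate reduces to the number of points with category A and class X divided by the number of points with category A.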

Consider the following example of a categorical dataset which contains values for the variable ‘state’ and the corresponding binary class label.

As part of response coding, first we compute a response table to represent the number of data points belonging to each output class for a given category.
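A response table of this kind could be built as in the sketch below. The rows are hypothetical toy data with made-up state names (not the dataset used in this article), and `build_response_table` is an illustrative helper name.

```python
from collections import Counter, defaultdict

# Hypothetical toy data (an assumption for illustration): (state, class_label) pairs.
rows = [
    ("A", 0), ("A", 1), ("A", 1),
    ("B", 0), ("B", 0),
    ("C", 0), ("C", 1), ("C", 1), ("C", 1),
]

def build_response_table(rows):
    """Count, for every category, how many data points fall in each class."""
    table = defaultdict(Counter)
    for category, label in rows:
        table[category][label] += 1
    return table

response_table = build_response_table(rows)

# Print one line per state: state, count of class 0, count of class 1.
for state in sorted(response_table):
    counts = response_table[state]
    print(state, counts[0], counts[1])
```

A `Counter` returns 0 for a class that never occurs with a category, so state B still gets a valid row even though it has no data points of class 1.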

Once we have the response table, we encode this information by adding as many features to the dataset as there are class labels. Each new feature holds the probability that a data point with the given category belongs to a particular class.
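The encoding step might look like the following sketch. The data, the helper names `fit_response_table` and `response_encode`, and the uniform fallback for categories that were never seen in training are all assumptions made for illustration; the article does not specify how unseen categories are handled.

```python
from collections import Counter

# Hypothetical training data (an assumption for illustration): (state, class_label).
train_rows = [
    ("A", 0), ("A", 1), ("A", 1),
    ("B", 0), ("B", 0),
    ("C", 0), ("C", 1), ("C", 1), ("C", 1),
]
classes = [0, 1]  # K = 2 class labels, so each state becomes 2 features

def fit_response_table(rows):
    """Build the response table: per-category counts of each class."""
    table = {}
    for category, label in rows:
        table.setdefault(category, Counter())[label] += 1
    return table

def response_encode(category, table, classes):
    """Return the K features [P(class=k | category)] for one category value."""
    counts = table.get(category)
    if counts is None:
        # Assumption: an unseen category gets equal probability for every class.
        return [1.0 / len(classes)] * len(classes)
    total = sum(counts.values())
    return [counts[label] / total for label in classes]

table = fit_response_table(train_rows)
for state in ["A", "B", "C", "D"]:  # "D" does not appear in the training data
    print(state, response_encode(state, table, classes))
```

Because the features are derived from class labels, it is generally safer to build the response table from the training split only and reuse it for validation and test data, so that label information does not leak into evaluation.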

References

  1. Applied AI Course: https://tinyurl.com/yxksmq8h
  2. Categorical Variable: https://en.wikipedia.org/wiki/Categorical_variable
  3. http://www.stat.yale.edu/Courses/1997-98/101/catdat.htm
