Categorical Variable Regression using R

Sharma Kaushambi
Published in Analytics Vidhya · Mar 11, 2021

Variables that classify observations into categories are called categorical variables (also known as factors or qualitative variables). They take on a limited number of different values, called levels. The gender of a person, for example, is a categorical variable that can take on two levels: male or female.

In statistics, a categorical variable is a variable that assigns each person or other unit of observation to a particular group or nominal category on the basis of some qualitative property, taking one of a finite and typically fixed number of possible values. Regression analysis works with numerical variables, so when a researcher wants to include a categorical variable in a regression model, additional steps are needed to make the results interpretable.

A categorical variable that can take exactly two values is referred to as a binary variable or a dichotomous variable; the Bernoulli variable is an important special case. Categorical variables with more than two possible values are called polytomous variables; categorical variables are usually assumed to be polytomous unless stated otherwise. In these additional steps, the categorical variable is recoded into a collection of separate binary variables. This recoding is referred to as “dummy coding” and produces a table called the contrast matrix.

Dummy Coding

As a classifying device, dummy variables divide the whole sample into groups based on their characteristics and implicitly allow separate regressions to be run for each subgroup. Dummy coding encodes group membership with variables that take on the values 0 and 1: membership in a specific group is coded as 1, while non-membership is coded as 0. The assignment of the values 1 and 0 to the groups is, however, arbitrary. The category that is given the value zero is variously referred to as the base, benchmark, control, comparison, reference or excluded category. To avoid the dummy variable trap, the number of dummy variables must be one less than the number of categories of each qualitative variable. The coefficients attached to the dummy variables must always be interpreted relative to the base or reference category, which is assigned the value zero. If a model has several qualitative variables with multiple classes, adding dummy variables can consume a significant number of degrees of freedom.
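
To make this concrete, here is a minimal sketch of dummy coding with a toy three-level factor (the variable and level names are illustrative):

grp <- factor(c("A", "B", "C", "A"))  # a toy factor with three levels

contrasts(grp)       # the contrast matrix: one dummy column per non-baseline level

model.matrix(~ grp)  # the design matrix that lm() builds from the factor

Note that the contrast matrix has only two dummy columns for the three levels, which is exactly the “one less than the number of categories” rule that avoids the dummy variable trap.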

Dummy explanatory variables in the regression model are denoted by the symbol ‘D’ rather than the usual symbol ‘X’ to emphasize that we are dealing with qualitative variables. Using dummy coding, the regression model involving one qualitative variable with k groups or classes as an independent variable can be written as

Yij = B0 + B1*Di1 + B2*Di2 + … + B(k-1)*Di(k-1) + eij

where

Yij= The score on the dependent variable for subject i in group j

B0= The intercept; it represents the mean of the group coded 0 on all the dummy variables

k= The number of categories or classifications of the qualitative independent variable

Bj= The regression coefficient associated with the jth group; it represents the difference between the mean of the group coded 1 on the corresponding dummy variable and the mean of the group coded 0 on all the dummy variables

Dij= The value of the dummy variable for the ith subject in the jth group

eij= The error associated with the ith subject in the jth group
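
As a quick check of this interpretation, here is a minimal sketch that fits the model on simulated data (the group labels and means are made up for illustration) and compares the fitted coefficients with the raw group means:

set.seed(1)  # hypothetical data: three groups with different means
y <- c(rnorm(5, mean = 10), rnorm(5, mean = 15), rnorm(5, mean = 20))
g <- factor(rep(c("g0", "g1", "g2"), each = 5))

coef(lm(y ~ g))     # intercept = mean of baseline group g0; gg1 and gg2 = differences from g0

tapply(y, g, mean)  # the raw group means, for comparison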

Categorical variables with two levels

We will use the Salaries dataset from the car package, which contains the nine-month academic salaries of assistant professors, associate professors and professors. To predict an outcome variable (y) on the basis of a predictor variable (x), the regression equation can be written simply as y = b0 + b1*x. The regression beta coefficients are b0 and b1, representing the intercept and the slope, respectively. Suppose we want to study the salary differences between men and women.

library(tidyverse)

data("Salaries", package = "car")

sample_n(Salaries, 3)

We can construct a new dummy variable based on the gender variable which takes the value:

  • 1 if a person is male
  • 0 if a person is female

And this variable is used in the regression equation as a predictor, leading to the following model:

  • b0 + b1 if a person is male
  • b0 if a person is female

You should interpret the coefficients as follows:

  • b0 is the average salary among females,
  • b0 + b1 is the average salary among males,
  • and b1 is the average difference in salary between males and females.

The following example models the wage gap between males and females for demonstration purposes by fitting a simple linear regression model on the Salaries dataset from the car package. R automatically constructs the dummy variables:

model <- lm(salary ~ sex, data = Salaries)

summary(model)$coef
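
To see that this coefficient really is the difference in group means, we can compute the average salary by sex directly (a quick check using dplyr, which is loaded with the tidyverse):

Salaries %>%
  group_by(sex) %>%
  summarise(mean_salary = mean(salary))  # the difference between these two means matches the sex coefficient above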

The function contrasts() returns the coding used by R to construct the dummy variables:

contrasts(Salaries$sex)

R will create a dummy variable that takes the value 1 if the sex is male and 0 otherwise. The decision to code males as 1 and females as 0 (the baseline) is arbitrary and does not affect the regression computation, but it does change the interpretation of the coefficients.

You can set Male as the reference (baseline) level using the relevel() function as follows:

Salaries <- Salaries %>%
  mutate(sex = relevel(sex, ref = "Male"))

model <- lm(salary ~ sex, data = Salaries)

summary(model)$coef

Alternatively, instead of a 0/1 coding scheme, we could code the dummy variable as -1 (male) / 1 (female). In the model, this results in:

  • b0 - b1 if a person is male
  • b0 + b1 if a person is female

Thus, if the categorical variable is coded as -1 and 1, a positive regression coefficient is subtracted for the group coded -1 and added for the group coded 1; if the coefficient is negative, the addition and subtraction are reversed.
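
Here is a minimal sketch of this -1/1 coding in R, assuming sex has been releveled so that Male is the first level as above; the contrast matrix is supplied explicitly:

Salaries2 <- Salaries

contrasts(Salaries2$sex) <- matrix(c(-1, 1), ncol = 1)  # first level (Male) = -1, second (Female) = 1

model2 <- lm(salary ~ sex, data = Salaries2)

summary(model2)$coef

With this coding, b0 is the average of the two group means rather than the mean of a baseline group, and b1 is half the difference between the female and male means.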

For categorical variables with a large number of levels, grouping some of the levels together could be helpful.
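
For instance, here is a minimal sketch that collapses the three academic ranks in the Salaries data into two broader groups using fct_collapse() from forcats (loaded with the tidyverse), assuming the rank levels are "AsstProf", "AssocProf" and "Prof"; the grouping itself is just for illustration:

Salaries %>%
  mutate(rank2 = fct_collapse(rank,
                              Professor = "Prof",
                              Other = c("AsstProf", "AssocProf"))) %>%
  lm(salary ~ rank2, data = .)  # the collapsed factor now contributes a single dummy variable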
