How to Interpret the Odds Ratio with Categorical Variables in Logistic Regression

An explanation of reference categories and picking the right one

Lucy Dickinson
Towards Data Science


Logistic regression is a very popular machine learning model that has been the focus of many articles and blogs. Whilst there are some fantastic examples with relatively simple data, I struggled to find a comprehensive article that tackled using categorical variables as features.

Why is one category of each variable omitted from the statistical output? How do I interpret the regression coefficient, that is, the ‘log odds ratio’, of a specific category of a variable?

I’ll aim to demystify this by covering the following topics:

· A recap of dummy variables

· What is the reference category and how to pick one

· How to interpret the odds ratio with dummy variables

I’ll be using Python and the statsmodels package (with its nice statistical output) and an open dataset of NYC Citibike trips, which offers plenty of derived categorical variables to play with. From the trip duration feature, I’ve generated a boolean target variable denoting whether a bike trip exceeded 20 minutes, which I’ve arbitrarily decided counts as a ‘long’ trip. We’ll be looking at which categorical features significantly increase or decrease the odds of a bike trip exceeding 20 minutes.

Let’s get started!

Handling categorical data — hello dummy variables

Data variables can be either continuous (measured values between theoretical min and max, e.g. age, weight) or categorical/discrete (fixed values or taxonomies, e.g. weekday, gender). Categorical data cannot be directly used in a machine learning algorithm, so pre-processing needs to occur.

Categorical variables can be transformed into numeric dummy variables, a format that models can work with directly. Each category becomes its own binary feature (column), indicating the absence or presence of that category in each row of data. Two popular methods are pandas’ get_dummies() and scikit-learn’s OneHotEncoder(). I’ll be focussing on the pd.get_dummies method in this post.
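
As a quick illustration with made-up data (not the real Citibike trips), pd.get_dummies expands a categorical column into one binary column per category:

import pandas as pd

# Toy example: one categorical feature with a handful of made-up rows
df_toy = pd.DataFrame({"weekday": ["Monday", "Saturday", "Sunday", "Monday"]})

# Each category becomes its own 0/1 column
dummies_toy = pd.get_dummies(df_toy["weekday"], prefix="weekday").astype(int)
print(dummies_toy)
#    weekday_Monday  weekday_Saturday  weekday_Sunday
# 0               1                 0               0
# 1               0                 1               0
# 2               0                 0               1
# 3               1                 0               0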

To avoid multicollinearity (if all the dummies for a feature are kept, they sum to 1 in every row, so any one of them is perfectly predictable from the others), most models require that one dummy variable per feature is dropped from your data. How does this work? Well, each row must belong to exactly one category of a feature (providing you have dealt with missing values!), so if all of the remaining dummy variables in a row are 0, the dropped category must be the one present. One dummy variable is therefore redundant and can be dropped.

So how do we drop a dummy variable?

The ‘drop_first’ argument might not be your friend

I learnt to always use the ‘drop_first=True’ argument when creating dummy variables using pd.get_dummies(). It is one of those concepts that online learning platforms cover but never seem to dive into the detail of what is actually happening when you use it.

I noticed that I was not able to pick which category was dropped. By default, it always removed the category which came first alphabetically. This defies all data science logic — surely the category matters and there must be some data-driven logic to pick which one does not make the final cut?
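
Continuing the toy example from above, you can see that drop_first=True simply drops whichever category sorts first alphabetically (here ‘Monday’), and that category silently becomes the reference:

# Same toy data as before: drop_first=True removes the alphabetically first
# category ('Monday'), which becomes the reference category by default
dummies_toy = pd.get_dummies(df_toy["weekday"], prefix="weekday", drop_first=True).astype(int)
print(dummies_toy.columns.tolist())
# ['weekday_Saturday', 'weekday_Sunday']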

Let us cover that first question. Why does the dropped category matter?

Reference categories

The category that is dropped is known as the reference category. This category is the one that all other categories will be compared with when you interpret the results of your model.

With this in mind, it is worth thinking back to the questions you are trying to answer by fitting a logistic regression model to your data. What are you trying to determine about the chances of an event occurring, in our case, whether a bike trip will exceed 20 minutes or not? Are you interested in the effect of a particular category, relative to the other categories, such as a specific day of the week? (More on the interpretation later!)

As such, the reference category should be the one that makes interpretation easiest and that you care about the most. A common choice is the category with the largest representation in your data, i.e. the one that represents the ‘norm’. Equally, it could be a category with an unexpected presence in your data.

Visualising your data may help you decide. A simple bar chart of the frequencies or percentages of the feature categories is useful to gain an idea of their distribution within your data. It will also reveal any interesting or unexpected patterns that you might want to investigate.

Now let us consider our data of NYC bike trips. I have derived additional features from the start time of each bike trip: ‘weekday’ with seven categories representing the day of the week, and ‘daytype’ denoting whether the trip occurred on a weekday or weekend. You might expect weekends (Saturday and Sunday) to be the days with longer bike trips for those who use these bikes to sightsee. Indeed, our exploratory visualisation below reveals that Saturdays seem to host a slightly larger proportion of bike trips exceeding 20 minutes.

Image by author, created in matplotlib.
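
As a rough sketch of how a chart like this might be produced (assuming the trips live in a dataframe called df with a ‘weekday’ column and a boolean ‘long_trip’ target, both of which are names I am assuming here):

import matplotlib.pyplot as plt

days = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Proportion of trips exceeding 20 minutes for each day of the week
proportions = df.groupby("weekday")["long_trip"].mean().reindex(days)

proportions.plot(kind="bar")
plt.ylabel("Proportion of trips over 20 minutes")
plt.show()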

If your research question is whether the day of the week has an effect on a bike trip being over 20 minutes, you could set the reference category to ‘weekend’ to test the suspicion that the odds of a bike trip exceeding 20 minutes on a weekday are significantly lower than the odds of a trip exceeding 20 minutes on the weekend. Alternatively, you could set your reference category to a particular day of the week to assess how the other days influence the odds relative to the day you selected as the reference category.

Some other strategies for picking the reference category can be found here.

Manually picking the right reference category

So, we have established that being able to choose which category you use as the reference is pretty important. How do we get round that pesky drop_first default and choose our own?

Fortunately, there are some tools you can use to pick the right one to drop. The code below shows a quick example function that finds the category with the highest value count for each feature; these categories can then be dropped from your data and used as the references. With the same approach, you could modify the function to select reference categories based on attributes other than being the most common in the dataset.

Code by author
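
The original snippet is not reproduced here, but a minimal sketch of such a function might look like the following (assuming the raw categorical columns live in a dataframe called df, and using the feature names listed later in this post):

def get_reference_categories(df, categorical_columns):
    """Return the most frequent category of each column, formatted to match
    the dummy column names created by pd.get_dummies (e.g. 'weekday_Wednesday')."""
    references = []
    for col in categorical_columns:
        most_common = df[col].value_counts().idxmax()
        references.append(f"{col}_{most_common}")
    return references

categorical_columns = ["user_type", "gender", "daytype", "under30years", "weekday"]
reference_categories = get_reference_categories(df, categorical_columns)
print(reference_categories)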

Great, now we have a list of the most common categories. Next, we can drop these from the dataframe of dummy variables.

Code by author
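
Again as a sketch, continuing from the function above (creating the dummies without drop_first so we stay in control of what gets dropped):

# Create dummy variables for every category, then drop the chosen
# reference categories ourselves
dummies = pd.get_dummies(df[categorical_columns]).astype(int)
dummies = dummies.drop(columns=reference_categories)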

Great! Now we have our data prepared with dummy variables and we know which are the reference categories. We are ready to run a logistic regression!
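
Here is a minimal sketch of the model fit with statsmodels, continuing from the dummies above and assuming the boolean target lives in a column I am calling long_trip:

import statsmodels.api as sm

# 'long_trip' is an assumed column name: 1 if the trip exceeded 20 minutes
y = df["long_trip"].astype(int)
X = sm.add_constant(dummies)   # add an intercept term

results = sm.Logit(y, X).fit()
print(results.summary())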

Onto the odds ratio

I have already lightly touched on the interpretation of the odds ratio, but let us dive a bit deeper. It is helpful here to look at the output from statsmodels after we fit a logistic regression model to our data.

statsmodels logistic regression output, image by author.

Here, you can see all the features listed on the left-hand side, including the dummy variables (with the reference categories omitted!) and their corresponding statistics. Let us focus on the coefficient (coef) and p-value (P>|z|) in the first and fourth columns, respectively.

The coefficient represents the log-odds ratio. Whilst I won’t go into great detail here, as this post has a fantastic explanation, it is worth having a high-level summary.

The odds are the probability of something happening divided by the probability of it not happening, denoted as p/(1-p). If the probability of a bike trip exceeding 20 minutes is 25%, the odds are 0.25/(1-0.25) = 0.33. In other words, for every bike trip that exceeds 20 minutes, roughly three do not (odds of 1 to 3).
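
In code, that calculation is just:

p = 0.25              # probability of a trip exceeding 20 minutes
odds = p / (1 - p)    # 0.33, i.e. odds of roughly 1 to 3
print(round(odds, 2))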

The odds ratio is the ratio or comparison between two odds to see how they change given a different situation or condition. The odds ratio for a feature is a ratio of the odds of a bike trip exceeding 20 minutes in condition 1 compared with the odds of a bike trip exceeding 20 minutes in condition 2.

An odds ratio greater than 1 indicates the event is more likely to occur in that category than in the reference category, whilst an odds ratio less than 1 indicates it is less likely. (Equivalently, a positive coefficient corresponds to an odds ratio above 1, and a negative coefficient to an odds ratio below 1.)

Note that the coefficient is the log-odds ratio. The ‘log’ part is simply the logarithm of the odds ratio, because logistic regression models the log of the odds as a linear function of the predictors. Odds ratios are much easier to interpret, so we take the exponential (np.exp()) of the log-odds ratio to get the odds ratio.

For categorical features or predictors, the odds ratio compares the odds of the event occurring for each category of the predictor relative to the reference category, given that all other variables remain constant.
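
Continuing the sketch from earlier, converting the fitted coefficients to odds ratios is a one-liner:

import numpy as np

# Exponentiate the log-odds ratios (coefficients) to get odds ratios,
# and do the same for their 95% confidence intervals
odds_ratios = np.exp(results.params)
conf_int = np.exp(results.conf_int())
print(odds_ratios.sort_values(ascending=False))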

Ok, with the theory done, let us look at a few examples to see how this works in practice.

Note. Our reference categories are the most common per feature, which in our example turned out to be:

  • [user_type] = ‘Subscriber’ (the other option is ‘customer’, which I’m assuming means a non-subscriber who pays per ride)
  • [gender] = ‘male’
  • [daytype] = ‘Weekday’
  • [under30years] = ‘over30’
  • [weekday] = ‘Wednesday’

The odds of a ride exceeding 20 minutes are roughly 5 times as high for a customer as for a subscriber, if all other variables are constant, given an odds ratio of exp(1.66) = 5.28. (Maybe customers are making the most out of their ride given they’re not regular paying subscribers!)

The odds of a ride exceeding 20 minutes are 37% higher for a female rider compared with a male rider, if all other variables are constant, given an odds ratio of exp(0.32) = 1.37.

The odds of a ride exceeding 20 minutes are about 7% lower if you are under 30 years of age compared with someone who is over 30, if all other variables are constant, given an odds ratio of exp(-0.07) = 0.93.

The odds of a ride exceeding 20 minutes barely change between a weekday and the weekend, if all other variables are constant, given an odds ratio of exp(0.002) = 1.002.

So, there we have it! I hope this has helped explain a little more about using categorical features in logistic regression and that 1) you should care about which reference category you use, 2) you have the power to choose and 3) odds ratios are not so bad to grasp with categorical variables.

You can see my full code in my GitHub repository here. Thanks for reading :)
