Reducing omitted variable bias with fixed effects regression models

Linear model solutions for when you suspect time-invariant confounders in your panel data

Ceren Altincekic
Data Science at Microsoft
9 min read · Oct 18, 2022


Introduction

In industry as well as in academia, it is commonplace to collect and build models on time series data with entities that have only a few attributes. For example, perhaps we have a series of sales data on customers or longitudinal visit data on patients. Due to a lack of additional data collection, confidentiality constraints, privacy regulations, or simply not having modeling in mind, we may end up with datasets that lack critical variables needed to explain a certain phenomenon. While this is not necessarily a problem if our goal is prediction, omitted variables can cause biased estimates in linear models when we try to understand the effects of our features on the model outcome. This is what we call omitted variable bias (OVB).

This article explains what OVB is and proposes a panel data estimation method, namely fixed effects regression modeling, to circumvent this issue.

Approaches to panel data

The kind of data that we have often dictates what data science methods we can apply to extract insights and provide predictions to improve business outcomes or drive decisions. At Microsoft, as in many retail, health, education, government, finance, and other industries, data is frequently structured along two dimensions: time and entity. The entity refers to the unit (e.g., customer, country, account) that we observe. The time dimension refers to observations on each entity collected over time. When our data contains only entities and no time dimension, we call it cross-sectional data. When it contains only the time dimension on a single entity, we refer to it as time series data. When the data structure contains both dimensions as rows, we refer to it as panel data, and it requires specific methods for analysis. Notice that these data structures are not about the columns of the data: It’s possible to have one or more columns (independent variables) in a dataset for any of the three structures. See Figure 1 for a visual representation.

Figure 1: Three types of datasets.

We can move from one level of analysis to another simply by aggregating the data in different ways. For example, the analyst can group data by entity and aggregate the features using one of the summary statistics (see the cross-sectional data in Figure 1). One drawback of this method is the loss of observations over time, reducing the number of rows and, as a result, the potential explanatory power of time-variant variables.
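For instance, here is a minimal sketch of that aggregation in pandas, using a toy dataframe with placeholder column names ("entity" and "sales" are illustrative assumptions, not from the article):

```python
import pandas as pd

# Toy panel: several observations per entity over time.
panel = pd.DataFrame({
    "entity": ["A", "A", "B", "B"],
    "sales": [10.0, 12.0, 7.0, 9.0],
})

# Collapse to cross-section: one row per entity, features summarized
# (here by the mean) across the time dimension.
cross_section = panel.groupby("entity", as_index=False)["sales"].mean()
print(cross_section)
```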

Another common method is to forecast each entity separately and independently (see the time series data in Figure 1). Similar to the cross-sectional analysis mentioned above, this method also allows independent variables to be included in the model, enabling feature importance calculations (i.e., the size of the effects of the factors used in the model to make a prediction). Typically, these time series forecasting methods (e.g., SARIMA, Prophet) forecast one entity at a time, potentially limiting the explanatory power of the full data.

These methods are excellent ways to learn from data if our goal is to predict an outcome. Using these, we can assess the accuracy of the models and deploy them if we are satisfied with the results. If we also want to learn about feature importance, there are established methods to do so, such as using Shapley values.¹ Shapley values are not within the scope of this article, but there are many resources online about them.

So, how might we ensure that we use the full richness of the dataset to gain the most explanatory power from our independent variables?

There is another linear methodology called panel data analysis with fixed or random effects. Panel data analysis adds dummy variables for each entity, which we call “fixed effects,” so that we can control for unknown or unseen entity-level factors for which we do not have data. This is a very useful statistical technique that alleviates omitted variable bias and, as a result, yields more robust standard errors and feature importance values.

Let’s first talk about what omitted variable bias is and then move on to how panel data analysis can solve the problem.

Omitted variable bias

Cross-sectional time series data is everywhere. Businesses accumulate data over time from their clients, medical professionals collect records at each visit for patients, and countries build time series measures of their development over the years. One way to analyze this type of data structure is to use fixed effects, or dummy variables, for each entity in the dataset to control for variables that we might not be able to observe or measure, such as culture, business practices, history, regulations, and more. In the absence of a complete dataset of information about our entities, this methodology can generate more robust feature importance values, as well as significantly reduce the omitted variable bias problem.

But before I talk in more detail about fixed effects, let’s understand what we mean by omitted variable bias.

In statistical analyses, this problem occurs when we omit from the model independent variables that have a meaningful effect on the dependent variable. This is acceptable if the omitted variable is uncorrelated with the other independent variables, because its effect is then adequately captured by the error term. When they are correlated, however, the error term becomes correlated with our independent variables, which biases our estimates. This bias is called omitted variable bias in linear regression.²

To illustrate, let’s say that we are trying to estimate the value of real estate. We have independent variables for square footage, number of bedrooms and bathrooms, view, condition, age, and more. However, our analyst did not collect data on the floor level of the units. Floor level is highly correlated with our dependent variable of value, and it is also correlated with regressors such as the size of the property or the view. This creates OVB and generates biased coefficients for the regressors in the model. The direction of the bias depends on the direction of the correlations among the omitted variable and the dependent and independent variables. In this example, if a higher floor level is desirable, the coefficient for view, which is highly correlated with floor level, will be biased upward to compensate for the fact that floor level is omitted.
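A quick simulation makes this concrete. The sketch below uses made-up data (all variable names and coefficients are illustrative assumptions, not real estate estimates): an omitted floor variable drives both view and price, so leaving it out of the regression biases the coefficient on view upward.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 10_000
floor = rng.normal(size=n)                 # the omitted variable
view = 0.7 * floor + rng.normal(size=n)    # regressor correlated with it
price = 2.0 * view + 3.0 * floor + rng.normal(size=n)

# Full model: both regressors included, coefficient on view is ~2.0
full = sm.OLS(price, sm.add_constant(np.column_stack([view, floor]))).fit()

# Short model: floor omitted, coefficient on view is biased upward (~3.4)
short = sm.OLS(price, sm.add_constant(view)).fit()

print(full.params[1], short.params[1])
```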

Figure 2: Direction of the bias.

How does fixed effects modeling work to correct for OVB?

Using fixed effects in the regression corrects for at least some of the OVB by introducing entity-level dummy variables that control for all entity-specific, time-invariant variation in the data without having to collect data on those variables. This in turn leads to less biased or unbiased estimates for the other regressors in our model. The model equation looks like this:

yᵢₜ = βxᵢₜ + αᵢ + εᵢₜ

where:

y is the dependent variable,

i refers to entities,

t refers to the time dimension,

x is the independent variable,

β is the coefficient for x,

ε is the error term, and

αᵢ refers to the “fixed effects,” or entity-specific and time-invariant components (e.g., country, industry, and company-specific factors that we don’t know about and can’t otherwise control for).

Mathematically, this is equivalent to including a dummy variable for each of the entities. It is important to control for these unobserved, entity-specific factors because their absence from the model may significantly bias the causal inferences we are trying to draw.
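To see the equivalence, here is a minimal sketch of the “least squares dummy variable” (LSDV) formulation on a toy dataframe ("entity", "x", and "y" are placeholder names, not from the article):

```python
import pandas as pd
import statsmodels.api as sm

# Toy panel data with placeholder column names.
df = pd.DataFrame({
    "entity": ["A", "A", "B", "B"],
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [1.5, 2.4, 4.1, 5.2],
})

# LSDV: one 0/1 indicator column per entity stands in for the αᵢ term.
dummies = pd.get_dummies(df["entity"], prefix="entity", drop_first=True, dtype=float)
X = sm.add_constant(pd.concat([df[["x"]], dummies], axis=1))
lsdv = sm.OLS(df["y"], X).fit()

# The coefficient on x matches the within (fixed effects) estimator.
print(lsdv.params["x"])
```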

Using this method is most appropriate when we are interested in the impact of variables over time. The time-invariant variables will be absorbed in the entity dummies and will not generate any coefficients after the model is run. The key assumption here is that there is an entity-level factor that may bias our results and that we therefore need to control for. This assumption almost always holds true for the companies, individuals, countries, or other entities we often analyze.

Implementation of a simple fixed effects model with linearmodels in Python

The linearmodels package in Python is an excellent implementation of panel data-specific models. It adds to the statsmodels package some of the most widely used panel data methods, such as fixed effects, random effects, first difference, and two-stage least squares. The package simplifies the implementation of these models by accepting any Pandas dataframe with a MultiIndex composed of entities and time. A one-line way to set your dataset into the panel data format ready for modeling is as follows:
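Here is a minimal sketch on a toy dataframe ("entity", "date", "x", and "y" are placeholder column names, not from the article):

```python
import pandas as pd

# Toy panel: two entities observed over four months.
dates = pd.to_datetime(["2022-01-01", "2022-02-01", "2022-03-01", "2022-04-01"])
df = pd.DataFrame({
    "entity": ["A"] * 4 + ["B"] * 4,
    "date": list(dates) * 2,
    "x": [1.0, 2.0, 3.0, 4.0, 2.0, 3.0, 4.0, 5.0],
    "y": [1.2, 2.1, 3.3, 4.0, 3.1, 4.2, 5.0, 6.1],
})

# linearmodels expects a MultiIndex of (entity, time) on the dataframe.
df = df.set_index(["entity", "date"])
```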

After setting the index, we fit the model as follows:
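A sketch, continuing with the toy df indexed above ("x" and "y" remain placeholder column names):

```python
import statsmodels.api as sm
from linearmodels.panel import PanelOLS

exog = sm.add_constant(df[["x"]])  # add an intercept to the design matrix
model = PanelOLS(
    df["y"],
    exog,
    entity_effects=True,   # one fixed effect (dummy) per entity
    drop_absorbed=True,    # drop regressors absorbed by the entity dummies
)
fe_res = model.fit()
print(fe_res)
```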

Note that we are adding a constant to the model equation using the statsmodels add_constant method. We use PanelOLS to run a fixed effects model. Fixed effects are generated when the entity_effects parameter is set to True. The drop_absorbed parameter drops the time-invariant variables, which are absorbed by the fixed effects dummies.

An alternative: Random effects

To recap, we decided to use fixed effects, or dummy variables for each entity in the model, because we think the entity-level variables are correlated with other, time-variant independent variables in the model. But what if we lived in a slightly more optimistic world where those correlations did not exist? If we were in that world, and we thought it would be useful to include those entity-level effects in the model, then we could turn to the random effects model.

In this case, our time-invariant variables are also estimated and get coefficients after the model is run, unlike the fixed effects option. This is very useful if we think that our customers’ individual characteristics that do not typically change over time (like a company’s sector, a person’s race, or a customer’s preferred store) are important factors explaining our target variable (like services, health outcomes, purchases, and more). But this works only if the above assumption holds, specifically that there is no correlation between these variables and our other independent variables.

The linearmodels package has a random effects method that can easily be implemented as follows:
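A minimal sketch, reusing the indexed df and placeholder columns from above:

```python
import statsmodels.api as sm
from linearmodels.panel import RandomEffects

exog = sm.add_constant(df[["x"]])
re_res = RandomEffects(df["y"], exog).fit()
print(re_res)
```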

Selecting between fixed effects and random effects

How can we determine whether we can use random effects or whether we need fixed effects? Fortunately, Hausman³ developed a statistical test of whether the random effects assumption holds. We can implement this test in Python and conclude whether the entity-level factors are correlated with the other independent variables. If we reject the null hypothesis, we conclude that we need fixed effects, because we rejected the premise that the entity effects are uncorrelated with the regressors.
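To my knowledge, linearmodels does not ship a ready-made Hausman test, but the statistic can be computed by hand from the two fitted results. A sketch, assuming fe_res and re_res are the fixed effects and random effects fits from the sketches above:

```python
import numpy as np
from scipy import stats

# Hausman statistic: compare the coefficients shared by the two fits.
common = fe_res.params.index.intersection(re_res.params.index)
diff = fe_res.params[common] - re_res.params[common]
cov_diff = fe_res.cov.loc[common, common] - re_res.cov.loc[common, common]
stat = float(diff @ np.linalg.inv(cov_diff) @ diff)
p_value = stats.chi2.sf(stat, df=len(common))

# A small p-value rejects the null that random effects is consistent,
# so we fall back to fixed effects.
print(stat, p_value)
```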

More often than not, we will find that we need fixed effects to control for entity-level variation in the model. Even though we cannot include entity-specific variables in the fixed effects case, we can go a long way toward correcting omitted variable bias and obtain more consistent estimates. This, in turn, makes us more confident in the feature importance values we report to stakeholders and helps us make more accurate business recommendations.

Conclusion

In this article, I have outlined common methods for working with longitudinal cross-sectional datasets, also known as panel datasets, which are ubiquitous in business. If we have a dataset consisting of customer behaviors over time with a limited number of attributes, we may get more accurate feature importance values by running a fixed effects model that controls for potentially omitted variables at the customer level. When business stakeholders ask for model explanations and why we arrived at certain results, fixed effects can be a handy tool to explain our findings and provide more accurate insights that our stakeholders can act on.

Ceren Altincekic is on LinkedIn.

Further Reading

Imai, Kosuke, and In Song Kim. 2019. “When Should We Use Unit Fixed Effects Regression Models for Causal Inference with Longitudinal Data?” American Journal of Political Science 63. https://doi.org/10.1111/ajps.12417. https://imai.fas.harvard.edu/research/files/FEmatch.pdf

Citations

[1] For more information on SHAP and its use cases, please see: https://shap.readthedocs.io/en/latest/index.html

[2] For a more detailed explanation, please see https://www.econometrics-with-r.org/6-1-omitted-variable-bias.html

[3] Hausman, J. A. 1978. “Specification tests in econometrics.” Econometrica 46: 1251–1271. https://doi.org/10.2307/1913827.
