Three Simple Methods for Dealing with Outliers in Regression Analysis

Davin Cermak
7 min read · Jul 5, 2022


My Motivation for Writing on This Topic

As a freelancer, I am often asked to explain how I handle outliers in my data analysis. As a long-time economist, my first answer is always “it depends,” which is followed by the most important part of my answer: why it depends. In this post, I will discuss how to think about outliers in linear regression. Simple density distribution plots help show how outliers affect linear regression results. I will conclude by discussing three methods for dealing with outliers and when it is appropriate to use each.

This post will explore the effect of an outlier using randomly generated house price data for two different neighborhoods. The assumption will be that house prices between the two neighborhoods are different. Perhaps we are a realtor looking for the neighborhood that best fits our client’s budget, or we want to implement a social program to help low-income homeowners and need to know which of the two neighborhoods to target. Regardless of the reason, I use a linear regression model to better understand these two neighborhoods.

The first step is to generate normally distributed data to mimic a survey of house prices for two fictional neighborhoods. The randomly generated “Low Price” neighborhood (Neighborhood_1) will use a sample of 30 homes with a mean house price of $150,000 and a standard deviation of $20,000, while the “High Price” neighborhood (Neighborhood_2) will use a sample of 30 homes with a mean house price of $350,000 and a standard deviation of $50,000.
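In Python, the two samples described above can be drawn with NumPy. This is a minimal sketch; the variable names and the seed are my own choices, the seed added only so the draw is reproducible:

```python
import numpy as np

# Seed chosen arbitrarily so the sketch is reproducible
rng = np.random.default_rng(42)

# "Low Price" neighborhood: mean $150,000, sd $20,000, n = 30
neighborhood_1 = rng.normal(loc=150_000, scale=20_000, size=30)

# "High Price" neighborhood: mean $350,000, sd $50,000, n = 30
neighborhood_2 = rng.normal(loc=350_000, scale=50_000, size=30)
```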

A Wonderful Linear Regression Model of House Prices

We will start by getting a feel for the data generated under the above conditions and then apply a linear regression model. The following density distribution plot and summary statistics of house price samples from the two neighborhoods show that the distributions around their means are distinctly different, as we were hoping based on our assumption about the neighborhoods.

A few important takeaways here. First, the sample data are normally distributed for each neighborhood, which is well-suited for linear regression modeling. Second, there are no overlapping prices between neighborhoods in the sample. Third, the mean house price in Neighborhood_1 is lower than the mean house price in Neighborhood_2. From just these pieces of information, we can feel pretty good that our assumption about the neighborhoods is correct.

The regression results above give us a great deal of confidence about our assumption being correct. The regression results tell us that, based on our sample data, the likelihood that house prices in the two neighborhoods are similar is extremely low.

In addition, the Q-Q plot below shows graphically that no outliers in the data have a negative effect on the model’s residual values, meaning that the linear regression assumption of normally distributed residuals is met.

Just One Outlier Can Ruin an Otherwise Beautiful Linear Regression Model

So, what would the above analysis look like if we replaced one house price from Neighborhood_1 with a house price of $5 million? There are several reasons this could happen, and I will discuss those later in the article. But for now, we will assume no a priori knowledge of the outlier, nor of the reason for it once it is discovered.
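Continuing the earlier sketch, replacing a single Neighborhood_1 price is one line; notice how much it moves the sample mean and standard deviation:

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed
neighborhood_1 = rng.normal(150_000, 20_000, 30)
print(neighborhood_1.mean(), neighborhood_1.std())   # roughly 150,000 and 20,000

neighborhood_1[0] = 5_000_000   # one $5M outlier
print(neighborhood_1.mean(), neighborhood_1.std())   # mean jumps above $300,000
```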

Once again, we will start by getting a feel for the sample data from the following density distribution plot and summary statistics of sample house prices, and then apply a linear regression model.

The most important difference between this density distribution chart and the previous one is that all the house prices in Neighborhood_2 now fall within the range of Neighborhood_1 house prices. While the Neighborhood_2 prices are normally distributed around their mean, we can no longer say the same about house prices in Neighborhood_1 because of our single outlier price. Also, notice the mean prices of the two neighborhoods. The difference between them is no longer large, and when we consider the very large standard deviation of Neighborhood_1 prices, we have to question our assumption.

The linear regression results from the model with a large outlier do not give us much confidence about our working assumption being correct. Although the constant value, which is the mean house price in Neighborhood_1, remains statistically significant, it is over two times larger than in the model without the outlier. The coefficient for Neighborhood_2 is no longer statistically significant, which further leads us to the conclusion that there may not be a difference in house prices between the two neighborhoods. While the r-squared metric in the first model tells us that 86% of the variance in house prices is explained by the difference in neighborhoods, it is nearly zero in the outlier model. The F-statistic, which tests whether the means of the two populations differ significantly based on the sample data, no longer supports our assumption.

We can now see from the Q-Q plot that there is one outlier in the price data that is ruining our otherwise wonderful linear regression model!

Three Methods for Handling the Outlier

How to deal with outliers depends on understanding the underlying data.

Method 1: “Fuhgeddaboudit…”

One option for dealing with outliers is to drop the observation altogether. This can be a suitable option if further investigation determines that the survey entry was made in error. Perhaps a search of property tax records, or even personal knowledge of properties in the neighborhood, can lead to this conclusion.
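A minimal sketch of Method 1, assuming the $500,000 cutoff used later in the post is enough to identify the confirmed-bad entry:

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed
prices = rng.normal(150_000, 20_000, 30)
prices[0] = 5_000_000            # the entry we have confirmed is an error

# Method 1: drop the outlying observation altogether
cleaned = prices[prices < 500_000]
print(len(cleaned), cleaned.mean())   # 29 observations, mean back near $150,000
```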

The regression results from the “Exclude Outlier” model are very similar to those from the “No Outlier” model examined first, with the only real differences being the constant value and the number of observations. The large r-squared and F-statistic values give us strong confidence that, based on the sample data, house prices in the two neighborhoods are not similar.

Keep in mind that this is largely because I intentionally generated normally distributed data for two groups and added only one outlier. Real-world data sets are seldom this simple and would likely generate regression results that are not nearly as similar to the “No Outlier” case.

Method 2: Replacing the Outlier With Another Value

If there is reason to believe the observation should stay in the model, another option is to set a ceiling or floor for the variable in question. This can be beneficial if we can assume that the price is an anomaly, but the observation carries other independent variables we can reasonably assume are accurate and would be useful to include in the model. Here, we may know that the maximum house price in Neighborhood_1, $196,206, belongs to the most expensive house. Without knowing the true value of the outlier, we can assume that it is no larger than the maximum collected in our original, non-outlier sample, so we will replace the outlier with the maximum house price for Neighborhood_1.
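A sketch of Method 2: replace the outlier with the largest remaining Neighborhood_1 price. (The $196,206 figure in the text comes from the author’s draw; a different seed gives a different maximum.)

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed
prices = rng.normal(150_000, 20_000, 30)
prices[0] = 5_000_000                   # the outlier

# Method 2: cap the outlier at the largest genuine price
ceiling = np.sort(prices)[-2]           # second-largest = largest non-outlier price
capped = np.where(prices > 500_000, ceiling, prices)
print(f"outlier replaced with {ceiling:,.0f}")
```

The sample keeps all 30 observations, so any other variables attached to the outlying house would stay in the model.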

The regression results from this model are similar to both the “No Outlier” and “Exclude Outlier” models. Again, the large r-squared and F-statistic values give us strong confidence that, based on the sample data, house prices in the two neighborhoods are not similar. Keep in mind that using this method on real-world data may also generate regression results that are not nearly as similar to the “No Outlier” case.

Method 3: Assign a Dummy Variable to Outliers

This is often my preferred option when dealing with outliers. It allows the model to use all the sample data and also gives information about the outliers themselves. Here I will simply create a dummy variable equal to ‘1’ if a house price is greater than $500,000 and ‘0’ otherwise.

The difference to notice in this model is that we have gone from a single-predictor to a multiple regression model. Recall again that since we are using binary variables, we are simply calculating the means of each group. The mean of Neighborhood_1 is the same as the mean when we excluded the outlier, as is the coefficient for Neighborhood_2. Adding the Dummy Variable coefficient to the Neighborhood_1 mean gives us the house price of the outlier, $5,000,000. All three coefficients are statistically significant, and the F-statistic gives us strong confidence that, based on the sample data, house prices in the two neighborhoods are not similar.

Interestingly, the r-squared metric, which is now equal to 1.00, tells us that variation in the neighborhood and the dummy variable completely explains the variation in house prices. This is again partly because of the normally generated data. But another issue to consider is that any additional independent/explanatory variable added to a linear regression model, like the dummy variable in this case, will never decrease, and will usually increase, the r-squared statistic. This is true whether or not the added variable is statistically significant.


Davin Cermak

A long-time economist and data analyst moving into the world of freelance consulting.