Log Transformations in Linear Regression

Samantha Knee · The Startup · Jan 19, 2021

The basics of when to use them and how to interpret the results


When building a linear regression model, we sometimes hit a roadblock: the model performs poorly and/or violates the assumptions of linear regression — the dataset in its raw form simply does not work well. When this occurs, a log transformation may be a saving grace. However, a transformation changes the meaning of our model, so we need to be careful about how we interpret the coefficients once one has been applied.

In this article, we will explore the power of log transformation in three simple linear regression examples: when the independent variable is transformed, when the dependent variable is transformed and when both variables are transformed. After reading, you will understand which scenarios call for transforming which variables and what these transformations mean for the interpretation of the model coefficients.

Note: this is a high-level overview of when to use a log transformation and what it means for the interpretation of the model. For mathematical proofs on the concepts and more examples, please see the resources at the end of this post.

What is ‘log’?

Before we dive into linear regression, you may be thinking, “what exactly is a log transformation? I haven’t thought about ‘log’ since 10th grade pre-calc.” Let’s quickly run through the basics of the log function for anyone who may need a refresher (like I certainly did).

Most simply, a logarithm is the inverse of an exponential function. Logarithms can have different bases, just like exponents — for example, log base 10 or log base e. Think of log_b(x) as the power the base b must be raised to in order to obtain x. Therefore, log₁₀(100) = 2.

When talking about log transformations in regression, we are more than likely referring to the natural logarithm — the logarithm with base e — also known as ln, logₑ, or simply log. In other words, the natural log of a value x is the power e must be raised to in order to get x. e is a mathematical constant approximately equal to 2.71828…
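If you want to check these facts numerically, NumPy's log functions make it easy (a quick sketch, not part of the original article):

import numpy as np

print(np.log10(100))   # 2.0 — log base 10: the power 10 must be raised to in order to get 100
print(np.log(np.e))    # 1.0 — np.log is the natural logarithm (base e) in NumPy
print(np.log(1))       # 0.0 — the log of 1 is 0 in any base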

This is what the natural log function looks like graphically:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(start=1, stop=100, num=10**3)   # 1,000 evenly spaced values from 1 to 100
y = np.log(x)                                   # natural log of each value
plt.plot(x, y);
Log Function

Looking at the graph, there are a few aspects of the function we notice immediately:

  • The y-values of small x-values are further apart
  • The y-values of large x-values are closer together

These are the effects of log transforming your variables — small values become more spread out, and large values get pulled closer together. One caveat is that you cannot take the log of zero or of a negative number. Also, log(1) = 0 and log(e) = 1.
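As a quick numerical illustration of that spreading-and-compressing effect (the values below are arbitrary):

import numpy as np

values = np.array([1, 2, 10, 100, 1000, 2000])
print(np.log(values).round(2))   # [0.   0.69 2.3  4.61 6.91 7.6 ]
# The gap between 1 and 2 barely shrinks (1 → 0.69), while the gap between
# 1000 and 2000 collapses from 1000 on the raw scale to 0.69 on the log scale.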

Now that we understand what exactly log is and the effect it has on numbers, let’s get back to the application we’re focused on: When do we use a log transformation in linear regression?

Transforming the Independent Variable

You have checked all of the assumptions of linear regression and found that your error terms are normally distributed with equal variance. However, the relationship between x and y is not quite linear.

For this example, we are going to use the auto-mpg dataset. See the code and plots below as an example of the relationship between miles per gallon (mpg) and the displacement of the car:

import pandas as pd
import seaborn as sns

data = pd.read_csv('auto-mpg.csv')
sns.scatterplot(x='displacement', y='mpg', data=data);
Scatter plot of displacement vs. mpg
sns.histplot(data.displacement, bins='auto');
Histogram of displacement variable

Although a normal distribution of the predictor variable is not a requirement of linear regression, it can help improve the accuracy of our model. Looking at a histogram of the displacement variable, it looks like normality could use some improvement.

The relationship between mpg and displacement doesn’t exactly look linear. Let’s check the results of running a simple linear regression model using displacement as our independent variable.

from statsmodels.formula.api import ols

outcome = 'mpg'
predictor = 'displacement'
formula = outcome + '~' + predictor   # 'mpg~displacement'
model = ols(formula=formula, data=data).fit()
model.summary()
OLS result for mpg vs. displacement

Our R² value is .65, and the coefficient for displacement is -.06. This means that a 1-unit increase in displacement is associated with a .06 unit decrease in mpg.
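If you would rather pull those numbers straight from the fitted statsmodels results object than read them off the summary table, a short sketch:

print(model.rsquared)                # ≈ 0.65
print(model.params['displacement'])  # ≈ -0.06: each extra unit of displacement lowers predicted mpg by about .06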

Now, let’s apply a log transformation to displacement by adding a column to our dataset called ‘disp_log’, and see if using this column as our independent variable improves our model at all:

data['disp_log'] = np.log(data['displacement'])   # log-transform the predictor
outcome = 'mpg'
predictor = 'disp_log'
formula = outcome + '~' + predictor
model = ols(formula=formula, data=data).fit()
model.summary()
Scatter of log of displacement vs. mpg

It worked! The relationship looks more linear, and our R² value improved to .69. As a side note, you will definitely want to re-check all of your assumptions for linear regression to make sure the transformation didn't create a new problem. For this example, we'll just assume the other assumptions are met.
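A rough sketch of what that re-check might look like for the transformed model (these are standard seaborn, statsmodels, and matplotlib calls, not code from the original analysis):

import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.api as sm

# Linearity: scatter of the transformed predictor against mpg
sns.scatterplot(x='disp_log', y='mpg', data=data)
plt.show()

# Normality of residuals: QQ-plot
sm.graphics.qqplot(model.resid, dist=stats.norm, line='45', fit=True)
plt.show()

# Equal variance: residuals vs. fitted values should show no pattern
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, color='red')
plt.xlabel('fitted values')
plt.ylabel('residuals')
plt.show()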

Our work is not done here, though. As data scientists, we can’t just arbitrarily change our data without understanding how it changes our interpretation!

We see our disp_log coefficient is now -12.2, very different from our original coefficient of -.06. Since this is the log of x and not x itself, we can no longer say it represents the unit change in y for a 1-unit change in x. When the independent variable is log transformed, a 1% increase in the independent variable changes the dependent variable by approximately coefficient/100 units. So in this case, a 1% increase in displacement decreases mpg by about .12 units.
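To see where the .12 comes from: with a log-transformed predictor, a 1% increase in x changes the predicted y by coefficient × log(1.01), which is approximately coefficient/100. A quick check against the fitted model (a sketch):

beta = model.params['disp_log']   # ≈ -12.2 from the summary above
exact = beta * np.log(1.01)       # exact predicted change in mpg for a 1% increase in displacement
approx = beta / 100               # the usual rule-of-thumb approximation
print(exact, approx)              # both ≈ -0.12 mpg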

Now that we understand when to log transform our independent variable and what it means for the interpretation of our coefficient, let’s explore a scenario where we transform our dependent variable.

Transforming the Dependent Variable

In this scenario, you have checked your assumptions and found linearity is not a problem, but your residuals are exhibiting non-normality. Let’s revisit our auto-mpg dataset using the ‘weight’ variable.

Weight vs. mpg scatter plot

The relationship looks relatively linear. Next, run the simple regression model to obtain the baseline results.

outcome = 'mpg'
predictor = 'weight'
formula = outcome + '~' + predictor
model = ols(formula=formula, data=data).fit()
model.summary()

An R² value of .69 isn’t terrible. Let’s check the QQ-plot to test for the normality of our residuals.

import scipy.stats as stats
import statsmodels.api as sm

fig = sm.graphics.qqplot(model.resid, dist=stats.norm, line='45', fit=True)
QQ-Plot of weight vs. mpg model

Generally speaking, the closer our blue plot is to the red line, the more normally distributed our residuals are. It looks like normality is being violated, especially at each tail of the data. Let’s see if transforming the dependent variable, mpg, improves our model and makes our residuals more normal.

data['mpg_log'] = np.log(data['mpg'])
outcome = 'mpg_log'
predictor = 'weight'
formula = outcome + '~' + predictor
model = ols(formula=formula, data=data).fit()
model.summary()

Our R² value increased significantly to .77! Now let’s check our QQ-plot:

The normality of our residuals also improved! We now meet the assumptions of linear regression and the accuracy of our model increased. Now, how exactly do we interpret the coefficient when the y-variable is log-transformed?

Since our y variable has been log-transformed, we need to undo the transformation to interpret the coefficient on the original scale. To do this, exponentiate the coefficient, subtract 1, and multiply by 100 to obtain the % change in y for a 1-unit increase in the x variable. Thus, our coefficient of -.0004 means that mpg decreases by roughly .03% for each additional unit of weight.
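Here is that back-transformation in code, pulling the weight coefficient out of the fitted model (a sketch):

beta = model.params['weight']             # the small negative coefficient from the summary above
pct_change = (np.exp(beta) - 1) * 100     # % change in mpg for a 1-unit increase in weight
print(pct_change)                         # a small negative percentage, matching the ~.03% quoted above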

Transforming Independent and Dependent Variables

In the first example, we log transformed the independent variable when our linearity assumption was violated, and in the second example, we log transformed the dependent variable when our residual normality assumption was violated. Naturally, it makes sense to perform both of these transformations if we experience both of these issues!

For this example, we are going to revisit our displacement variable. Let’s see if we can further improve our model by not only log transforming the independent variable but also the dependent variable.

To recap, here is the resulting model and QQ-plot after log transforming the displacement variable.

Our R² value did improve from .65 to .69; however, we still have room for improvement. The residuals of our model also seem to violate normality at the lower and upper tails. Next, let's also transform the mpg variable and see if it causes any improvement.

data['mpg_log'] = np.log(data['mpg'])
outcome = 'mpg_log'
predictor = 'disp_log'
formula = outcome + '~' + predictor
model = ols(formula=formula, data=data).fit()
model.summary()

Our model has improved to an R² of .74! Our QQ plot also shows our residual normality improved.

As you probably guessed, our interpretation of the coefficients has changed again. When both the independent and dependent variables are log transformed, the coefficient represents the approximate % change in y for a 1% change in x. In our model, this means mpg decreases by about .55% when displacement increases by 1%.
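The same kind of check works here: with both variables logged, a 1% increase in x multiplies the predicted y by 1.01 raised to the coefficient, which for a small coefficient is roughly a coefficient% change. A quick sketch using this last fitted model:

beta = model.params['disp_log']      # ≈ -0.55 from the summary above
exact = (1.01 ** beta - 1) * 100     # exact % change in mpg for a 1% increase in displacement
print(exact, beta)                   # both ≈ -0.55: a 1% rise in displacement → roughly a .55% drop in mpg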

Hopefully these examples helped visualize when to log transform which variables and how to interpret the output of your transformed model. For more resources and proof on how these concepts work, these links are very helpful:

· https://data.library.virginia.edu/interpreting-log-transformations-in-a-linear-model/

· https://stats.idre.ucla.edu/sas/faq/how-can-i-interpret-log-transformed-variables-in-terms-of-percent-change-in-linear-regression/

· https://online.stat.psu.edu/stat462/node/85/
