An introduction to regression analysis for marketers — part 2

Techniques for when the data isn’t a straight line…

Chris Bow
Marketing And Growth Hacking
11 min read · Oct 21, 2018


If I were to recommend one technique for marketers who want to understand their data better, it would be linear regression. It’s quick, it’s relatively easy to get to grips with and, as I hope I got across in a previous article, the code to perform a regression analysis in R is very straightforward.

In this article, we’ll expand on part one and look at a real-world dataset from a pay-per-click campaign to explore what we can do when our dataset starts to curve away from a straight line.

For further adventures where marketing meets data science, follow Chris on Twitter.

The limits of the straight line

In the introduction to regression article and the article on investigating display campaign performance, the datasets we used produced a set of points that could be reasonably well represented with a straight line. In the real world, with real datasets, this may well not be the case every time.

If you’re familiar with the trendline options in Excel, you’ll know that there are a number of trendline models available, including linear, power, logarithmic and polynomial. Selecting one of these can often give you a line that fits the data better. Let’s look at a dataset from a PPC account where a straight line isn’t the best fit and look at how we can perform a regression in R.

While eyeballing a PPC account on a particular paid search platform, it looked as though on the days when I had increased my budget, my cost-per-click was increasing as well. I wasn’t getting anything else for that spend, average position was about the same and I can’t imagine that the increases in my modest campaign spends were sufficient to cause a noticeable supply and demand problem across the platform, so I thought I’d take a closer look.

For this article the data have been altered, but the relationship between spend and CPC remains the same. We’ll start by loading the libraries we’re going to use, importing the data and having a quick look:

library(broom)
library(ggplot2)
library(readr)
library(dplyr)
# import data
cost_cpc <- read_csv("cost_cpc.csv")
glimpse(cost_cpc)
Observations: 153
Variables: 3
$ cost <dbl> 38.794, 43.010, 47.906, 69.734, 87.890, …
$ clicks <int> 21, 22, 24, 35, 36, 29, 37, 33, 28, 35, 26, …
$ ave_cpc <dbl> 0.54, 0.57, 0.59, 0.59, 0.72, 0.60, 0.59, …

We’ve got a simple dataframe with three variables — cost, clicks and ave_cpc — and 153 observations representing 153 days of PPC activity. Let’s go straight in and create a linear model and plot our regression line over a scatter plot of the data:

# linear model
lin_mod <- lm(ave_cpc ~ cost, data = cost_cpc)
summary(lin_mod)
Call:
lm(formula = ave_cpc ~ cost, data = cost_cpc)
Residuals:
Min 1Q Median 3Q Max
-0.19765 -0.06617 -0.01428 0.06364 0.27773
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.4741745 0.0152668 31.06 <2e-16 ***
cost 0.0017947 0.0001587 11.31 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.09939 on 151 degrees of freedom
Multiple R-squared: 0.4585, Adjusted R-squared: 0.4549
F-statistic: 127.8 on 1 and 151 DF, p-value: < 2.2e-16
# augment and plot
aug_lin <- augment(lin_mod)
ggplot(aug_lin, aes(cost, ave_cpc)) +
  geom_point() +
  geom_line(aes(y = .fitted)) +
  labs(title = "Average CPC against Cost - Linear model",
       caption = "Adjusted R-squared: 0.4549",
       x = "Cost",
       y = "Average CPC") +
  theme_minimal()

I think we can see that, while the summary of our linear model tells us our coefficients are statistically significant, a quick look at the plot tells us that a simple straight-line representation of the relationship isn't the right one to go for.
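Beyond eyeballing the fitted line, a quick diagnostic we can build from the augmented data frame we already have is a plot of residuals against fitted values: if a straight line were adequate, the residuals should scatter randomly around zero, and a systematic bow in them is a sign we're missing some curvature. A minimal sketch:

# residuals against fitted values: random scatter around zero suggests
# the linear model is adequate; a curved pattern suggests it isn't
ggplot(aug_lin, aes(.fitted, .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  labs(x = "Fitted values", y = "Residuals") +
  theme_minimal()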

Along came poly

As we can do in Excel, we'll add a curve to our line by fitting a polynomial model. Polynomial models add extra terms to the equation of our line, increasing the complexity of the shapes of line we are able to draw. We can add the square of our x value to build a quadratic model, the cube to make a cubic model and so on, but needing to go beyond a cubic model is often a sign that you're heading down the wrong avenue with this type of analysis and should consider another method.

While our straight line has the general equation y = ax + b, our quadratic model adds an x² term to give the general equation y = ax² + bx + c. By changing the values of a, b and c, and flipping signs between positive and negative, the model can produce a lot of different shapes of curve to try to get a better fit to our data.

To do this in R, we first create an additional column in our dataframe that contains the x² value. In our case, this is simply a column that contains the value of the cost squared:

# quadratic model - create squared column
cost_cpc$quad_cost <- cost_cpc$cost^2
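As an aside, the extra column isn't strictly necessary: R's formula interface can build polynomial terms inline, either by wrapping the arithmetic in I() or by using poly(). A sketch of two equivalent ways to fit the same quadratic model:

# I() tells R to treat cost^2 as arithmetic rather than formula syntax
quad_mod_alt <- lm(ave_cpc ~ cost + I(cost^2), data = cost_cpc)
# poly() generates the polynomial terms for us; raw = TRUE gives
# coefficients that match the explicit-column approach
quad_mod_poly <- lm(ave_cpc ~ poly(cost, 2, raw = TRUE), data = cost_cpc)

The explicit column arguably keeps things more transparent for a first pass, so we'll stick with it here.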

This done, we can build our model as before, but simply add the additional term to our model:

# build quadratic model
quad_mod <- lm(ave_cpc ~ cost + quad_cost, data = cost_cpc)
summary(quad_mod)
Call:
lm(formula = ave_cpc ~ cost + quad_cost, data = cost_cpc)
Residuals:
Min 1Q Median 3Q Max
-0.201105 -0.072335 -0.006016 0.058538 0.301007
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.874e-01 2.746e-02 14.107 < 2e-16 ***
cost 3.633e-03 5.167e-04 7.032 6.72e-11 ***
quad_cost -6.894e-06 1.853e-06 -3.721 0.00028 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.09516 on 150 degrees of freedom
Multiple R-squared: 0.5057, Adjusted R-squared: 0.4991
F-statistic: 76.74 on 2 and 150 DF, p-value: < 2.2e-16

Looking at the model summary, we can see that we've improved our model fit (according to the adjusted R-squared figure) from 0.4549 with the linear model to 0.4991 with the quadratic model. By adding the quadratic term, we can explain roughly an extra 4% of the variability in our dataset. Looking at the coefficient of our quadratic term quad_cost, we can see that its contribution to the model is statistically significant. By adding our fitted values and plotting:

# augment and plot
aug_quad <- augment(quad_mod)
ggplot(aug_quad, aes(cost, ave_cpc)) +
  geom_point() +
  geom_line(aes(y = .fitted)) +
  labs(title = "Average CPC against Cost - Polynomial (quadratic) model",
       caption = "Adjusted R-squared: 0.4991",
       x = "Cost",
       y = "Average CPC") +
  theme_minimal()

… we can now see that we have a line that looks to be a much better fit for our data. What happens if we go to a cubic model? Again, we create a new column containing the cubed term and add it to our model:

# cubic model - create cubed column
cost_cpc$cube_cost <- cost_cpc$cost^3
# build cubic model
cube_mod <- lm(ave_cpc ~ cost + quad_cost + cube_cost, data = cost_cpc)
summary(cube_mod)
Call:
lm(formula = ave_cpc ~ cost + quad_cost + cube_cost, data = cost_cpc)
Residuals:
Min 1Q Median 3Q Max
-0.197066 -0.070269 -0.009998 0.053463 0.310063
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.062e-01 5.252e-02 5.831 3.29e-08 ***
cost 6.300e-03 1.562e-03 4.033 8.76e-05 ***
quad_cost -3.008e-05 1.296e-05 -2.321 0.0216 *
cube_cost 5.387e-08 2.981e-08 1.807 0.0727 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.09445 on 149 degrees of freedom
Multiple R-squared: 0.5163, Adjusted R-squared: 0.5066
F-statistic: 53.02 on 3 and 149 DF, p-value: < 2.2e-16

According to the R-squared, we're now explaining around 1% more of our variability than the quadratic model, but the significance of our cubic term is > 0.05. Yes, you can argue about the arbitrary use of the 5% significance level, but it's a reasonable place to start. However, we won't throw out our cubic term just yet simply because its p-value is 0.07. Let's see what the line looks like on the data:

# augment and plot
aug_cube <- augment(cube_mod)
ggplot(aug_cube, aes(cost, ave_cpc)) +
  geom_point() +
  geom_line(aes(y = .fitted)) +
  labs(title = "Average CPC against Cost - Polynomial (cubic) model",
       caption = "Adjusted R-squared: 0.5066",
       x = "Cost",
       y = "Average CPC") +
  theme_minimal()

I don’t know about you, but that line worries me. It might capture that cluster of points in the top right a bit better, but when I look at those data, I see a curve that is flattening off, not one that’s about to head off upwards again.

Going non-linear

To my eyes, these data look like they're flattening off, following something like a logarithmic or saturating curve. How do we model that in R? To do this, we need to head into the world of non-linear regression, leaving behind our lm function and using the nls function instead. If you're interested in non-linear regression, I'd suggest spending a bit of time scouring the web for further tutorials that focus on non-linear regression in your field of interest, but we'll take a quick look at it here to give you a taste.

The main difference with non-linear regression in R is that we need to be much more explicit in specifying the equation of our model up front; it's not as simple as adding terms to a linear regression formula and letting R work out the rest. What we can do, though, is make a best guess at the form of the curve and, for the purposes of this introduction, let R use its default starting values:

# build non-linear model
nls_mod <- nls(ave_cpc ~ a * cost / (b + cost), data = cost_cpc)
Warning message:
In nls(ave_cpc ~ a * cost/(b + cost), data = cost_cpc) :
No starting values specified for some parameters.
Initializing ‘a’, ‘b’ to '1.'.
Consider specifying 'start' or using a selfStart model

summary(nls_mod)
Formula: ave_cpc ~ a * cost/(b + cost)
Parameters:
Estimate Std. Error t value Pr(>|t|)
a 0.92903 0.03477 26.723 < 2e-16 ***
b 33.58116 3.96695 8.465 2.08e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.09426 on 151 degrees of freedom
Number of iterations to convergence: 6
Achieved convergence tolerance: 1.028e-06

The first thing to spot is that we don’t get a value for R-squared, so we can’t use that to compare our non-linear model with our linear models. If we plot our model:

# augment and plot
aug_nls <- augment(nls_mod)
ggplot(aug_nls, aes(cost, ave_cpc)) +
  geom_point() +
  geom_line(aes(y = .fitted)) +
  labs(title = "Average CPC against Cost - Non-linear model",
       caption = "Residual standard error: 0.09426 on 151 degrees of freedom",
       x = "Cost",
       y = "Average CPC") +
  theme_minimal()

… we can see that it is the sort of shape that we’re going for, so maybe we’re on the right lines and we could spend some time working with this model to see if we can improve it.
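If we did want to improve it, a sensible first step, prompted by the warning R printed earlier, is to supply our own starting values rather than letting both parameters default to 1. The guesses below are read off the plot (a is roughly the ceiling the curve flattens towards, b the cost at which we reach about half of it) and are illustrative rather than definitive:

# refit the same curve with explicit starting values read from the plot:
# a is the ceiling average CPC appears to flatten towards,
# b is the cost at which we reach roughly half of that ceiling
nls_mod2 <- nls(ave_cpc ~ a * cost / (b + cost),
                data = cost_cpc,
                start = list(a = 0.9, b = 30))
summary(nls_mod2)
# alternatively, stats::SSmicmen() is a self-starting version of this
# same saturation curve and removes the need to guess at all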

Measuring better

But, without a simple R-squared value, how can we compare our models to find out which really is the best? As we found with the cubic model, a higher R-squared doesn't guarantee that the model looks right. There is more to assessing how good a model is than the R-squared: your knowledge of the underlying mechanism, and simply what works intuitively, are also important.

What we can do though, is a statistical test called an analysis of variance. This is often used for comparing differences in the means of multiple groups, but we can use it here to compare our models. The code is quick and easy, so let’s compare our linear and quadratic models:

# Analysis of variance
# linear against quadratic
anova(lin_mod, quad_mod)
Analysis of Variance Table
Model 1: ave_cpc ~ cost
Model 2: ave_cpc ~ cost + quad_cost
Res.Df RSS Df Sum of Sq F Pr(>F)
1 151 1.4917
2 150 1.3584 1 0.1333 14.72 0.0001833 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Our test says that there is a statistically significant difference between these models; what about our quadratic against cubic?

# quadratic against cubic
anova(quad_mod, cube_mod)
Analysis of Variance Table
Model 1: ave_cpc ~ cost + quad_cost
Model 2: ave_cpc ~ cost + quad_cost + cube_cost
Res.Df RSS Df Sum of Sq F Pr(>F)
1 150 1.3584
2 149 1.3292 1 0.029141 3.2665 0.07273 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

No significant difference. You may have noticed that this p-value is exactly the same as the p-value of the cubic term's coefficient when we looked at the model summary earlier. In this case, we would likely reject the cubic term and go with the quadratic model. When in doubt, keep it simple! This nested-model comparison doesn't work for setting our linear models against our non-linear model, though, so we have to employ some investigation and some intuition.
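One quick numeric check that does work across both lm and nls fits on the same response is an information criterion such as AIC, which rewards fit but penalises extra parameters; R's built-in AIC function accepts both model types, and lower values are better. A minimal sketch using the models we've already built:

# AIC works across lm and nls fits on the same response variable;
# lower is better, and differences of 2 or more are worth noting
AIC(lin_mod, quad_mod, cube_mod, nls_mod)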

Another thing we can do is create a scatterplot of the actual values against the fitted values predicted by each model. Our augmented objects contain what we need, so it's quick and painless using R's base plotting function:

plot(aug_quad$ave_cpc, aug_quad$.fitted, main = "Quadratic model")
plot(aug_nls$ave_cpc, aug_nls$.fitted, main = "Non-linear model")
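These plots are easier to judge with a 45-degree reference line, since points on that line are perfect predictions, and if we do want a single number, the squared correlation between actual and fitted values serves as a rough pseudo-R-squared. A minimal sketch for the non-linear model:

# redraw the actual-vs-fitted plot and add a 45-degree reference line:
# the closer the points sit to it, the better the model's predictions
plot(aug_nls$ave_cpc, aug_nls$.fitted, main = "Non-linear model")
abline(0, 1, lty = 2)
# squared correlation of actual and fitted as a rough pseudo-R-squared
cor(aug_nls$ave_cpc, aug_nls$.fitted)^2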

It might not be the simple "look at the one important number" type of answer we might hope for, but I think the non-linear model gives us a nicer straight line across the whole of the dataset, so I'd be tempted to go with that.

Ultimately, it comes down to a combination of the statistics, your own understanding and intuition of the problem and the field, and what is likely to work in practice. You might be able to build a great model that predicts a customer’s future spend, but if it takes a ten minute survey of each customer to get the data you need, it’s unlikely to work in practice.

Regression is a great technique for delving into marketing data. If you have a complex mix of marketing activities, regression can help you understand their effectiveness. Hopefully, these two articles have demonstrated that performing regression analysis in R is fairly straightforward (and free!), so why not install the software, import your data and see what insight you can pull out?

What does this mean for my dataset? It certainly looks convincing that my average CPC goes up when my daily budget does; time for some more investigation and perhaps a chat with the account manager.
