SEO A/B Testing With Predictive Analysis

Illustrated by Anna Sudit

A/B testing is a standard procedure for using data to inform decision-making in the tech world. Modifications introduced to a product can be compared to either another variation or some baseline, allowing us to understand exactly just how much this modification increases the value of the product or introduces risk. When we see significant positive results, we roll out the change with confidence.

This approach is pretty simple with conversion-based A/B testing: you compare the number of conversions occurring on two variations of a page. Conversions could be any measureable behavior — checkouts, link clicks, impressions on some asset, or reaching a specified destination. The math for this also works out pretty simply: a rate or mean calculation and a p-value calculation on that value.

But the methodology for A/B testing conversion rates doesn’t translate to an SEO context!

First, SEO-based modifications don’t usually affect user behavior; they affect how search bots rank the page. Second, it doesn’t make sense to create 2 different versions of the same page because search rankings are negatively impacted by duplicates. Third, you can’t directly compare test and control groups in an SEO context, because SEO behavior is different for every single page. Lastly, there isn’t a clear conversion metric. With SEO-based testing, in the absence of knowledge of the inner workings of Google’s ranking algorithm, we want to see if our changes affect the volume of traffic we receive from search services like Google. Thus in order to understand how SEO based changes affect traffic, we need to compare the performance after the change to the performance we’d expect. To do so we created SEOL (pronounced soul), the Search Engine Optimization Legitimator.

SEOL Oracle — Prediction vs. Reality

SEOL fits a forecasting model to historical data for a group of site pages in question. Once the model is fit, the performance of the group of pages is forecasted for the dates from the starting of the intervention (launch date), till the end of the test. Finally, when we have our forecast and we’ve collected data about actual performance, we perform tests to measure whether deviations from our expectation (forecast) are statistically significant. If we see significant results in the test group, either positive or negative, and we don’t see the same fluctuations in the control group, we can confidently say that our changes have made an effect. If both groups are affected in the same direction, even if the deviation is significant, we can conclude that the deviation was something systemic, and was not caused by the update in question.

This is typically a 3 part process: group selection, performance forecasting, and significance testing. I’d like to explain each one of these steps in more detail.

Group Selection

Before we can perform this analysis we must first select stories to be in the test and control groups. We want to select groups that are the most statistically similar — you can think of this as selecting groups with the highest Pearson correlation coefficient. In addition to being correlated, you also want them to be in the most similar scale possible, e.g. the minimum Euclidean distance between the two time series (using time step as a dimension). In simpler language, select groups that look/perform the same, as much as possible. We must also be wary of selecting groups that have a similar number of stories in them, each of which perform proportionately similar as well. If the test group has 10 stories, and 5 of them contribute 80% of the performance (ideally each story contributes an equal amount), then there should also be 10 items in the control group, 5 of which contribute 80% of the performance. In addition, it is also important to avoid seasonally driven stories that could act as outliers, throwing off the test results. The goal of group selection is to identify groups that perform similarly and won’t have any unexpected perturbations to group performance.

SEO Performance Forecasting

The task of forecasting a group of pages performance breaks down into two parts:

  1. separating the performance data we already have into its behavioral cadence and metric trends;
  2. modeling/forecasting in each component, and recombining them.

Let’s start with discussing the decomposition of the performance data.

Below is an illustration of the signal decomposition results. We decompose the signal so that we can model the underlying shape that characterizes the general behavior in the data. This is called making the data stationary (you can learn more about that here). There are various approaches to this, but we stationarized the data by taking a rolling mean of the data, in which any given point represents average of some previous time periods (e.g., a week), and then subtracting that moving average from the vanilla signal. Thus, perturbations due to the ebb and flow of story popularity are cancelled out, leaving behind the natural cadence of weekday/weekend behavior. The moving average is also plotted, on the right, because this is the trend component. The trend represents the popularity of the group’s stories over time.

Now that we’ve decomposed our signal, we have to forecast performance from the intervention date until the end of the dates in the data set. This is done using regression methods. We used Polynomial Regression to model the stationary signal, and linear regression to model the trend.

Once we’ve made forecasts on our decomposed trend and stationary signals, we simply sum them to recover our original data. Now that we have our performance data and forecasts, we can perform our statistical tests. Note that we also model what our predictions look like plus and minus 1 standard deviation. This helps us get a sense of the kinds of variations that should fall into an acceptable range.

Forecast vs. Reality Statistical Significance Testing

Now that we have our data representing the reality of how the pages performed as well as our forecast for what we expected in that same time frame, we can perform statistical significance tests to see whether our expectation meets the reality. For this task we use 2-sided paired t-tests. The paired is designed for before/after testing. Here we use the forecast as our ‘before’ data, and the reality as our ‘after’ data. Here’s a bit more detail:

“Examples for the use are scores of the same set of student in different exams, or repeated sampling from the same units. The test measures whether the average score differs significantly across samples (e.g. exams). If we observe a large p-value, for example greater than 0.05 or 0.1 then we cannot reject the null hypothesis of identical average scores. If the p-value is smaller than the threshold, e.g. 1%, 5% or 10%, then we reject the null hypothesis of equal averages.”
Quote found here (I’m using this exact software to perform the test)

In addition we also test to see if the reality equals forecast plus/minus 1 standard deviation, and plus/minus ½ standard deviation to get more precision on how the performance actually turned out. So when we perform our paired t-test, for example, comparing reality & expectation + 1 standard deviation, we receive a p-value. If the p-value is below 0.05 then we reject the hypothesis that they are equal, otherwise we accept. If the test group has a significant result that the control group does not also have, then we can conclude that there was indeed a real change in the test group.

Post-Analysis notes

After the duration of the test period, one may want to inspect how the stories in each group actually performed. If a story had an outlier performance, perhaps due to unforeseeable circumstances, then you may want to remove it from the analysis. For example, David Bowie content got a boost in the time surrounding his death. There’s no way for us to have predicted that, and hence it would alter how our groups perform against the expectation. If this is the case, find the story or stories needing removal, and run the test again. In addition, in the case of online publication, like our business, we found that it makes more sense to analyze only pages (stories) that were published before the test period. Older stories tend to have a trend they follow, making modeling more straight-forward and effective. Lastly, try not to perform these types of tests during periods with seasonal effects, such as a major holiday.

Results Overview

We’ve recently completed a big project as a department. Our goal was to move our slideshow template to a new technology stack. This stack included a new front and back end, making our pages faster, and adding in some new capabilities. But before we planned a full release, we wanted to conduct a test on a small set of stories to ensure our new template didn’t introduce any unanticipated negative impact on SEO.

So we chose test and control groups that included about 8 thousand stories, let them run for a couple weeks, and conducted our SEOL analysis. You can see those results below.

Test Group

Actual performance with expectation and +/- 1 standard deviation

  • performance = expected + std: pval = 1.05651517695e-05
  • performance = expected + std/2: pval = 9.58082513196e-09
  • performance = expected: pval = 0.310003313956
  • performance = expected — std/2: pval = 3.78494096065e-10
  • performance = expected — std: pval = 0.000193616202526

Control Group

Actual performance with expectation and +/- 1 standard deviation

  • performance = expected + std: pval = 0.000571541676843
  • performance = expected + std/2: pval = 1.28055991188e-09
  • performance = expected: pval = 0.303194513375
  • performance = expected — std/2: pval = 7.21701889467e-11
  • performance = expected — std: pval = 2.45454965489e-05

Interpretation

Both the test and control groups showed that we must accept the hypothesis that performance is at expectation. This means that the new template appeared to perform on par with the old template at garnering organic traffic. This was good news for us, implying that our work didn’t have negative effects on crucial search traffic. This was our goal!

Conclusion

In order to test SEO-related changes we need to evaluate how they perform versus some expectation. Because SEO relates to the behavior of people and products outside of our ecosystem, we can make guesses on how they will perform, and then test against that, but directly comparing them is not a valid test method. This gives us the capability to test any SEO based changes with great detail, and a reliable way to ensure we’re not launching products that hurt us.

So to anyone needing a better way to understand how SEO-based changes affect your products, try using an approach like SEOL.