Experimentation: How we kick-started headline and image testing

Sabine Kejlova
Published in News UK Technology
5 min read · Jan 22, 2024

Being part of Data Tech's experimentation team comes with a great perk: we often meet colleagues from many different parts of the business, and with that comes the opportunity to work across a range of interesting projects. Last year was particularly exciting in that regard, as for the first time we teamed up with the editorial teams at The Sun and The Times. The project was homepage headline and image testing, something long on newsrooms' wishlists but not something we had done before.

The initiative required effort and dedication from everyone involved. For three days running, page editors had to be on the lookout for testing opportunities and supply us with headline alternatives. But before we could even proceed with the experimentation, it was our team's task to figure out how best to approach it from a technical, strategic, and analytical perspective.

I will now share details on how we tackled these three areas.

Custom-built headline testing extension in Optimizely

Technology: Building a reusable extension

As we did not have a tool that specialises in headline testing, we set out to build a solution within our existing experimentation tool, Optimizely. The solution had to provide the following:

  • The ability to launch experiments quickly, to keep up with the pace of article publishing
  • A reliable way to set up custom events that isolate clicks on tested headlines
  • Zero risk of applying a headline to the wrong article should the homepage position of an article change

Thankfully, our engineer Dani met all these needs by creating a custom extension. Because the extension targeted headlines using HTML attributes rather than page position, it posed no risk of 'experiment spilling'. And with clearly defined fields for alternative headlines, image URLs, and custom event names, it also worked as a reusable template that we simply cloned and updated every time we had a new experiment. That way, even a relative experimentation novice could get a new test up and running in well under five minutes.
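To give a flavour of how this works, here is a minimal sketch of the kind of DOM logic such an extension might run. It is illustrative only: the `data-article-id` attribute, the field names, and the selectors are assumptions made for this post, not our actual markup or Optimizely's extension template API; only the `window.optimizely.push` event call is Optimizely's public JavaScript API.

```typescript
// Minimal sketch of the extension's DOM logic; attribute and field
// names are hypothetical, not Optimizely's actual extension API.

// Extend the global Window type so the Optimizely client queue is typed.
interface Window {
  optimizely?: Array<Record<string, unknown>>;
}

interface HeadlineTestFields {
  articleId: string;      // stable article ID, immune to position changes
  headlineText: string;   // alternative headline to swap in
  imageUrl: string;       // alternative image to swap in
  clickEventName: string; // custom event isolating clicks on this headline
}

function applyHeadlineVariant(fields: HeadlineTestFields): void {
  // Target by attribute rather than by homepage position, so the swap
  // follows the article wherever it moves and cannot 'spill' onto a
  // neighbouring card.
  const card = document.querySelector<HTMLElement>(
    `[data-article-id="${fields.articleId}"]`
  );
  if (!card) return; // article no longer on the page: do nothing

  const headline = card.querySelector<HTMLElement>('h2, h3');
  const image = card.querySelector<HTMLImageElement>('img');
  if (headline) headline.textContent = fields.headlineText;
  if (image && fields.imageUrl) image.src = fields.imageUrl;

  // Fire a named custom event through Optimizely's public JavaScript API
  // so clicks on the tested headline are isolated in the results.
  card.addEventListener('click', () => {
    window.optimizely = window.optimizely || [];
    window.optimizely.push({ type: 'event', eventName: fields.clickEventName });
  });
}
```

Targeting by a stable attribute, rather than by position on the page, is what made the 'zero risk of overwriting' requirement achievable.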

Experiment control: Teased headline and image
Experiment variant: Headline and image reveal the content

Strategy: Grouping headlines into categories

When it came to deciding what to test, a lot was on the table. Our main challenge therefore lay in giving the experimentation structure and establishing priorities. To narrow the focus, we proposed grouping headlines into categories based on how they are written.

For example:

  • Questions vs statements
  • Simple vs detailed
  • Using a first-person narrative vs not

After the editorial teams put together a list of these categories, we worked with them to select the four most frequently used and concentrated on those.

We also planned minimum and maximum running times in advance. Traffic and engagement on our titles' homepages are substantial, so we trusted that some experiments would reach statistical significance quickly. But as important as conclusive results were, we also needed large data sets for in-depth analysis. For that reason, we decided to keep experiments running for at least one hour even if they reached significance sooner. Conversely, experiments that drew little engagement and proved to be slow burners wouldn't run for longer than four hours, as they would lose relevance and muddle the data.
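Expressed as a simple guard rail, the stopping rule looks something like the sketch below. The names are hypothetical, and the significance flag is whatever the experimentation tool reports.

```typescript
// Sketch of the running-time guard rails described above.
const MIN_RUNTIME_MINUTES = 60;  // keep gathering data even after significance
const MAX_RUNTIME_MINUTES = 240; // slow burners lose relevance beyond this

function shouldStopExperiment(
  elapsedMinutes: number,
  isSignificant: boolean
): boolean {
  if (elapsedMinutes < MIN_RUNTIME_MINUTES) return false; // always run the first hour
  if (isSignificant) return true; // conclusive and past the minimum: stop
  return elapsedMinutes >= MAX_RUNTIME_MINUTES; // inconclusive: cap at four hours
}
```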

Analysis: Comparing click-through rates

The core aim of our analysis was to establish which headline variants performed better within their given categories, and to assess the validity of the overall learnings. After consulting the Data Science team, we decided to achieve this by the following, sketched in code after the list:

  • Calculating a cumulative click-through rate for each category's variants
  • Establishing the percentage variance between them
  • Using a confidence calculator to determine the statistical confidence of each result
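For illustration, here is a hedged sketch of those three calculations. The confidence calculator we used isn't named here, so this assumes a standard two-proportion z-test; the interfaces and numbers are hypothetical.

```typescript
// Illustrative sketch of the three calculations above, assuming the
// confidence calculator is a two-proportion z-test.
interface VariantStats {
  clicks: number;
  impressions: number;
}

const ctr = (v: VariantStats): number => v.clicks / v.impressions;

// Percentage variance of the variant's CTR against the control's.
const percentVariance = (control: VariantStats, variant: VariantStats): number =>
  ((ctr(variant) - ctr(control)) / ctr(control)) * 100;

// Standard normal CDF via the Abramowitz & Stegun polynomial approximation.
function normalCdf(z: number): number {
  const t = 1 / (1 + 0.2316419 * Math.abs(z));
  const d = 0.3989423 * Math.exp((-z * z) / 2);
  const p =
    d * t * (0.3193815 + t * (-0.3565638 + t * (1.781478 + t * (-1.821256 + t * 1.330274))));
  return z >= 0 ? 1 - p : p;
}

// Two-sided confidence that the variants' CTRs truly differ, e.g. 0.95.
function confidence(control: VariantStats, variant: VariantStats): number {
  const pooled =
    (control.clicks + variant.clicks) / (control.impressions + variant.impressions);
  const se = Math.sqrt(
    pooled * (1 - pooled) * (1 / control.impressions + 1 / variant.impressions)
  );
  const z = Math.abs(ctr(variant) - ctr(control)) / se;
  return 2 * normalCdf(z) - 1; // P(|Z| < z)
}
```

With, say, 120 clicks on 5,000 impressions for the control and 160 clicks on 5,000 for the variant, `confidence` comes out at roughly 0.98, comfortably past a 95% threshold.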

On The Sun we received conclusive results for three out of four categories, while on The Times we got a definite answer for only one of the four. When exploring what could affect the learn rate (a percentage metric that captures the number of statistically significant winning and losing experiments relative to the number of inconclusive ones), we found that the level of contrast between variants, the position on the page, and potentially subjective readings of the category definitions all played a role in our chances of attaining conclusive results. Additionally, sample sizes among the tested categories weren't equal, so some headline themes did not gather as much data as others.

In addition to the outcomes of the category testing, we also analysed:

  • Overall conclusive rate
  • Difference in results between desktop and mobile
  • Maxima and minima
  • Character length variables

We also calculated the median time to reach results, which ranged from 40 to 60 minutes depending on the title. This was a useful finding, as some headline testing tools on the market promise results in a shorter time frame, which could lead to false discoveries.

Headline themes sample size

What next

The main objective of this initiative was to evaluate the overall size of the opportunity for landing page headline and image testing, and to validate some specific headline writing styles. By and large, the pilot was successful on both counts, although results and impact differed between the titles. Still, we walked away with several options for carrying the project beyond its first phase.

For example:

  • Providing training for page editors, so they can continue headline testing within our current tool before a longer-term solution is worked out.
  • Looking into the macro impact on site-wide article views, as some experiments showed promising results in this area.
  • Revisiting the categories that didn't reach conclusive results and searching for new, more prolific angles.

In all three scenarios, editorial needs and feedback will be the key drivers along the way.
