Onward journeys are really important to BBC Sounds. Building better journeys on our app allows our users to better explore the huge range of audio content we have on offer. In-episode trailing is a great way to recommend shows to listen to next — but how do we know if it’s working?
In an ideal world we’d do this via a standard A/B test, where we split users randomly into control and treatment groups. However, in this specific case we’d also need to create two versions of the audio content, one with the trail and one without, which would require a significant amount of engineering resource.
Despite this, with a bit of extra analytical work, we can still measure the effectiveness of trails. This article will focus on how we applied this approach to a marketing campaign on That Peter Crouch Podcast, which trailed several different podcasts over a number of weeks.
Creating control & treatment groups
The critical feature of the podcast trails was that they occurred at the end of each episode. This might not necessarily be the ideal place to have trails, but it makes creating our testing groups much easier.
This is because it allows us to filter to users who completed at least 80% of the episode, using completion as a proxy for engagement, and then split this group into control and treatment based on whether or not they heard the trail. If the trails had occurred in the middle this wouldn’t have worked: users in the control group would have heard less than half of the episode, so probably weren’t very engaged and wouldn’t have made a suitable comparison group.
One key assumption
This creates our control and treatment groups, but unfortunately there is still a key difference between the two: the treatment group has listened to more content than the control group. Therefore, if we want to test this rigorously, we first need to test the following assumption:
Listening to slightly more of an episode does not make you more likely to listen to the trailed content.
To do this, we conduct an initial assumption test, where we split users who never heard the trail into ‘fake’ control and treatment groups based on how much of the episode they completed.
In this case, neither group has listened to the trail, and therefore we’re directly testing the assumption. If the assumption is true, we should measure no significant difference in conversion between the groups.
Now that we’ve set up our control and treatment groups, we define a framework for measuring success based on what fraction of users are converted to listen to the trailed content within 2 weeks of hearing the trail.
It’s important to ignore users who have listened to the trailed content in the 13 weeks (1 quarter) prior to hearing the trail. This is to stop existing listeners of the trailed content muddying the analysis.
Using the conversion definition we can therefore classify each user as:
- Converted (a success)
- Not converted (a failure)
- Or an existing listener, who is excluded from the analysis
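This classification could be sketched as follows. The function name and its inputs are assumptions for illustration; the 13-week history window and 2-week conversion window come from the article:

```python
from datetime import datetime, timedelta

def classify_user(trail_time, trailed_content_plays):
    """Classify a user from the timestamps of their plays of the
    trailed content, relative to when they heard the trail."""
    history_start = trail_time - timedelta(weeks=13)
    window_end = trail_time + timedelta(weeks=2)
    # Anyone who played the trailed content in the prior quarter is
    # an existing listener and is excluded from the test.
    if any(history_start <= t < trail_time for t in trailed_content_plays):
        return "existing listener"
    # A play within 2 weeks of the trail counts as a conversion.
    if any(trail_time <= t <= window_end for t in trailed_content_plays):
        return "converted"      # a success
    return "not converted"      # a failure

trail = datetime(2024, 3, 1)
print(classify_user(trail, [datetime(2024, 3, 5)]))  # → converted
```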
Using the beta distribution, given by the parameters alpha (successes) & beta (failures), we can generate a probability distribution centred around our conversion rate:
Conversion Rate = Successes / (Successes + Failures)
Measuring impact is then a case of comparing the two beta distributions for control and treatment to see if there is any significant difference. The method I used is described in much more detail in this article.
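This comparison could be sketched with NumPy’s beta sampler. The counts below are illustrative, not the real campaign data, and the Monte Carlo approach is one common way to compare two beta distributions:

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative success/failure counts for each group.
control = {"successes": 440, "failures": 9560}     # ~4.4% conversion
treatment = {"successes": 630, "failures": 9370}   # ~6.3% conversion

# Draw samples from each group's Beta distribution over its
# conversion rate, then estimate P(treatment rate > control rate).
n = 100_000
control_draws = rng.beta(control["successes"], control["failures"], size=n)
treatment_draws = rng.beta(treatment["successes"], treatment["failures"], size=n)
p_uplift = (treatment_draws > control_draws).mean()
print(f"P(treatment > control) = {p_uplift:.3f}")
```

A `p_uplift` close to 1 (or close to 0) indicates a significant difference between the groups, while a value near 0.5 means the two distributions overlap heavily, as in the assumption test below.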
The assumption test
Dealing with the assumption test first: fortunately, we saw no significant results for any of the 6 episodes tested. It’s important to note this doesn’t mean we can definitively say there is no effect from listening to more of an episode; it means we can’t measure any effect at the precision our sample size allows.
The plot below compares the two beta distributions for ‘That Captains Episode’, and shows a large overlap between them.
The actual results
With our assumption tested, the trails themselves produced significant results for every episode. So basically, trailing works!
To give you an example of the effect size, using the same episode from the example above, we see the conversion rate rise from 4.4% with no trail to 6.3% with a trail. Whilst the conversion rate percentages are quite small in general, this represents a relative increase of around 43%.
It was promising to see uplifts across the board, since we trailed a range of content: from other football-specific content, such as Football Daily, to less obvious recommendations, such as Radio 1’s Scott Mills Daily Podcast. This gives our marketing teams more confidence to take risks in recommending less typical onward journeys.
Finally, probably the most positive aspect of these results was that we saw larger increases in conversion rates, in some cases almost 200%, for both our under-35 and infrequent audiences. This makes intuitive sense, since these users tend to be less familiar with our content, but as these are target groups for BBC Sounds it was great to measure the effect directly.
Generalising the method
To summarise the above approach, the key steps of this method are:
- Create control and treatment groups, based on trail listening
- Determine what assumptions allow us to compare these groups as if they were a simple random split
- Define a conversion measure to classify users as successes or failures
- Test the assumption by comparing beta distributions
- Provided the assumption is valid, test the trail’s impact by again comparing beta distributions
Future trail testing at BBC Sounds
This analysis has given us clear evidence that our trailing works and has defined a clear framework for future tests. With this measured, we now have the freedom to experiment and test different trail hypotheses, such as:
- How effective is trailing podcasts or mixes from live radio?
- What kind of content journeys work the best?
- What format of trail is most effective at driving conversions?
This allows our marketing teams to have more confidence in their campaigns and take the risks needed to broaden our audiences’ listening habits.