Recommendations Part 2: Our experiment to test personalized calls to action

Eric Bolton · Published in LocalAtBrown · Apr 22, 2021

How we created our experiment and how we will know it has been successful (or not)

Here we are, about to launch our first experiment! We’re excited at the prospect of seeing our recommendation module in production, helping our current and future newsroom partners with audience engagement and content discovery while learning from the data it captures. To recap our last post, the idea is that the articles recommended by the module will help us create smarter, more personalized calls to action, which should then (we hypothesize) lead to increased engagement and revenue. To test that hypothesis, we came up with this experiment: a recommendation module placed on the article pages of a newsroom partner’s website.

How did we create the experiment?

We coordinated with the audience and data team at a local newspaper based in Washington, D.C. to design the parameters of the experiment. In our initial meetings, we identified a need, and an opportunity to meet it. We learned our partner did not have sufficient insight into their readers’ exact interests: why they favored certain kinds of articles over others, and what the different categories of appeal (local news, restaurants, relationship advice) might be. So the need was to gain insight into readers’ interests and behaviors in order to serve them better. Further, our partner’s goal was to direct readers towards articles that would deepen their brand relationship, rather than simply guide them towards high-click but potentially “low-value” (or less relevant) content. This presented the opportunity to create a “more articles to read” module, populated using smarter logic, that would replace their existing “recent articles” module (based simply on publish time), while also providing a deeper understanding of their readers.

Next, our UX engineer and engineering lead set up collection of live user data from our partner’s website, using Snowplow Analytics to track Google Analytics data as it streamed in. Meanwhile, our senior data engineer and I (a data scientist and machine learning engineer) built a system that could train on the live data and recompute recommendations hourly. Finally, we designed the front-end module that displays recommendations for each article. We mocked up the planned module with a Tampermonkey script that swapped out the existing “Recent articles” module, which let us present and test our work easily. With approval from our stakeholders, we started building the final module for deployment.
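
To make the hourly training step concrete, here is a minimal sketch of the kind of job we are describing, assuming a simple item-to-item collaborative-filtering model over pageview data; the function names and schema are illustrative, not our production pipeline.

```python
# A minimal, hypothetical sketch of an hourly recommendation job.
# Assumes pageview events with (user_id, article_id) pairs; the real
# pipeline, data schema, and model differ.
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def compute_recommendations(pageviews: pd.DataFrame, top_n: int = 5) -> dict:
    """Item-to-item collaborative filtering over recent pageviews."""
    # Build a user x article interaction matrix (1 = user viewed article).
    matrix = pd.crosstab(pageviews["user_id"], pageviews["article_id"]).clip(upper=1)

    # Cosine similarity between article columns.
    sims = pd.DataFrame(
        cosine_similarity(matrix.T),
        index=matrix.columns,
        columns=matrix.columns,
    )

    # For each article, keep the top-N most similar other articles.
    recs = {}
    for article in sims.columns:
        recs[article] = sims[article].drop(article).nlargest(top_n).index.tolist()
    return recs

# In production this would run on a schedule (e.g. an hourly cron or
# workflow task) and write its output somewhere the front-end module can read.
```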

How will we know the experiment was successful (or not)?

We’ve designed the module with analytics tracking to monitor key metrics like click-through rate and dwell time on recommended articles. We will run the experiment for roughly two weeks, which we’ve calculated is the bare minimum, given our partner’s daily readership, to reach statistical significance. We will know we were successful if the new module shows greater engagement than the one it replaces.
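
For a sense of how that minimum duration falls out of a readership number, here is an illustrative power calculation; the baseline click-through rate, detectable lift, and traffic figures below are placeholders rather than our partner’s actual numbers.

```python
# Illustrative back-of-the-envelope for the "bare minimum" duration;
# all of the input numbers below are placeholders.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_ctr = 0.02          # assumed click-through rate of the old module
expected_ctr = 0.025         # smallest lift we care about detecting
daily_readers_per_arm = 500  # assumed readers per variant per day

# Cohen's h effect size for the two proportions.
effect = proportion_effectsize(expected_ctr, baseline_ctr)

# Readers needed per arm for a two-sided test at alpha=0.05, power=0.8.
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, ratio=1.0
)

days_needed = n_per_arm / daily_readers_per_arm
print(f"~{n_per_arm:,.0f} readers per arm, i.e. about {days_needed:.0f} days")
```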

In the short term, success means increased reader engagement. We’ll analyze weekly data over a month-long period to ensure we’ve gained enough knowledge to inform future modules. A longer-term metric we plan to watch closely is how often readers donate after a recommendation-based experience.
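
As a rough illustration, the engagement metrics mentioned above (click-through rate and dwell time) could be rolled up from the tracked events along these lines; the event schema here is hypothetical and much simpler than what the real tracking emits.

```python
# A hypothetical sketch of the engagement roll-up; the actual event
# schema coming out of our tracking is richer than this.
import pandas as pd

def engagement_metrics(events: pd.DataFrame) -> pd.Series:
    """Compute click-through rate and mean dwell time from raw events.

    Expects one row per event with columns:
      event_type  -- 'module_impression', 'module_click', or 'article_ping'
      dwell_secs  -- time on the recommended article (for 'article_ping' rows)
    """
    impressions = (events["event_type"] == "module_impression").sum()
    clicks = (events["event_type"] == "module_click").sum()
    dwell = events.loc[events["event_type"] == "article_ping", "dwell_secs"]

    return pd.Series({
        "click_through_rate": clicks / impressions if impressions else 0.0,
        "mean_dwell_secs": dwell.mean() if len(dwell) else 0.0,
    })
```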

How long will we be running the experiment and how will it scale?

We are planning to run the experiment for at least two weeks to get a sense of how the module affects the user experience and engagement. We will employ a canary deployment strategy, launching to a small batch of users (0.1%) at first in order to reduce risk and make it easy to roll back in case of errors. We will also test our optimization framework using an A/A test, serving two identical variants; the results should come back flat, confirming that the framework itself is statistically fair. Then we will make refinements as necessary and scale the approach by inviting other newsrooms to participate, deploying similar recommendation modules on their sites. We made sure to design as simple a system as possible, with versatile modeling and few dependencies on specific kinds of data, so that it can be reused and redeployed on other, similar news websites.
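
To illustrate how a 0.1% canary can be both stable and easy to roll back, here is a sketch of deterministic user bucketing; the salt, threshold, and function names are illustrative rather than our actual configuration.

```python
# A minimal sketch of the deterministic user bucketing behind a canary
# rollout; the salt and rollout fraction are illustrative.
import hashlib

CANARY_FRACTION = 0.001  # 0.1% of users see the new module first

def in_canary(user_id: str, salt: str = "recs-canary-v1") -> bool:
    """Stable assignment: the same user always lands in the same bucket."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    # Map the hash to a number in [0, 1] and compare to the rollout fraction.
    bucket = int(digest[:8], 16) / 0xFFFFFFFF
    return bucket < CANARY_FRACTION

# Rolling back is then just setting CANARY_FRACTION to 0 (or flipping a flag),
# and scaling up is raising it once the canary looks healthy.
```

The same bucketing logic can serve the A/A test: assign two buckets the identical experience, and any “difference” the analysis reports should be statistically indistinguishable from zero.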

Finally, we’re excited to test out multi-armed bandits to optimize some of our model’s parameters as we deploy it on the site. This class of algorithm is worth knowing about as an alternative to A/B testing: it optimizes algorithmic decision-making automatically over time, without the need for a human to assess experiments and choose between options (option A or option B in A/B testing). However, this comes with the risk of over-optimizing for a specific target metric, so it’s important that any multi-armed bandit deployment be paired with proper monitoring of the system as it evolves.
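
For readers new to the idea, here is a toy Thompson-sampling bandit choosing between two module variants based on observed clicks; it is a sketch of the general technique, not the optimization we will actually ship.

```python
# A toy Thompson-sampling bandit over two hypothetical module variants.
import random

class ThompsonBandit:
    def __init__(self, variants):
        # Beta(1, 1) priors: one [successes, failures] pair per variant.
        self.stats = {v: [1, 1] for v in variants}

    def choose(self) -> str:
        """Sample a plausible CTR for each variant and pick the highest."""
        return max(
            self.stats,
            key=lambda v: random.betavariate(self.stats[v][0], self.stats[v][1]),
        )

    def update(self, variant: str, clicked: bool) -> None:
        """Record the observed reward (a click) for the variant that was shown."""
        self.stats[variant][0 if clicked else 1] += 1

bandit = ThompsonBandit(["recent_articles", "personalized_recs"])
shown = bandit.choose()
# ...render the chosen variant, observe whether the reader clicked...
bandit.update(shown, clicked=True)
```

Over time the sampler shows the better-performing variant more often, which is exactly the behavior that makes monitoring the target metric so important.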

We’ll post again once the experiment has launched and we can share what we’ve learned.
