Consider Netflix. They provide a platform with video content that people watch. What might be their primary KPI for online A/B tests? Engagement, such as minutes watched, # videos watched, or the proportion of videos that people watch for more than 5 minutes? Well, it turns out that the primary metric for their full consumer science A/B tests (as opposed to their initial quality-of-experience tests) is not engagement but retention: do members retain better after the free trial month ends and in subsequent months? The assumption is that if people are happy enough to renew their subscription, they are probably engaging with, and finding value in, the product. Retention, which is one of the core metrics for any subscription service business, makes sense, but it is not clear that such high-level metrics are common in A/B tests. Tests tend to focus on lower-level, specific, behavioral, and actionable metrics (click, share, watch, post, or like) rather than on revenue (coming from buy, renew, or sign up). Which are the right ones to optimize, and under what circumstances?
North Star versus Sign Posts
It feels as if there is a qualitative difference between a metric such as “retention” and one such as “watch video.” One perspective is that they are strategic metrics versus tactical metrics. You can directly drive watch video, a tactical metric, but it is generally harder to directly drive retention, a strategic metric, except in circumstances closely tied to that action, such as optimizing the checkout flow, pricing, and sign up pages. Michael Korcuska likes to call them “North Star” (strategic) versus “sign post” (tactical) metrics.
Here are some examples of North Star (“NS”) versus sign post (“SP”) metrics for different business models:
An interesting example is subscription services with a very clear achieved / not achieved goal: e.g., in serious long-term dating apps such as Match or eHarmony (rather than hookup sites), when a member finds a life partner they (in theory) have no further need for the service. Thus, revenue (NS) versus #dates (SP) makes sense, but so does member success (NS) versus #dates (SP), if success and positive word of mouth (a high net promoter score) drive growth through new membership.
Implications of North Star Metrics as A/B test KPIs
For Netflix, what does such a North Star retention metric imply? First, it means that their tests are long: at least one month, or, rather, one billing cycle, sometimes two or three. Second, it means that they have sufficient traffic to detect a significant difference in retention rates. The lower the base rate, the higher the required sample size to be able to detect a difference for a given effect size. Third, it means that they have sufficient instrumentation, good experimental design, and analytical chops to be able to tease out what are the drivers of the retention increase if there are multiple concurrent tests or confounding effects. The driver of a significant retention increase should never be a mystery.
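To make the base-rate point concrete, here is a minimal sketch using the standard normal-approximation sample size formula for comparing two proportions. It reads “effect size” as a relative lift, and assumes a two-sided test at alpha = 0.05 with 80% power. The function name, the defaults, and the example rates are illustrative assumptions, not Netflix figures.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(base_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate members needed per arm to detect a relative lift
    in a proportion metric (two-sided, two-proportion z-test)."""
    p1 = base_rate
    p2 = base_rate * (1 + relative_lift)  # assumed treatment rate
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


# Hypothetical base rates, same 10% relative lift in both cases.
n_low = sample_size_per_arm(base_rate=0.05, relative_lift=0.10)
n_high = sample_size_per_arm(base_rate=0.20, relative_lift=0.10)
print(f"5% base rate:  {n_low} per arm")
print(f"20% base rate: {n_high} per arm")
```

Under these assumptions the lower base rate needs several times more members per arm for the same relative lift, which is why low-traffic sites struggle to test rare outcomes.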
It is worth flipping this around for a second. If you have low traffic, you are probably screwed whatever you do. Whichever metric you have will require a long A/B test to get sufficient sample size. If you have long billing cycles or other long or lagging metrics, this won’t work either. For instance, in recent years, WeWork has been growing their enterprise clients. While smaller, risk-averse startups prefer the flexibility of a month-to-month commitment, it is in the interest of WeWork to incentivize enterprise clients to sign up for longer contracts. Retention on, say, a 12- or 24-month contract is not going to work well as a metric to optimize. The lag is too long.
However, by focusing on cancellation rates, rather than retention per se, we can in fact optimize for that, including the case where the commitment (say, a 90-day initial commitment) is longer than the desired A/B test period (say, a 30-day test). How?
Handling commitments longer than desired A/B test period
Conceptually, it is possible to examine cancellation rates in a 30 day test window when individual initial commitments are longer, say 60 and 90 days. It just requires some good experimental design.
Imagine that members can only cancel at the end of the month. We choose some end of test date (30th November) and start the test about 3 weeks before that, say 7th November. Those first 7 days of the month (Nov 1–7) give any late-paying members from 30th October time to renew. First, throw out any members whose initial possible cancellation date is after the end of the test. That is, don’t include those who just signed up with a 90-day commitment; they are never going to cancel within the test period. Second, use stratified sampling to ensure that the distribution of commitment length and sign up month is controlled for among the treatment and control channels (see figure below). With such a design, one can measure cancellation rates between treatment and control, and dive into any differences among the commitment lengths and initial sign up months.
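The two steps above (drop members who cannot cancel within the window, then balance by stratum) can be sketched as follows. This is an illustration under assumptions: the member record fields (`id`, `commitment_days`, `signup_month`, `first_cancel_date`), the function name, and the 50/50 split are all hypothetical, and a real system would read these from its billing data.

```python
import random
from collections import defaultdict
from datetime import date


def assign_stratified(members, test_end, seed=0):
    """Return {member_id: 'treatment' | 'control'} for eligible members.

    members: list of dicts with hypothetical keys 'id',
    'commitment_days', 'signup_month', and 'first_cancel_date'.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for m in members:
        # Step 1: exclude anyone who cannot cancel before the test ends.
        if m["first_cancel_date"] > test_end:
            continue
        # Step 2: group by commitment length and sign up month.
        strata[(m["commitment_days"], m["signup_month"])].append(m["id"])

    assignment = {}
    for key in sorted(strata):
        ids = strata[key]
        rng.shuffle(ids)
        half = len(ids) // 2  # odd-sized strata give control the extra member
        for i, member_id in enumerate(ids):
            assignment[member_id] = "treatment" if i < half else "control"
    return assignment
```

Because each stratum is split separately, treatment and control end up with the same mix of commitment lengths and sign up months, so a difference in cancellation rates cannot be explained by that mix alone.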
Implications of Sign Post Metrics
In an e-commerce website checkout flow, one almost certainly wouldn’t optimize for add to basket; one would likely optimize for purchase. That’s what you ultimately want to drive in that flow. There are several reasons behind this:
i) Perverse incentives
How do you make a bandstand? Take away their chairs!
It might be very easy to encourage people to add to basket (but not necessarily drive purchases), by removing other features. For instance, Amazon has “recently viewed” and “save for later” where users can keep track of what they have recently viewed or interacted with. Take those features away and you might see people using add to cart as a holding area but perhaps without strong intent to purchase. Obviously, you don’t want any friction to adding to basket as that will prevent checkout but optimizing add to cart does not necessarily mean optimizing for purchase (see asymmetry in matrix below).
Another example: it is very easy to encourage sign ups when you really want to drive revenue. How easy? Give everyone coupons, discounts, or other incentives to sign up. That doesn’t mean that those new signups are going to convert. (This is the same situation as hotels above: you can drive bookings but not drive revenue through heavy discounting and promotion.) Moreover, if there is a cost to membership, or to a service purchased in advance such as a hotel booking or airline seat, then ultimately the whole exercise costs more than not trying to increase signups.
Thus, you might drive perverse behavior by focussing on the actions that a team can directly control and drive (and be rewarded for). Kerr provides a number of examples of “fouled up” incentives.
ii) A Possible Solution
Imagine a situation in which one team, the visitor site team, is responsible for driving traffic to a visitor site. A second, downstream team, the sign up team, is responsible for sign ups.
If you make the metric to optimize for both teams a revenue metric such as conversion, what may happen? Finger pointing: the visitor team will complain that any decrease in conversion is not their fault but the result of bad decisions and changes by the sign up team. Conversely, the downstream sign up team will complain that the upstream visitor site team is sending poor quality leads. Thus, this metric alone will not work.
OK, what if you make the visitor site team’s metric the thing that they can control: # visitors? What will likely happen is that they are then incentivized to favor quantity over quality. The sign up team will then be grumpy because they can’t convert these leads. Thus, you cannot have this upstream team accountable only for the metric that they can directly control.
Here is a possible solution to consider:
What if you could carefully control and coordinate A/B tests between the two teams?
- the visitor site has two experiences: current state (A_visitor) and some new experience such as high volume, low quality (B_visitor).
- these two streams of visitors flow down to the sign up experience.
- the sign up team has two experiences: current state (A_signup) and some other experience (B_signup).
That is, this is a typical two A/B test situation, chained together sequentially.
Comparing results of conversion between the A_visitor → A_signup (AA) flow versus B_visitor → A_signup (BA) flow will tell you if the visitor team are having an ultimate effect on the business. That is, even though they only directly control volume to sign up, and not what happens downstream, you could assess them on sign ups, so long as A_signup is constant between A_visitor and B_visitor. That covers the visitor team. For the sign up team, it is more straightforward: comparing A_signup versus B_signup will tell you if the signup team are having an effect on the business.
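As a rough sketch of those two comparisons, the snippet below applies a pooled two-proportion z-test to the four cells of the chained design. All counts are made up for illustration, the cell labels (AA, AB, BA, BB) follow the visitor → signup naming above, and pooling the two visitor streams for the sign up comparison is an assumption that requires the streams to be split the same way across both signup experiences.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided pooled z-test for a difference in conversion rates.
    Returns (z, p_value); z > 0 means group b converts better."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical (conversions, visitors) per visitor -> signup cell.
cells = {
    "AA": (500, 10_000),
    "AB": (540, 10_000),
    "BA": (430, 10_000),
    "BB": (470, 10_000),
}

# Visitor team: hold the signup experience at A, vary the visitor experience.
z_visitor, p_visitor = two_proportion_z_test(*cells["AA"], *cells["BA"])

# Sign up team: A_signup versus B_signup, pooled over both visitor streams.
signup_a = tuple(x + y for x, y in zip(cells["AA"], cells["BA"]))
signup_b = tuple(x + y for x, y in zip(cells["AB"], cells["BB"]))
z_signup, p_signup = two_proportion_z_test(*signup_a, *signup_b)

print(f"visitor effect (AA vs BA): z={z_visitor:.2f}, p={p_visitor:.4f}")
print(f"signup effect (A vs B):    z={z_signup:.2f}, p={p_signup:.4f}")
```

The point of the design is that each team is judged on a downstream outcome while the other team’s experience is held fixed, so neither can blame the other for the result.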
iii) Optimize for wrong thing
When focussing on a sign post metric, there is no guarantee that the actions you are optimizing for will have the desired effect or are the right thing to drive. You are likely to be more correct about a North Star metric than a sign post metric, as North Stars will be central to the experience or business model.
- When Coke was testing New Coke (in the 1980s) [Webber, 2006], they did lots of user testing using a sip test. People preferred it to both Coke and Pepsi. When they launched the product, it was a huge flop. Customers thought it too sweet and cloying. Why? We tolerate much sweeter foods in sips than in gulps. The sip test gave the wrong signal about how people consume the product in non-test conditions. It was the wrong metric: it optimized for sips, not sales.
- Another example [this is from Elizabeth Churchill’s keynote at RecSys 2018]: Yahoo has been around a long time, since before video hit the web. Their early mindset was ad revenue and clicks. When video first came along, they optimized for clicks. That is what the organization was used to doing. It took them a while to realize that when people are most engaged with video, they are watching the content and not doing other activities such as clicking. Engagement was inversely correlated with clicks.
A team that focuses too heavily on a sign post metric might fail to step back, see the larger perspective, and identify and tackle larger opportunities. For instance, in a social network, people might be more likely to follow members who have a profile picture than those who don’t. (Perhaps blank, default profiles look sterile, non-serious, and not worth following.) However, perhaps the addition of a “who to follow” recommender has a significantly larger effect on following behavior than adding a photo. You don’t want to miss the chance to find and explore those opportunities. A higher-level metric should at least signal to the team that all avenues are open and worth considering.
Of course, this potential depends heavily on the team structure and their role and responsibility. If a team only controls a small portion of an app, a smaller experience such as a profile page, then that is their world and other opportunities might be outside their control. Thus, it is the responsibility of those with higher-level viewpoints and responsibility (C-suite, VPs, decision makers, PMs, etc.) to take stock of the complete scope of potential drivers.
Ideally, you will have these different teams all working on what they can control. If, however, these different approaches are within the control of a single team, then one needs to make sure that they have the ability to optimize how they allocate their resources. “Increase number of profiles with profile photos” as a given KPI signals a particular strategy, especially if that is done post hoc after qualitative or quantitative research. Asking the team to optimize a North Star metric, however, leaves it much more open.
In summary, sign post metrics have a definite place. You always need to understand drivers, and sign posts tend to be more direct, behavioral metrics that people can measure, understand, and more easily design feature enhancements for. However, they have the potential to be misleading: you might successfully drive that metric without impacting the business’s bottom line. The business, however, will only thrive if its North Star metrics are headed in the right direction. They are harder, but not impossible, to control and test, but if one can do so, teams can then see and measure the direct impact on the business. As Galileo Galilei quipped, “Measure what is measurable, and make measurable what is not so.”