Why we use experimentation quality as the main KPI for our experimentation platform
At Booking.com, we are proud of our data-driven culture and extensive use of A/B testing to drive product decisions. Over the last 16 years, we have continuously invested in building the best possible experimentation platform. We have scaled and democratized experimentation to the point where today most product changes across all of Booking.com are first exposed to end-users through our experimentation platform.
Product teams use our experimentation platform as input to guide product decisions and measure impact on their key metrics. However, as the product team in charge of the experimentation platform, how do we in turn make decisions on improving our own product, the experimentation platform? How do we show and quantify our impact on the business? What should be our north star metric, our KPI?
Finding a good measure of experimentation success
A natural KPI for a tool is its usage or market share. For an A/B testing platform, this could be the number of experiments per week, or the percent of product teams running tests. This KPI is particularly important when a company is starting out with experimentation, as a measure of the growth in experimentation culture and traction of the tool. While we do track this ‘experimentation market share’, it is not our main KPI today. Since Booking.com’s experimentation culture is already quite mature, it is more of a vanity metric — somewhat equivalent to our product teams’ velocity. It tells us very little about how good our experimentation platform is or what value it brings to our business and our users.
Another commonly used metric is user satisfaction. Do experimenters like our experimentation platform? Do they find it easy to use? This is certainly an important metric: If the tool isn’t usable, people will not use it. While it is always nice to make the users of our platform happy, we no longer directly track user satisfaction. As an internal tool, we are in constant contact with our users — our colleagues — via support, training, informal chats, etc. However, in general, we believe that user satisfaction does not get to the core of what an experimentation platform should do. Although we want our users to be happy, user satisfaction alone is not our main driver.
The most important aspect of product success is that it solves its core customer problem. In our case, this core problem is making reliable evidence-based product decisions using experimentation, both on the tactical and strategic level. What really matters to us is not how many product decisions are made, nor how fast decisions are made, but how good those decisions are.
While using experimentation as part of product development and decision making is common practice nowadays, it does not, by itself, guarantee that good decisions are made. Executing experiments correctly can be difficult, and the data obtained from an experiment is only as reliable as the execution of the experiment itself. Running bad experiments is just a very expensive and convoluted way to make unreliable decisions.
So how can we measure whether our experimentation platform facilitates good product decisions? What is a “good” product decision in the first place?
So how do we measure good product decisions?
We believe that good product decisions are those which support our company values and objectives. We also believe that decisions should be based on reliable evidence. To achieve these goals, we want to ensure that experiment decisions are made through a standardized (but not necessarily enforced) process which reflects our values and helps us eliminate biases and subjectivity, so that those decisions are trustworthy and repeatable.
We understand that this standardized process will not be perfect, but we aim for deviations from the process to be the exception rather than the rule. By measuring the extent to which experiment decisions made by product teams are in accordance with the agreed-upon process and decision rules, we can quantify how well we are meeting these decision quality standards.
For example, standard experimentation protocol requires that we specify in advance, as part of a hypothesis, which metrics we will consider when making a decision, and in what direction we expect to see those metrics move. Our experiment platform allows product teams to register these metrics and expectations as part of the experiment configuration. If our observations on the metrics at decision time do not match the pre-registered expectations, then they cannot serve as evidence to support our original hypothesis. If the product team nevertheless decides to ship the experiment despite this lack of supporting evidence, the decision violates the standardized hypothesis testing protocol. While this might not be a major issue for individual experiments or decisions, it can, on aggregate, put our overall decision quality at risk.
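To make this concrete, the protocol check described above could be sketched roughly as follows. All names and data shapes here are illustrative assumptions for the sake of the example, not the actual API of our platform:

```python
# Hypothetical sketch: checking observed metric movements against
# pre-registered expectations. Names and structures are illustrative.

def evidence_supports_hypothesis(expectations, observations):
    """Return True only if every pre-registered metric moved in the
    expected direction with a statistically significant result."""
    for metric, expected_direction in expectations.items():
        obs = observations.get(metric)
        if obs is None or not obs["significant"]:
            return False  # no reliable signal for this metric
        observed_direction = "up" if obs["delta"] > 0 else "down"
        if observed_direction != expected_direction:
            return False  # metric moved against the registered expectation
    return True

expectations = {"conversion_rate": "up", "cancellation_rate": "down"}
observations = {
    "conversion_rate": {"delta": 0.004, "significant": True},
    "cancellation_rate": {"delta": 0.001, "significant": True},  # moved up
}
print(evidence_supports_hypothesis(expectations, observations))  # False
```

Shipping a change when this check returns False is exactly the kind of protocol deviation the quality KPI is designed to surface.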
For the initial version of our experiment decision quality KPI, we defined a limited set of rules in three different categories (somewhat similar to the three key checklists defined in this paper) that we felt were most pressing to address.
The three categories we are currently most interested in are Design, Execution, and Shipping:
- The Design category checks for things which happen before the start of an experiment. For example, we check whether a power calculation was done, and whether the expected outcomes on decision-making metrics were pre-registered.
- The Execution category is mainly about the planned experiment duration and the adherence to that plan.
- Finally, Shipping validates that the decision is in line with the shipping criteria.
We understand that doing well in these three categories does not necessarily mean the experiment is perfect (this is not our aim here), but simply that it does well against our current definition of quality.
Rule adherence in these three categories is aggregated into a single three-point rating that indicates the relative quality of our experiment decisions. These three-point ratings can then be aggregated to team or department level to track performance over time.
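A minimal sketch of what such an aggregation could look like is below. The three categories match the ones above, but the specific scoring thresholds, rule checks, and data shapes are assumptions made for illustration, not our actual implementation:

```python
# Illustrative aggregation of per-category rule adherence into a
# three-point rating. Thresholds and data shapes are assumptions.

def rate_experiment(rule_results):
    """rule_results maps each category (Design/Execution/Shipping) to a
    list of booleans, one per decision rule. Returns 3 if every rule
    passed, 2 if exactly one category had a failure, 1 otherwise."""
    failed_categories = sum(
        1 for checks in rule_results.values() if not all(checks)
    )
    if failed_categories == 0:
        return 3
    if failed_categories == 1:
        return 2
    return 1

experiment = {
    "Design": [True, True],   # power calculation done, metrics pre-registered
    "Execution": [True],      # ran for the planned duration
    "Shipping": [False],      # shipped despite unmet shipping criteria
}
print(rate_experiment(experiment))  # 2
```

Rolling the per-experiment ratings up to a team or department is then a simple average (or distribution) over that group's experiments for a given period.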
This definition of quality as adherence to a system of rules is naturally flexible and easily extensible. In later iterations of the KPI, we plan to expand upon these categories by adding more decision rules, as needed, to reflect our ever-evolving practice of product decisions.
How do we use this KPI?
Our experiment decision quality KPI can be used in several ways — firstly, as an internal KPI for the product team in charge of building the experimentation platform. We track this KPI over time to see whether our overall performance is increasing or decreasing. This gives us a sense of how the behaviour of our users is shifting as the organisation around us evolves. We can break down the results for different departments to identify potential areas of the organization that need more attention and support.
We can also tease out which specific rules have the biggest impact on the overall score. This helps us identify which parts of the experimentation protocol our users are struggling with the most, which in turn helps us identify potential areas of improvement for our tooling. We can then make changes to our experimentation platform and monitor whether changes in the platform increase adherence to specific rules. In a future blog post, we will dive a bit deeper and give examples of how we have used this quality KPI to improve our experimentation platform and ultimately to facilitate better product decisions throughout the company.
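One simple way to tease out the most impactful rules is to count rule failures across a set of historical experiments. The rule names and record format below are hypothetical, chosen only to illustrate the idea:

```python
from collections import Counter

# Hypothetical sketch: ranking decision rules by how often they fail
# across a set of experiments, to find where experimenters struggle most.

def most_violated_rules(experiments):
    """experiments: iterable of {rule_name: passed} dicts.
    Returns (rule, failure_count) pairs, most-failed first."""
    failures = Counter(
        rule
        for record in experiments
        for rule, passed in record.items()
        if not passed
    )
    return failures.most_common()

history = [
    {"power_calculation": False, "preregistered_metrics": True},
    {"power_calculation": False, "preregistered_metrics": False},
    {"power_calculation": True, "preregistered_metrics": True},
]
print(most_violated_rules(history))
# [('power_calculation', 2), ('preregistered_metrics', 1)]
```

A rule that dominates this ranking is a strong hint that the tooling around it (for example, the power calculation workflow) needs attention, rather than the individual teams failing it.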
Secondly, we can use this KPI as a user-facing indicator. To help the users of our platform (our colleagues), we can give them direct feedback on their performance against this KPI. Feedback can include the specific rules that most affected their score, helping them understand how they can improve the quality of their decisions. Departments, teams or individuals might even choose to define objectives around this metric if they feel that this is necessary. Leadership can also use this indicator to quickly identify individual experiment decisions that might require additional review.
As mentioned before, this is an early iteration of an ever-evolving definition of quality with a focus on what matters to us most at the moment. We plan to add more rules to the system based on insights gleaned from the use of the KPI in practice. For example, changes made to our experimentation platform that were informed by the KPI might in turn result in changes to the definition of the KPI itself.
The expectation is not that this KPI will be a perfect measure of experiment decision quality. Perfect is the enemy of good. Instead, this KPI helps us not only improve our own experimentation platform but also the overall quality of all product decisions made using experimentation, including our understanding of what constitutes a good quality decision in the first place.
The ideas expressed in this post were greatly influenced by our work on the in-house experimentation platform at Booking.com, as well as conversations with other online experimentation practitioners. A big thank you to all our current and former colleagues who have contributed in one way or another. Thanks to Carolin Grahlmann, Lin Jia, Liam Furman & Nils Skotara for their work on experimentation quality. Thank you Jadvinder Daudhria for the designs and Kristofer Barber for reviewing this article.