Automatic experimentation at Turo

Pierre Raccaglia
Published in Turo Engineering
Sep 6, 2018 · 11 min read

“How come I don’t see this feature on my Facebook and you do?” is probably the most common contact we have with experimentation in our everyday life. Wondering why your friends get to use that fancy filter on Instagram and you don’t? Why you can search by map on Airbnb, but your cousin can’t? Why your burrito is a complete mess while your colleague’s is perfectly rolled? (Okay, I admit, that one may not be related.)

You said experimentation?

All of this has to do with differences in treatment, also called experimentation. You might also have heard the term A/B testing, an incomplete but efficient summary of the concept: randomly serving customers different experiences so we can compare the resulting changes in behavior and identify the winning strategy.

If Instagram offers you a new filter, will you be lost among all those options and quit the app, or will it increase your appetite to share that imperfectly rolled burrito with your followers? Depending on the result of the experiment, the company can decide what to enshrine as its new standard.

Even if it seems like the obvious thing to do when you want to know what suits your customers best, there are a few caveats to this approach.

First, serving different treatments to different users can be more complex than it seems in terms of architecture. Second, defining the metrics we will act upon can be tricky. Finally, declaring a winner, a loser, or even that both treatments are equal is often problematic.

Defining the experiment: architecture and metrics

An experiment can be summarized by its design and metrics. The design will incorporate different parameters, but mainly:

  • what is the difference in terms of experience (such as a UI change) between treatment A and B, B and C, etc.
  • what is the population in bucket A, B, etc.

On the metrics side, we define the values that we want to look at to be able to select a winning treatment. Instagram might go with number of stories, while our burrito maker will certainly pick revenue as his main metric.

At Turo, the design involves product managers and product engineers, while metric selection involves product managers and data analysts or scientists. But the real interest of this post lies in a third point we have not explored yet: how do we decide whether a treatment overperforms or underperforms on a given set of metrics?

Analyzing the experiment

A variety of software exists for A/B testing and, more generally, experimentation. Tools such as Optimizely, Google Experiment or Split are used daily by lots of companies, including Turo. For years at Turo, monitoring tests meant ad hoc analyses by data science, with many key steps duplicated for each test. Of course, few processes are needed, or even recommended, while you’re a small business with limited experimentation needs (hello, burrito scientists!).

On the other hand, once you find yourself leading your market, with a fully functional website and multiple tests launched each week, you start thinking about experimentation in a much more rigorous way, and the need for automation becomes real.

In such a situation, the most important thing is to iterate quickly while proving that the new features and changes you bring to your product have a positive impact. Your experimentation game has to be leveled up: that’s exactly what we did last year with the creation of our internal experimentation platform, TuroAB.

Experimenting at Turo

We’ve come a long way

Let’s go back in time. Summer 2017. No real experimentation framework is available for the teams at Turo. Every experiment is handled by a data scientist or analyst. This is how it usually worked back then:

  • a stakeholder decides to tackle a specific project on the roadmap
  • with their team, they start working on the new feature
  • they prepare everything for shipment
  • they come to a data scientist or analyst, asking for tracking of the experiment
  • depending on bandwidth, that person pulls up a Jupyter notebook to be manually refreshed and screenshotted (with no notion of timing)
  • the Jupyter notebook may also be shared with stakeholders to show the calculations, which leads to obvious problems of co-editing or, even worse, incorrect conclusions

As you may expect, this is not scalable, for multiple reasons:

  • you need to re-compute all the metrics for each test, without ensuring consistency
  • manually refreshing the data every day leads to inconsistencies in timing and data updates
  • test design is one of the most important things in experimentation, and it is critical to align, before implementation, on a set of metrics that will determine a potential positive impact. If you are launching your feature to see what happens, you will always succeed at seeing what happens.
  • experimentation is for everyone. There is experimentation in product engineering, design, marketing, finance, trust and safety, customer support, etc. To support all of them, you need a reliable and easy-to-use tool to spread the word.

From those observations, the data science team decided to take a step back and invest time and effort in automating experimentation across the entire organization.

How a unified platform with fixed rules helps us build the future of our platform

TuroAB is a number of things hidden behind a very simple web app that displays the results of all present and past experiments on our platform.

A call for unified test designs

First and foremost, building TuroAB has been a great opportunity to get everyone together and aligned on experimentation. Turo has always had guidelines for testing, but having a real testing framework from start to finish helps bring structure to this important piece of business development. Architecture, naming, bucketing, and tracking were all discussed, refined, and unified in order to be as efficient and clear as possible.
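To give a concrete idea of what unified bucketing can look like, here is a minimal sketch of deterministic bucket assignment. This is purely illustrative, not TuroAB’s actual implementation; the function name, experiment name and bucket weights are invented.

```python
import hashlib

def assign_bucket(user_id, experiment, weights):
    """Deterministically assign a user to a bucket for a given experiment.

    Hashing the user id together with the experiment name keeps each
    user's assignment stable across sessions while keeping assignments
    independent between experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    point = (int(digest, 16) % 10_000) / 10_000  # pseudo-uniform value in [0, 1)

    cumulative = 0.0
    for bucket, weight in weights.items():
        cumulative += weight
        if point < cumulative:
            return bucket
    return list(weights)[-1]  # guard against floating-point rounding

# A 50/50 split between control (A) and treatment (B)
print(assign_bucket("user-42", "purple_buttons", {"A": 0.5, "B": 0.5}))
```

The key property is that the same user always lands in the same bucket for the same experiment, which is what makes the downstream comparison of metrics valid.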

A harmonization of the metrics

Coming back to the first rule of testing: we need to compare comparable things. This is why metrics are crucial to a good and deep understanding of the results of an experiment. A well-defined target for a well-defined metric can save months, which, at our scale, can represent a significant amount of revenue.

In terms of metrics, we had two main milestones in our approach. First, we built a unified table regrouping all the metrics available for testing on a given user. We were thinking about optimization, and obviously wanted to avoid recomputing the same metric if two tests were running at the same time. Indeed, defining a metric outside of any single test is the more robust and secure way to prevent false comparisons or errors in metric manipulation. But we rapidly faced a growing problem: our metrics were not set in stone, and most of them were very specific to each experiment.

We then adopted a second idea, which is the one currently in use. Instead of querying a large, unique table that summarizes everything and prevents bad queries, we built a custom query generator that harmonizes our way of computing even the most granular metrics. In concrete terms, for any metric, filter or partition we want, the query generator gives us the tables to query, how to join them, and how to compute the metrics. This allows for more flexibility in querying, more adaptability to specific requests, and easier integration with third-party data we might want to use in less conventional experiments (experimenting based on Google or Facebook analytics, experimenting with a new integration that is not yet fully present in our database, etc.).
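As a rough illustration of the idea (not Turo’s actual schema or generator; the table and column names below are invented), such a query generator can be as simple as a function that maps a metric, optional filters and a partition to the SQL needed to compare buckets:

```python
# Toy sketch of a metric query generator. Table and column names
# (bucket_assignments, trips, etc.) are invented for illustration.

METRICS = {
    # metric name -> (source table, per-bucket aggregation)
    "bookings": ("trips", "COUNT(DISTINCT t.trip_id)"),
    "revenue": ("trips", "COALESCE(SUM(t.total_price), 0)"),
}

def build_query(metric, partition=None, filters=None):
    """Return the SQL comparing buckets on one metric, optionally
    partitioned (e.g. by platform) and filtered."""
    table, aggregation = METRICS[metric]
    partition_col = f"b.{partition}," if partition else ""
    group_by = f", b.{partition}" if partition else ""
    # %(experiment)s is a bind-parameter placeholder filled in at execution time
    where = " AND ".join(["b.experiment = %(experiment)s"] + (filters or []))
    return f"""
        SELECT
            b.bucket,
            {partition_col}
            COUNT(DISTINCT b.user_id) AS users,
            {aggregation} AS {metric}
        FROM bucket_assignments b
        LEFT JOIN {table} t
            ON t.user_id = b.user_id
           AND t.created_at >= b.assigned_at
        WHERE {where}
        GROUP BY b.bucket{group_by}
    """

# Bookings split by platform, restricted to native apps only
print(build_query("bookings", partition="platform",
                  filters=["b.platform IN ('ios', 'android')"]))
```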

Eventually, analyzing the results

Obviously, the most important part couldn’t be avoided: TuroAB is by definition a platform for analyzing the results of ongoing or past experiments. Here is what it looks like in real time:

If you are familiar with AB testing, you might be wondering “what are those fancy graphs doing in the middle of an AB test dashboard? Shouldn’t you show me some p-values and confidence intervals?”. If you are even more familiar with experimentation, you might recognize the Bayesian framework.

What is important to understand here is that once you have built your metrics, split your users into buckets and launched your experiment, you end up with multiple (let’s say 2, for the sake of simplicity) distributions of values. A simple example: Turo changes the background color of all the buttons on the website from green to purple. 50% of people still see green (our control group), and 50% see purple. The main metric we are looking at is bookings (i.e. whether the user ended up completing a trip with Turo). In that case, we have two distributions that can take the form of green [1, 0, 1, 0, 0, 1, 1, …] and purple [1, 1, 0, 0, 1, 1, 1, …].

Let’s say the average value of green is 10% and the average value of purple is 11%. How would you know whether this is due to pure randomness or actually caused by the difference between treatments? The answer lies in what is called statistical analysis and, here too, we tested one method (a t-test) before eventually implementing another (Bayesian inference).

A quick overview of frequentist vs Bayesian

When we first developed TuroAB, we implemented a frequentist approach. The idea is to perform a t-test and compute the probability of observing a difference at least as large as the one we see, assuming the null hypothesis (no difference between the two samples) is true. If that probability is small (it is generally considered small below 5%), we say that we are confident that the two samples are different. This probability is called the p-value.
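As a minimal illustration of that first approach, using made-up numbers for the green/purple example above (not real Turo data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated booking indicators: 1 if the user booked, 0 otherwise.
# Green (control) converts around 10%, purple (treatment) around 11%.
green = rng.binomial(1, 0.10, size=20_000)
purple = rng.binomial(1, 0.11, size=20_000)

# Welch's two-sample t-test on the booking rates
t_stat, p_value = stats.ttest_ind(purple, green, equal_var=False)

print(f"green rate:  {green.mean():.2%}")
print(f"purple rate: {purple.mean():.2%}")
print(f"p-value:     {p_value:.4f}")
# p_value < 0.05 means we reject the null hypothesis of "no difference"
# at the conventional 5% significance level.
```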

This approach has long been considered the standard way of dealing with experimentation, but a few well-identified problems should make you skeptical:

  • The p-value is a hard-to-grasp concept. Many data scientists themselves don’t know the subtleties of its definition exactly, so it is even more challenging to communicate to non-technical audiences. Furthermore, it is often interpreted as the probability of B being better than A, which is wrong.
  • In this framework, the parameter (say, the conversion rate of the group of people you are treating) is considered fixed, and you are trying to estimate it. This approach hence gives you a certain confidence that the parameter lies in an interval, but it does not try to estimate the most likely parameter value given the data.
  • The frequentist approach is not very resilient to the analysis of multiple metrics, multiple buckets, or even the number of times you look at the results (peeking). A proper frequentist test should be penalized for every metric added, every bucket added, and every time someone looks at the test to gain some insight.
  • No prior knowledge can be leveraged to accelerate our decision making.

Those main obstacles can be avoided by opting for the Bayesian approach. Essentially, this approach treats the parameters as random variables and gives us the probability of treatment B being better than A (or control). Unlike the frequentist approach, this is the exact concept of probability we are all familiar with: applying Bayes’ formula, we obtain the posterior probability distribution, which is an update of the prior distribution given the newly collected data.

Even if it is not widely used in TuroAB yet, the prior actually represents our belief about the shape of the distribution before the experiment even launches (if no assumption is made, we call it a “flat” prior, which is not that rare in our everyday analysis). If I expect a drop in conversion, I can state it explicitly in the formula, and if that drop actually happens, the formula will save me some time. If it doesn’t, I will only lose time, but I will not enshrine the wrong treatment (at worst I will fail to detect a winner and get a false negative).
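To make this concrete, here is a small sketch of a Bayesian update for the green/purple example, using a Beta-Binomial model with illustrative numbers (this is not the actual TuroAB code or data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Observed data from the (hypothetical) green/purple experiment
green_bookings, green_users = 2_000, 20_000    # ~10% conversion
purple_bookings, purple_users = 2_200, 20_000  # ~11% conversion

# Beta(1, 1) is the "flat" prior mentioned above: no assumption about
# the conversion rate before launch. An informative prior such as
# Beta(10, 90), centered around 10%, would encode prior belief instead.
prior_a, prior_b = 1, 1

# With a Beta prior and binomial data, the posterior is also a Beta
# distribution (conjugacy), so updating is just addition. We sample
# from both posteriors in order to compare them.
green_post = rng.beta(prior_a + green_bookings,
                      prior_b + green_users - green_bookings, size=100_000)
purple_post = rng.beta(prior_a + purple_bookings,
                       prior_b + purple_users - purple_bookings, size=100_000)

# A probability we can communicate directly, unlike a p-value
print(f"P(purple beats green) = {(purple_post > green_post).mean():.1%}")
```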

You get it, we decided to implement the Bayesian framework in our experimentation platform.

What is featured in TuroAB?

TuroAB is primarily a platform made for product and exec teams, so they can make decisions based on what they see. They have access to all experiments, past and present, from a single drop-down menu. Once inside an experiment, they can select the treatments they want to compare (if you have multiple treatments, you may want to compare A vs. C, for instance), as well as different partitions. Partitions can be really useful when analyzing the results. For example, you can show only the results for users on native apps, or for customers coming from Facebook. It is also a great feature if you are considering targeting certain segments of your population with specific treatments after the end of the experiment, instead of rolling the change out to everyone.

Once all those filters have been applied (experiment, buckets, partition), the platform displays three types of data, for each metric included in the test. Let’s take the example of the metric tracking trips that have been rated 5 stars:

  • The posterior distributions, which are a clear characterization of the result of the experiment. The more they overlap, the less confident you can be that the treatments are different.
  • The distribution of the difference, whose mode (the value at which the density peaks) is the most likely value of the difference between treatments.
  • The actual data in summarized form, so that all the estimates and probabilities can be taken in at a glance.
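Continuing the hypothetical Beta-Binomial sketch from above (again, not the actual TuroAB code), those three views could be derived from the posterior samples roughly like this:

```python
import numpy as np

# green_post and purple_post are the posterior samples from the
# previous sketch.

# 1. The posterior distributions themselves would be plotted as density
#    curves of green_post and purple_post; the more they overlap, the
#    weaker the evidence of a real difference.

# 2. The distribution of the difference, with its mode estimated from a
#    histogram as the most likely value of the lift.
diff = purple_post - green_post
counts, edges = np.histogram(diff, bins=200)
peak = counts.argmax()
mode = (edges[peak] + edges[peak + 1]) / 2

# 3. A compact summary of the estimates and probabilities.
low, high = np.percentile(diff, [2.5, 97.5])
print(f"most likely lift:       {mode:+.3%}")
print(f"95% credible interval:  [{low:+.3%}, {high:+.3%}]")
print(f"P(purple beats green):  {(diff > 0).mean():.1%}")
```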

Of course, every experiment comes with both an engineering ticket to track its implementation and a short description, to make the platform as self-serve as possible when it comes to analysis.

Some steps remain

Of course, we are not quite there yet. A lot of things are still under construction in the area of experimentation across the data science, product management, and product engineering teams. Adding more diverse tests (trust and safety, public relations, etc.), incorporating different priors, or even spreading usage further across the company are just the tip of the iceberg. But those are necessary steps to building an even more robust platform and system that will help the company learn and grow in the short and long term. And help burrito makers as well, obviously.
