A/B testing is a common tool to assess the performance of a new feature in data-driven companies. Yet there are many ways to get it wrong, so we felt it was worth sharing our case and our solution.
At BlaBlaCar, we nurture a marketplace comprising drivers posting rides and passengers booking seats on those rides. The interaction between those two populations makes it dangerous to trust the results of a classic randomized A/B test: so-called interferences arise and create a bias in measurements.
Read on to learn more about the problem and the solution we designed. Lyft Engineering posted a series of blog posts entitled Experimentation in a Ridesharing Marketplace that proved a great entry point into the literature about causal inference and interference. Our experimental design has benefited a lot from what they learned, although the contexts and final solutions differ.
How the marketplace works
Drivers define a ride as:
- a departure point
- an arrival point
- a departure time
- a list of intermediate stopovers.
Passengers then perform searches and can book a seat on rides that have stops matching their search query. A drawing goes a long way towards explaining the mechanism:
BlaBlaCar’s marketplace is actually much more complex if you consider pricing, time constraints and additional options, but it’s not necessary to go as far for the purpose of this article.
Introducing a new feature
As a passenger, traveling for hours in the back seat of a full car can be uncomfortable. For this reason, BlaBlaCar introduced “two max in the back”, a flag that a driver can set to indicate… well, that only two passengers can travel in the back seat.
The impact of this feature on the marketplace is hard to predict: “two max in the back” can attract comfort-sensitive passengers, but at the same time limits the number of seats available. Will more passengers travel at the end of the day?
Ideally, we would need to observe and measure performance in two parallel universes: one where drivers can set the “two max in the back” flag and another where drivers can’t.
Unfortunately, we can only observe one universe. But we get to choose the conditions: all drivers can set the flag? None? Only a fraction? All drivers can set the flag, but some passengers can’t see it?
The goal of the experimenter is to select the design that constitutes the best proxy to measure the difference in performance between the two ideal universes (“two max in the back” for all versus “two max in the back” for none).
Measurements in ideal universes
For now, let’s use a toy example to compute the performance of the marketplace with or without “two max in the back”.
Here’s our marketplace: two drivers, D1 and D2, go from city A to city B, and a passenger P wants to do the same. Additionally, D1 and D2 think “two max in the back” is a good idea and set the flag if they are given the choice.
Passenger P can either decide to use another means of transportation, to book a seat in D1’s car, or to book a seat in D2’s car. Let’s assume the following probabilities in this toy example:
In practice of course, these probabilities are unknown.
The expected number of booked rides is thus 0.5 in the control universe where no driver has the feature (first row). In the treatment universe where both drivers have the feature, the expectation is 0.6 (second row).
Hence, the impact of the “two max in the back” feature on the number of booked rides is +20% (from 0.5 to 0.6 expected bookings). This is the uplift that we would like to measure by performing an A/B test.
The classical method for performing an A/B test is to show the new feature only to a fraction of the population, chosen randomly.
In our toy example with the “two max in the back” feature, this means letting some drivers set the flag while others can’t. If D1 has the feature (she is in the treatment group), her expected number of bookings is 0.4. D2 (in the control group) has an expected number of bookings of 0.2. Comparing the two results, we would erroneously claim that the uplift created by “two max in the back” is +100%, when it actually is only +20%! The same happens if control and treatment are inverted.
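The arithmetic behind these two figures can be checked in a few lines. This is only a sketch: the expected bookings (0.5, 0.6, 0.4 and 0.2) are the ones quoted in the text, while the helper name `relative_uplift` is ours.

```python
def relative_uplift(treatment: float, control: float) -> float:
    """Relative change of treatment over control (0.2 means +20%)."""
    return (treatment - control) / control


# Ideal comparison: expected bookings in the two parallel universes.
true_uplift = relative_uplift(treatment=0.6, control=0.5)

# Randomized A/B test: the driver with the flag versus the driver without it.
naive_uplift = relative_uplift(treatment=0.4, control=0.2)

print(f"true uplift:  {true_uplift:+.0%}")   # +20%
print(f"naive uplift: {naive_uplift:+.0%}")  # +100%
```

The naive figure is inflated because part of what the driver with the flag gains is simply taken from the driver without it.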
Learnings from the example
Surely, the example above seems rather contrived: does a full marketplace work the same under randomization? Is it not just a problem of small sample size? Is the probability table not tweaked in order to yield those results?
Indeed, the toy marketplace example serves the only purpose of exposing a measurement bias caused by competition between drivers. But simulating more realistic and bigger marketplaces helps one realize that this behavior is carried over at scale.
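To give a feel for why the bias does not vanish with size, here is a minimal sketch of a larger marketplace. It is not the simulation we ran: the choice model (a passenger picks an option with probability proportional to its weight), the function `booking_rates`, and the values of `boost` and `outside` are all assumptions chosen for illustration.

```python
def booking_rates(n_treated: int, n_control: int, boost: float,
                  outside: float) -> tuple[float, float, float]:
    """Expected bookings per treated driver, per control driver, and in total.

    Assumed choice model: a passenger picks option i with probability
    weight_i / sum(weights). A control ride weighs 1, a treated ride weighs
    `boost`, and "travel by other means" weighs `outside`.
    """
    total_weight = outside + n_treated * boost + n_control * 1.0
    per_treated = boost / total_weight
    per_control = 1.0 / total_weight
    total = (n_treated * boost + n_control) / total_weight
    return per_treated, per_control, total


n_drivers, boost, outside = 1000, 1.5, 1000.0  # hypothetical values

# Two ideal universes: nobody has the feature, then everybody has it.
_, _, all_control = booking_rates(0, n_drivers, boost, outside)
_, _, all_treated = booking_rates(n_drivers, 0, boost, outside)
true_uplift = all_treated / all_control - 1

# Classic randomized test: half of the drivers get the feature.
per_treated, per_control, _ = booking_rates(
    n_drivers // 2, n_drivers // 2, boost, outside)
naive_uplift = per_treated / per_control - 1

print(f"true uplift:  {true_uplift:+.0%}")   # +20%
print(f"naive uplift: {naive_uplift:+.0%}")  # +50%
```

Under this assumed model, the randomized comparison only recovers the ratio of attractiveness between treated and control rides (`boost`), whatever the number of drivers. It ignores the fact that treated rides mostly take bookings away from control rides rather than from other means of transportation.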
What have we learned from the example? Competition between drivers for passengers causes a bias in the measurements made under the classic randomization setup. Here, it takes the form of a zero-sum game: what one driver gains in terms of probability of booking is the loss of another driver.
A different toy example would show that the reverse is also happening in a marketplace: passengers are in competition for the seats on rides offered by drivers. Even more bias!
In statistical parlance, the competition between actors on the marketplace causes interference: control units (e.g. drivers not having the “two max in the back” feature) influence the performance of treatment units (e.g. drivers having the feature), and vice versa.
The emblematic example is that of vaccines for infectious diseases. If a person receives the vaccine, their family and friends benefit from that person being immunized. In general, the treatment received by one subject decreases the probability that another catches the disease, thus creating interference.
For more information on the problem of statistical interference, see the Lyft blog post series Experimentation in a Ridesharing Marketplace.
Better experiment designs
Let us state our objective again: measure the uplift caused by the introduction of a feature on the marketplace. How can we do so without suffering from interference?
The general idea comes from the two-parallel-universe setting we try to approximate: let us create independent sub-marketplaces, and reveal the new feature to some of those marketplaces and not to others. Then by comparing the aggregated performance of treatment marketplaces to that of control marketplaces, we can assess the uplift caused by the new feature.
How can we create independent sub-marketplaces? This is mostly domain-specific, as every marketplace can and should be partitioned differently. Yet another general issue arises when trying to create independent marketplaces: the introduction of selection bias.
To demonstrate selection bias, let’s take this dumb example: two countries are used to test the feature, one is used as control and the other is used as treatment and has the new feature. Comparing the two countries to assess the performance of the feature makes no sense, as the two countries, though independent, may differ in many other ways.
A less dumb example is to alternate the activation of the feature over time over the whole marketplace. Hopefully, there is less selection bias than in the experiment design above. It is however possible that an unexpected event happens in one of the time slots, perturbing the measurement.
This goes to show that to prevent statistical interference, we have to accept a little selection bias. The objective is then to find the best tradeoff.
Finally, there is another implicit requirement: the test must include as many units (be they drivers, passengers, or vaccine subjects) as possible, in order to preserve statistical power.
As a recap, here is the recipe. The objective is to create sub-marketplaces that:
- are approximately independent
- do not suffer from too much selection bias when compared to each other
- collectively cover as much of the marketplace as possible.
Let’s now apply these ideas to BlaBlaCar!
Experiment designs in a carpooling marketplace
At BlaBlaCar, we have chosen to activate features based on rides offered by drivers. Adopting this point of view, interference arises when a passenger’s search results show rides from both control and treatment groups.
In order to segment drivers into non-competing groups, there are two natural dimensions: geography and time. Indeed, trips that are geographically different do not compete, and neither do trips with different departure dates.
As stated above, pure geographical or time segmentations create a sizeable selection bias. As a consequence, we have chosen to mix both dimensions into the experiment design below.
The geographical buckets will be defined below.
Why is this better than pure geo or time segmentations?
- Pure geographical segmentation introduces an obvious selection bias that cannot be mitigated.
- Provided there is some seasonal correlation between the two geographical buckets, the design above suffers less from selection bias due to seasonality than pure time slot alternation.
Note that we cannot use finer time slots than days because passengers tend to be very flexible regarding the departure hour.
The last remaining task is to divide the drivers into non-competing groups based on their geography, the “geo buckets”.
At this point, it is important to note that the stopovers described in the first section of this post impose severe restrictions on how we can segment drivers.
For instance, assume that we divide rides according to their departure point:
- rides departing from northern France constitute geo bucket 1
- rides departing from southern France constitute geo bucket 2.
Imagine a ride going from Paris (northern France) to Montpellier (southern France) and declaring a stopover at Clermont-Ferrand (southern France). This ride is in geo bucket 1, but on the Clermont-Ferrand-Montpellier segment, it may compete against rides from geo bucket 2! Note that we cannot activate and deactivate the A/B-tested feature depending on segments within a ride, otherwise the driver would see the two versions, which is an interference.
Taking this into account, we have designed the following partition based on direction rather than location:
- all rides whose segments are northbound fall in the northbound bucket,
- all rides whose segments are southbound fall in the southbound bucket,
- all rides containing alternating directions are not considered.
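The three rules above can be sketched as a small classifier. This is an illustration under assumptions: we represent a ride by the latitudes of its stops in travel order, and the function name `direction_bucket` and the bucket labels are hypothetical. The rules do not say how a purely east-west segment should be handled; here it sends the ride out of test.

```python
from typing import Sequence


def direction_bucket(latitudes: Sequence[float]) -> str:
    """Classify a ride from the latitudes of its stops, in travel order.

    Returns "northbound" if every segment goes north, "southbound" if every
    segment goes south, and "out_of_test" otherwise (alternating directions,
    or a segment with no change in latitude).
    """
    if len(latitudes) < 2:
        raise ValueError("a ride needs at least a departure and an arrival")
    # One latitude difference per segment between consecutive stops.
    deltas = [b - a for a, b in zip(latitudes, latitudes[1:])]
    if all(d > 0 for d in deltas):
        return "northbound"
    if all(d < 0 for d in deltas):
        return "southbound"
    return "out_of_test"


# Approximate latitudes: Paris 48.86, Clermont-Ferrand 45.78,
# Montpellier 43.61, Lyon 45.76.
print(direction_bucket([48.86, 45.78, 43.61]))  # southbound
print(direction_bucket([43.61, 45.78, 48.86]))  # northbound
print(direction_bucket([48.86, 43.61, 45.76]))  # out_of_test
```

With this rule, the Paris to Montpellier ride with a stopover at Clermont-Ferrand is southbound on every segment, so it only ever shares a bucket with rides it may compete against.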
The strict condition on segments ensures that a passenger has little chance of seeing rides from both buckets in the same search results. Of course, it’s more frequent to see a mix of rides from one bucket and out-of-test rides. But out-of-test rides are few and they only cause indirect competition between the two buckets.
All these properties can be quantified (the lower the better for all):
- proportion of out-of-test rides
- proportion of passengers who see rides from buckets 1 and 2
- proportion of passengers who see rides from a bucket and out-of-test rides.
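As a rough sketch of how these three proportions could be computed for a candidate partition, assume we have the bucket of every ride and, for each search, the buckets of the rides displayed. The data layout, the labels and the function `partition_metrics` are hypothetical, and searches stand in for passengers.

```python
from typing import Iterable, Sequence

IN_TEST = {"bucket_1", "bucket_2"}


def partition_metrics(ride_buckets: Sequence[str],
                      searches: Iterable[Sequence[str]]) -> dict[str, float]:
    """Compute the three proportions (lower is better) for one partition.

    ride_buckets: bucket of each ride ("bucket_1", "bucket_2", "out_of_test").
    searches: for each search, the buckets of the rides shown in the results.
    """
    if not ride_buckets:
        raise ValueError("no rides")
    out_of_test = sum(b == "out_of_test" for b in ride_buckets) / len(ride_buckets)

    n_searches = saw_both = saw_bucket_and_out = 0
    for shown in searches:
        seen = set(shown)
        n_searches += 1
        # Direct interference: both buckets appear in the same results.
        saw_both += IN_TEST <= seen
        # Indirect competition: an in-test ride next to an out-of-test ride.
        saw_bucket_and_out += bool(seen & IN_TEST) and "out_of_test" in seen
    if n_searches == 0:
        raise ValueError("no searches")

    return {
        "out_of_test_rides": out_of_test,
        "searches_with_both_buckets": saw_both / n_searches,
        "searches_with_bucket_and_out_of_test": saw_bucket_and_out / n_searches,
    }


rides = ["bucket_1", "bucket_2", "bucket_1", "out_of_test"]
searches = [
    ["bucket_1", "bucket_1"],
    ["bucket_1", "bucket_2"],
    ["bucket_2", "out_of_test"],
    ["out_of_test"],
]
metrics = partition_metrics(rides, searches)
print(metrics)  # each of the three proportions is 0.25 on this tiny sample
```

Running the same function on each candidate partition gives comparable numbers for the tradeoff between coverage and interference.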
We have thus been able to compare different partitions:
- the partition on the departure point used as an example above
- the direction-based partition explained above
- another partition wherein the country is divided into two areas and cross-area rides are out of test.
The table above indicates that the best tradeoff between selection bias and interference is reached by the direction-based partition.
If you happen to post a ride on BlaBlaCar, depending on the direction of your ride and the parity of the departure date, you may end up in a control or treatment group!
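A minimal sketch of such an assignment follows. Several details are assumptions rather than our production rules: parity is taken on the day count returned by `date.toordinal()` so that groups alternate strictly from one day to the next, the mapping "northbound on even days is treatment" is arbitrary, and the function name `assign_group` is hypothetical.

```python
from datetime import date, timedelta


def assign_group(bucket: str, departure: date) -> str:
    """Hypothetical assignment: the treated bucket alternates every day."""
    if bucket not in ("northbound", "southbound"):
        return "out_of_test"
    even_day = departure.toordinal() % 2 == 0
    # Assumed convention: northbound is treated on even days,
    # southbound on odd days.
    treated_bucket = "northbound" if even_day else "southbound"
    return "treatment" if bucket == treated_bucket else "control"


# Over 14 consecutive days, northbound rides are treated on 7 days,
# one for each day of the week.
start = date(2024, 1, 1)  # arbitrary start date
days = [start + timedelta(days=i) for i in range(14)]
treated_weekdays = sorted(
    d.weekday() for d in days if assign_group("northbound", d) == "treatment")
print(treated_weekdays)  # [0, 1, 2, 3, 4, 5, 6]
```

On any given day the two directions sit in opposite groups, and each direction spends every other day in treatment.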
This non-trivial experiment design enables us to assess the impact of new features with the following properties:
- Little interference (almost only indirect competition through the out-of-test rides)
- Limited geo and seasonal selection biases by mixing the two dimensions
- More than 4 rides out of 5 are inside the experiment.
This design comes with its own set of limitations. The duration of the test must be a multiple of two weeks, so that the control and treatment groups each contain every weekday. Also, some special dates like the first day of school holidays create a selection bias in favor of one of the two groups (e.g. French families flocking down south on July 14th).
Additionally, we would benefit from building a more accurate set of simulation tools to further confirm the validity of the experiment design. As of today, it has been tested in simple simulations and real-life A/A tests.
Nevertheless, this solution has already allowed us to test some major marketplace-level features, with more confidence in the results than ever before.
At BlaBlaCar, we believe investing time into sounder statistical tools and methodology is key to making good product decisions. Don’t hesitate to contact us to discuss this topic or share your experience!