Published in Bolt Labs

Tips and considerations for switchback test designs

Co-written by Garret O’Connell and Carlos Bentes, data scientists in experimentation at Bolt

Switchback is a standard design for A/B tests in industries where interference between units is likely, e.g. two-sided marketplaces such as ride-hailing services (riders and drivers) or e-commerce platforms (sellers and buyers).

In these contexts, the treatment of one rider/buyer can affect the outcome of another, most obviously through changing the availability of drivers/sellers.

There are great introductions and analyses of switchback tests in articles here and here. Here, we want to add to the conversation by flagging issues we’ve encountered when using switchbacks that are applicable in many cases. So, if you’re thinking about conducting switchback tests, or already are, here’s a quick checklist of the topics we’ll cover in this article: sanity checks, assignment tracking, sample size requirements, inter-dependence between units of analysis, imbalanced assignment of time periods, and coordination of tests.

Switchbacks at Bolt

Bolt offers various services like ride-hailing, food delivery, scooters, and car sharing.

To optimise the user experience of these services, Bolt has a centralised A/B testing platform that conducts dozens of tests per day across different units of randomisation (e.g. user, driver, courier) and automatically estimates the effects of feature changes on a wide range of performance metrics.

Switchbacks are used to test features that change supply and demand in the marketplace, such as changes in the dispatching logic for drivers/couriers. These features can’t be tested with the usual user/driver-level randomisation because the treatment received by one unit would likely affect the availability of services for units in the other treatment group, thereby causing interference bias.

Sanity checks

In switchback tests, the units of randomisation are timeslices. While this helps reduce bias due to interference between the service users (e.g. riders, drivers, couriers), analysing at the level of timeslices leads to a heavy loss in power. It’s also generally less easily interpretable in terms of business impact, e.g. knowing that there was a 2% increase in GMV per timeslice is difficult to translate to the average user experience.

For this reason, in switchback tests, the unit of randomisation often differs from the unit of analysis, i.e. the rows over which we want to calculate average outcomes. Often, we want to analyse the data at a finer granularity that maps onto units at the user level, e.g. riders/drivers.

However, users within the same timeslice will have correlated responses because their outcomes affect each other. This introduces the problem of correlation between units, which violates the assumption of independence of units in regression models used to estimate treatment effects. To counter this, we use cluster-robust standard errors (CRSE).

An overview of this issue has been presented in this article. As a sanity check for the CRSE correction, we simulated A/A tests and inspected the resulting p-value distributions.

An A/A test is constructed by configuring both treatment and control with the same default condition. We can simulate multiple A/A switchback tests using historical data and randomly reassigning each timeslice between treatment and control. By construction, we expect the null hypothesis to hold (treatment and control are the same) and the p-values of our simulation to be uniformly distributed. Deviation from the uniform distribution indicates that the analysis of that metric is biased, e.g. because its standard errors are underestimated.

Below, we can see the p-value distribution without any correction (left) and when CRSEs were used (right). This approach was easy to implement (a single argument in statsmodels regression) and appeared to provide an unbiased distribution of p-values.

P value distributions for uncorrected (left) and CRSE-corrected (right) A/A tests.
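The A/A check above can be sketched roughly as follows. This is a minimal illustration on synthetic data, not our production code: the data-generating process, the column names (`timeslice_id`, `y`, `treated`) and the helper `aa_p_values` are assumptions made up for the example. The only statsmodels-specific part is passing `cov_type="cluster"` with the timeslice as the grouping variable.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Synthetic "historical" data (assumption): 60 timeslices of 20 users each.
# Users in the same timeslice share a random shock, which correlates outcomes.
n_slices, n_users = 60, 20
slice_id = np.repeat(np.arange(n_slices), n_users)
shock = rng.normal(0, 1, n_slices)[slice_id]
data = pd.DataFrame({
    "timeslice_id": slice_id,
    "y": shock + rng.normal(0, 1, n_slices * n_users),
})

def aa_p_values(data, n_sims=200):
    """Return (uncorrected, cluster-robust) p-values over simulated A/A tests."""
    naive, robust = [], []
    slices = data["timeslice_id"].unique()
    for _ in range(n_sims):
        # Randomly reassign each timeslice to treatment (1) or control (0).
        assignment = dict(zip(slices, rng.integers(0, 2, len(slices))))
        sim = data.assign(treated=data["timeslice_id"].map(assignment))
        model = smf.ols("y ~ treated", sim)
        naive.append(model.fit().pvalues["treated"])
        robust.append(
            model.fit(
                cov_type="cluster",
                cov_kwds={"groups": sim["timeslice_id"]},
            ).pvalues["treated"]
        )
    return np.array(naive), np.array(robust)

naive_p, robust_p = aa_p_values(data)
# Under the null, roughly 5% of p-values should fall below 0.05.
naive_fpr = (naive_p < 0.05).mean()
robust_fpr = (robust_p < 0.05).mean()
print(f"false positive rate: naive={naive_fpr:.2f}, cluster-robust={robust_fpr:.2f}")
```

Plotting histograms of `naive_p` and `robust_p` gives the kind of comparison shown in the figure: the uncorrected p-values pile up near zero, while the cluster-robust ones should look much closer to uniform.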

Assignment tracking

Tracking the assignment of units to treatment variants in A/B testing adds engineering effort and complexity to data pipelines. We found a clever solution for switchbacks that lets us avoid storing assignments altogether!

Hash functions produce outputs that look random but are fully reproducible. If we hash some easily known properties of an observed unit in a city where a switchback test is running (e.g. a string made of the test id and the timestamp of enrollment rounded down to the nearest timeslice start), we can get a treatment assignment simply by converting this string to bytes, hashing it, and reading the first few bytes of the result. This trick makes treatment assignment for switchbacks “stateless” and a cinch to analyse.
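Here is a minimal sketch of this idea using only the Python standard library. The function name `assign_variant`, the key format, the choice of SHA-256, and the 30-minute timeslice are illustrative assumptions rather than a description of our platform.

```python
import hashlib
from datetime import datetime, timezone

def assign_variant(test_id: str, event_time: datetime, slice_minutes: int = 30) -> str:
    """Derive a timeslice's variant from a hash, with no stored assignments."""
    slice_seconds = slice_minutes * 60
    # Round the timestamp down to the start of its timeslice.
    slice_start = int(event_time.timestamp()) // slice_seconds * slice_seconds
    key = f"{test_id}:{slice_start}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Read the first 4 bytes as an integer; its parity picks the variant.
    bucket = int.from_bytes(digest[:4], "big") % 2
    return "treatment" if bucket == 1 else "control"

# Two events in the same 30-minute timeslice always get the same variant.
t1 = datetime(2023, 5, 1, 12, 5, tzinfo=timezone.utc)
t2 = datetime(2023, 5, 1, 12, 25, tzinfo=timezone.utc)
same_slice = assign_variant("dispatch_test_1", t1) == assign_variant("dispatch_test_1", t2)
```

Because the variant depends only on the test id and the timeslice start, any pipeline can recompute it at analysis time from the event timestamp alone.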

Sample size requirements

Like any test, switchback tests need calculations for sample size requirements to ensure the test is sufficiently powered to detect treatment effects. Note that timeslice-randomised tests are basically “cluster” designs, where you randomise across groups of units that share some characteristics that might be relevant for treatment outcomes, e.g. the time of the day they’re observed. There is already rich literature on cluster-randomised sample size calculations to draw on (1, 2, 3, 4).

This literature has derived a formula for calculating the power of cluster-randomised tests. It calculates an “inflation factor” (IF) that multiplies the sample size requirement if the test was randomised at the standard unit level of users, i.e. what our timeslice clusters consist of. For example, for timeslices of varying size (the typical case of having traffic that varies throughout the day), we use the following formula:

$$\mathrm{IF} = 1 + \left((cv^2 + 1)\,m - 1\right)\rho$$

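In this formula, cv is the coefficient of variation of timeslice sizes (their standard deviation divided by their mean), m is the average number of units per timeslice, and ρ is the intra-cluster correlation coefficient (ICC), i.e. the share of the total outcome variance that lies between timeslices.

Here’s an example of estimating the ICC using the statsmodels package (timeslice_id is a unique timeslice identifier, and y is the outcome metric):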

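import statsmodels.formula.api as smf

# Random-intercept model: one intercept per timeslice
res = smf.mixedlm("y ~ 1", data, groups=data["timeslice_id"]).fit()
between_variance = res.cov_re.values[0][0]  # variance of the timeslice intercepts
within_variance = res.scale  # residual variance within timeslices
icc = between_variance / (between_variance + within_variance)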

We then multiply our standard unit-level sample size requirement by this IF. We use means and standard deviations from the most recent two weeks of metric data. From this, we can see how many days we’d need to run the test. We usually round this up to the nearest two-week interval to ensure the treatment groups are balanced for weekly seasonality.
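To make the calculation concrete, here is a small sketch under stated assumptions. The timeslice sizes, ICC, standard deviation, minimum detectable effect and daily traffic are all made-up numbers, and `units_per_group` is the textbook two-sample formula for a difference in means, used as a stand-in for whatever unit-level calculation you already have.

```python
import math
from statistics import NormalDist, mean, pstdev

def inflation_factor(slice_sizes, icc):
    """IF = 1 + ((cv^2 + 1) * m - 1) * icc for timeslices of varying size."""
    m = mean(slice_sizes)
    cv = pstdev(slice_sizes) / m
    return 1 + ((cv**2 + 1) * m - 1) * icc

def units_per_group(sd, min_effect, alpha=0.05, power=0.8):
    """Unit-level sample size per group for a two-sided difference in means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return 2 * ((z_alpha + z_beta) * sd / min_effect) ** 2

slice_sizes = [80, 120, 200, 150, 50, 100]  # hypothetical users per timeslice
icc = 0.02                                  # hypothetical ICC
IF = inflation_factor(slice_sizes, icc)

# Inflate the unit-level requirement, then convert it to a test duration.
n_switchback = math.ceil(units_per_group(sd=10.0, min_effect=0.5) * IF)
daily_units = 5000                          # hypothetical daily traffic
days = math.ceil(2 * n_switchback / daily_units)
days_rounded = math.ceil(days / 14) * 14    # round up to whole two-week blocks
```

When all timeslices have the same size, cv is zero and the expression reduces to the familiar 1 + (m − 1)ρ.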

Inter-dependence between units-of-analysis

Initially, we used “sessions” as the unit of analysis for timeslice-randomised tests, defined as an app open followed by an order or no order within 30 min of the last app activity. It was a sensible unit of analysis as sessions can be easily interpreted in terms of business impact (e.g. our feature causes X% lift on average per session). A session can also be cleanly assigned to one treatment group, in contrast to users, who can make orders in multiple timeslices and, therefore, could end up with multiple treatment group assignments.

However, we later discovered a serious bias risk with this unit of analysis choice.

One day, some stakeholders were testing a feature that caused sessions to crash and forced users to create a replacement session to order a ride. Notably, nearly all users who successfully replaced their crashed sessions with a new session made an order, as shown below.

This behaviour caused a reduction in the average order value per session because the crashed sessions had no value and brought down the average. Note that since orders were replaced, it didn’t lead to a loss in the sum value of orders.

Example scenario where a feature induced dependence between units (sessions) in how it caused users to replace crashed sessions.

Stakeholders could have easily interpreted this drop in average session value as indicating that the treatment led to a loss. While it’s true that a feature that causes crashes isn’t positive, we want stakeholders to get an accurate picture of treatment effects, so this misinterpretation isn’t ideal. Thankfully, our stakeholders spotted this, and we sought a solution to avoid this possible misinterpretation.

The problem is that this feature induced a dependence between sessions, because users were creating new sessions to replace crashed ones. This biases the interpretation of average changes at the session level. The increased dependency between units also violates the assumptions of the statistical tests used to estimate treatment effects.

How do we avoid these nasty biases? Changing the unit of analysis to the user would remove this dependence between units, but as mentioned, users can appear in multiple timeslices and therefore in multiple treatment groups. So, we found a unit between sessions and users: “user-per-timeslice”, i.e. the unique user observed during a specific timeslice.

This unit controls for the dependence between sessions, at least those close to each other in the same timeslice, and is always assigned to exactly one treatment group. It’s also reasonably easy to interpret in terms of the user experience, compared with session-level changes.
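A toy example shows how the two units of analysis differ when a crashed session is replaced. The rows and values below are invented for illustration, and the two helper functions are hypothetical, not part of our platform.

```python
from collections import defaultdict

# Hypothetical session rows: (user_id, timeslice_id, order_value).
# User "u2" had a crashed session (value 0) followed by a replacement session.
sessions = [
    ("u1", "t1", 10.0),
    ("u2", "t1", 0.0),   # crashed session
    ("u2", "t1", 10.0),  # replacement session with an order
    ("u3", "t1", 10.0),
]

def mean_per_session(rows):
    """Average order value with one row per session."""
    return sum(value for _, _, value in rows) / len(rows)

def mean_per_user_timeslice(rows):
    """Average order value with one row per (user, timeslice) pair."""
    totals = defaultdict(float)
    for user_id, timeslice_id, value in rows:
        # Sum all sessions of the same user within the same timeslice.
        totals[(user_id, timeslice_id)] += value
    return sum(totals.values()) / len(totals)

session_avg = mean_per_session(sessions)            # 30 / 4 sessions = 7.5
user_slice_avg = mean_per_user_timeslice(sessions)  # 30 / 3 units = 10.0
```

The total order value is the same in both cases; only the session-level average is dragged down by the extra zero-value row.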

To validate whether this user-per-timeslice unit of analysis addresses the above biases, we looked at how it affected them in the above test data. Specifically, we checked two things: the imbalance in the number of units between treatment groups, and the differences in metrics between treatment groups.

We performed these checks in the affected A/B tests and found that the imbalance in units between treatment groups went from a significant -5.4% in session units to a non-significant -0.6% in user-per-timeslice units (note: some degree of imbalance is expected in timeslice-randomised tests given differences in activity over periods).

We also found that differences in metrics between treatment groups went down, on average, by 5%. Sometimes the difference between “session” and “user-per-timeslice” units of analysis led to changes in the sign of significant effects (e.g. from positive to negative), indicating that the bias induced by this inter-dependence of units had a meaningful impact on interpretations of results.

Imbalanced assignment of time periods

Originally, we fully randomised the assignment of timeslices to treatment groups in switchback tests, i.e. each day had a completely random schedule. However, we noticed that this could leave times of day unevenly split between treatment groups, e.g. busy lunchtimes between 12 p.m. and 2 p.m. being assigned more often to one treatment group than the other.

To ensure daily periods were balanced, we developed a more constrained form of randomisation in which consecutive pairs of days were balanced by inverting the first day’s schedule (i.e. switching the assignments of treatment groups) to get the second day’s schedule. Then, for successive weeks, the first week’s schedule was fully inverted to get the second week’s, ensuring that if Monday lunchtime were treatment in the first week, it would be control in the second week (see illustration below).

Illustration of quasi-randomized switchback schedule over 2 week period.

Using this “quasi-randomised” switchback design, we achieved much higher levels of balance, which reduced the rate of false positives in A/A tests arising from this imbalance, e.g. users with a higher order value coming during lunchtimes being over-represented in the treatment vs the control group. A similar implementation is outlined here.
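The inversion logic can be sketched as below. This is an illustrative version only: the function name `two_week_schedule` is hypothetical, and how the unpaired seventh day of the week is handled (here, a fresh random draw) is our assumption for the example, since a week has an odd number of days.

```python
import random

def two_week_schedule(slices_per_day: int, seed: int = 0):
    """Return 14 daily schedules; each is a list of 0 (control) / 1 (treatment)."""
    rng = random.Random(seed)
    week1 = []
    for day in range(7):
        if day % 2 == 1:
            # Second day of a pair: invert the previous day's schedule.
            week1.append([1 - x for x in week1[-1]])
        else:
            # First day of a pair (and the unpaired 7th day): random draw.
            week1.append([rng.randint(0, 1) for _ in range(slices_per_day)])
    # Week 2 is week 1 fully inverted.
    week2 = [[1 - x for x in day] for day in week1]
    return week1 + week2

schedule = two_week_schedule(slices_per_day=24)
# Each time-of-day slot is treated on exactly 7 of the 14 days.
treated_days_per_slot = [sum(day[i] for day in schedule) for i in range(24)]
```

Because the second week mirrors the first, every time-of-day slot and every weekday slot is split evenly between treatment and control over the two weeks.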

Coordination of tests

As teams began ramping up their testing, they wanted to experiment more with features that could affect the marketplace. This led to a surge in demand for switchback tests and put pressure on the testing backlog. We only allowed one switchback test per city to ensure tests were not interfering with each other, which can happen for simultaneously running switchback designs, e.g. treatment timeslices of one test occurring more often during the treatment timeslices of another concurrent test, leading to a bias in the estimation of these treatment effects.

To relieve this congestion, we developed a simple approach for coordinating simultaneous switchback tests with reduced risk of bias. Suppose we want to run two tests in parallel, test A and test B. The approach requires that the timeslice duration of one test fits evenly into that of the other, e.g. if test A has two-hour timeslices, then test B can have timeslices of an hour, 30 mins, or 10 mins, but not 45 mins.

Now, we simply ensure that test B has an alternating schedule that starts on one of test A’s switches (see below). True, it’s not ideal as alternating schedules have a higher number of annoying treatment switches (reducing this is why we do switchbacks in the first place), but it’s an acceptable tradeoff in some cases, especially given that it requires little added engineering effort. This way, we ensure that each timeslice of test A has both a control and a treatment of test B. Indeed, if test A is two hours and test B is one hour, there is still space for test C with 30 mins (if stakeholders are in a rush!).

Illustration of parallel switchback test schedule over 2 day period.
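As a rough sketch of the parallel schedule, the snippet below takes an arbitrary schedule for test A and lays an alternating schedule for test B on top of it, aligned to test A's switches. The function name `parallel_schedule` and the example schedule for test A are hypothetical.

```python
from collections import Counter

def parallel_schedule(a_schedule, a_minutes, b_minutes):
    """Return one (A variant, B variant) pair per B-sized timeslice."""
    if a_minutes % b_minutes != 0:
        raise ValueError("test B's timeslice must fit evenly into test A's")
    per_a = a_minutes // b_minutes  # number of B timeslices inside one A timeslice
    rows = []
    for a_index, a_variant in enumerate(a_schedule):
        for k in range(per_a):
            # Test B simply alternates control (0) / treatment (1), starting
            # on the first switch of test A.
            b_index = a_index * per_a + k
            rows.append((a_variant, b_index % 2))
    return rows

# Test A: hypothetical schedule of 2-hour timeslices; test B: 1-hour timeslices.
a_schedule = [1, 0, 0, 1, 0, 1]
rows = parallel_schedule(a_schedule, a_minutes=120, b_minutes=60)
counts = Counter(rows)  # time spent in each (A, B) combination, in B timeslices
```

In this sketch each timeslice of test A contains equal numbers of control and treatment timeslices of test B whenever the ratio of durations is even, as in the two-hour/one-hour example; an odd ratio would leave each A timeslice with one more B timeslice of one variant than of the other.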

We hope you find some value in our explanation and thoughts about switchback designs. Feel free to reach out to garret.oconnell@bolt.eu with any questions.

