Experimentation Analysis at Lime

Kamya Jagadish
Lime Engineering · Jul 18, 2020

Inaccurate experimentation analysis leads to suboptimal business decisions. As Lime grows, the importance of standardized experiment analysis increases commensurately: with hundreds of thousands more riders on our platform today than two years ago, a wrong business decision now can impact thousands more riders than it could before. In order to stay agile with our testing while maintaining rigor in our experiment analysis, we built an experimentation handbook for teams to reference. The Handbook not only offers guidance on how to best set up and analyze tests, but also establishes a standard set of methods for the Data Science and Analytics team to employ. This blog post includes some sections from our Handbook, which is organized into Pre-Test (test setup), During Test (checking in), and After Test (results) sections.

Pre-Test (Set Up)

Picking an Experimentation Setup

When determining which experiment setup to use at Lime, there are a few key factors to consider.

1) What kind of test are we running (i.e. what are we randomizing on)?

  • Rider Tests: Tests affecting the Rider experience such as UI changes or new promotions are Rider Tests with KPI metrics like Rider Retention. These tests are normally randomized on Riders.
  • Deployment Tests: Tests that affect deployment strategies, such as changes to Scooter Hotspots (if permitted to do so), are deployment tests with KPI metrics like trip start conversion. These tests are generally randomized on Scooters.
  • Hardware Tests: Tests on the Scooter hardware or firmware itself, such as an improvement to scooter parts, are generally randomized on scooters.

2) Do we anticipate the treatment to have network effects?

After determining whether our treatment should be tested via a Rider, Deployment, or Hardware test, we then need to consider whether the treatment will have network effects, also known as “interference bias.” (As a quick summary, network effects occur when the change in behavior of users in the treatment group ends up affecting users in the control group, thereby biasing the results. For example, if we made Lime scooters free for treatment group users, scooter rides would likely increase enough that users in the control group would have a harder time than normal finding an available scooter to ride, so the control group would no longer be acting at the “normal” baseline.)

If we don’t anticipate having network effects, then we can often use an A/B test setup. However, if we do anticipate network effects, then we usually end up using a Pre-Post experiment method or a Switchback test (both of which are defined in section 3b).

3) Which specific test setup to use?

3a) When there are no anticipated network effects, generally an A/B test is applicable. A/B tests are controlled tests where some portion of users get the test treatment and some portion of users get the control treatment. The main next step is to determine the size of the test and control groups. Here is a rough guide for how we decide the sizes at Lime:
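Whatever split we choose, the size needed per group comes from a standard power analysis. Below is a minimal sketch using statsmodels; the baseline conversion rate and minimum detectable lift are assumed values, not Lime’s actual numbers.

```python
# Minimal sketch: size the test and control groups via a power analysis.
# Baseline rate and minimum detectable lift below are illustrative assumptions.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.20             # assumed baseline trip-start conversion
minimum_detectable_lift = 0.02   # smallest absolute lift worth detecting

effect_size = proportion_effectsize(baseline_rate + minimum_detectable_lift,
                                    baseline_rate)

n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,   # 95% confidence level
    power=0.80,   # 80% chance of detecting the lift if it is real
    ratio=1.0,    # equal-sized test and control groups
)
print(f"Riders needed per group: {n_per_group:.0f}")
```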

3b) When we do anticipate network effects, there are two main test options we use at Lime: a Pre-Post test (comparing metrics before and after rolling the change out to everyone) or a Switchback test (alternating treatment and control over time windows).
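As a toy illustration of a Switchback setup, the schedule below alternates treatment and control across fixed time windows within each market; the window length and market names are assumptions, not Lime’s actual configuration.

```python
# Minimal sketch of a Switchback schedule: each market alternates between
# treatment and control in fixed time windows so that network effects stay
# contained within a window. Window length and markets are illustrative.
from datetime import datetime, timedelta
import random

def switchback_schedule(markets, start, days=7, window_hours=6, seed=42):
    rng = random.Random(seed)
    schedule = []
    for market in markets:
        # Randomize which condition each market starts in.
        condition = rng.choice(["treatment", "control"])
        t, end = start, start + timedelta(days=days)
        while t < end:
            schedule.append((market, t, condition))
            condition = "control" if condition == "treatment" else "treatment"
            t += timedelta(hours=window_hours)
    return schedule

for market, window_start, condition in switchback_schedule(
        ["paris", "san_francisco"], datetime(2020, 7, 1))[:4]:
    print(market, window_start, condition)
```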

The decision tree below summarizes the methodology explained above.

Dealing with Overlapping Tests

As Lime scales, we have more tests running simultaneously that could be affecting each other, thereby introducing bias. If separate tests’ treatments impact the same users and/or the same metrics at the same time, then we can’t be sure what the true individual impact of each test is (e.g. running two tests that both aim to increase adoption of the Lime Wallet feature). This table outlines Lime’s four solutions for running more than one test with similar impacts at the same time:
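As a generic illustration (not necessarily one of the four solutions in the table), one common way to keep two tests with similar impacts from sharing riders is to deterministically hash each rider into mutually exclusive buckets; the test names, salt, and 50/50 split below are assumptions.

```python
# Sketch of one common approach to overlapping tests: hash each rider into a
# bucket so that two tests targeting the same metric never share riders.
# Test names, salt, and the 50/50 split are illustrative assumptions.
import hashlib

def bucket(rider_id: str, salt: str = "wallet-tests", n_buckets: int = 100) -> int:
    digest = hashlib.md5(f"{salt}:{rider_id}".encode()).hexdigest()
    return int(digest, 16) % n_buckets

def eligible_test(rider_id: str) -> str:
    # Each rider is deterministically eligible for exactly one of the two tests.
    return "wallet_promo_test" if bucket(rider_id) < 50 else "wallet_onboarding_test"

print(eligible_test("rider_123"))
```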

During Test (Checking In)

Checking for Sampling Bias

Whenever running a test, at Lime or elsewhere, we need to ensure that there aren’t inherent biases in our samples before we can trust any insights drawn from comparing a treatment and control group. The most basic safeguard is running the test long enough to collect a large enough sample size, which reduces random bias. In addition to this:

  1. If the test is pre-assigned (users are all exposed at one static point in time), then we do a quick A/A test, which entails splitting our test and control populations to compare basic metrics between the two groups without yet having the test users exposed to any treatment conditions. Once we’ve confirmed that there are no differences in metrics such as trips per user or app sessions per day, we can rule out the potential for sampling bias.
  2. If the test has live assignment (more users are exposed each day), then an A/A test is not a viable method to rule out sampling bias. Instead, once the test has reached completion, we check that pre-exposure metrics are similar between the test and control groups (similar to a post-factual A/A test; see the sketch after this list).
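Here is a minimal sketch of that pre-exposure comparison (the A/A-style check), run as a Welch t-test on trips per user; the data is simulated rather than real Lime data.

```python
# A/A-style check: compare a pre-exposure metric (e.g. trips per user) between
# the test and control groups. The data here is simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
trips_test = rng.poisson(lam=3.0, size=5000)     # pre-exposure trips, test group
trips_control = rng.poisson(lam=3.0, size=5000)  # pre-exposure trips, control group

t_stat, p_value = stats.ttest_ind(trips_test, trips_control, equal_var=False)
if p_value < 0.05:
    print(f"Possible sampling bias: pre-exposure metrics differ (p={p_value:.3f})")
else:
    print(f"No pre-exposure difference detected (p={p_value:.3f})")
```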

When comparing groups to ensure similarity, we also do a sample ratio mismatch check (i.e. we run a significance test on the volume of exposed users) to ensure that a difference in the exposure volumes themselves doesn’t introduce bias.
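The sample ratio mismatch check itself can be a chi-square goodness-of-fit test of the observed exposure counts against the intended allocation; the counts and 50/50 split below are made up.

```python
# Sample ratio mismatch check: do the observed exposure counts match the
# intended 50/50 split? The counts below are made-up examples.
from scipy.stats import chisquare

observed = [50_912, 49_088]            # exposed users in test and control
total = sum(observed)
expected = [total * 0.5, total * 0.5]  # intended 50/50 allocation

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
if p_value < 0.001:
    print(f"Sample ratio mismatch (p={p_value:.4f}); investigate the assignment")
else:
    print(f"No sample ratio mismatch detected (p={p_value:.4f})")
```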

Dealing with the Peeking Problem

Peeking (looking at test results before reaching the needed sample size, n) causes the probability of a Type 1 error (rejecting a true null hypothesis) to increase. The most common way to deal with this is to calculate the required n before starting the test and then only make a decision once that sample size is reached. However, the drawbacks of that solution are:

  1. We sometimes want to peek to stop bad things from happening or to catch a flawed experimental design early >> Fix: we can still do this as long as we don’t report the interim results officially; we just use the peek as a safeguard in case something negative is occurring.
  2. We become too reliant on the power analysis calculations made before the test starts (i.e. the a priori effect size), e.g. we overestimate the effect we expect to see or need to see from a business perspective >> Fix: we can always err on the side of a low effect size, like 2% instead of 5% (though this means a higher n is necessary, so it is not very sustainable).
  3. The p = .08 problem: it’s clear that we are almost at significance and moving in that direction, but we need to let the experiment run a bit longer and collect more n to get to the 5% threshold >> Fix: we can always err on the side of requiring a larger n than we think we need, but again this doesn’t seem like the optimal solution.

At Lime, our solution is to run a normal power analysis before the experiment starts to calculate the day on which we should have a sufficient sample size to check results. We check the final results only once, on that planned day, and require a 95% confidence level to claim the results are statistically significant. If we want to check results before that time (or are requested to do so), then we can only claim results are stat sig if they meet a 99.9% confidence level as opposed to the initial 95%. This reduces the likelihood that we falsely reject the null hypothesis by recalculating the impact multiple times. We decided that this solution suits us better than Bayes Factors (another common solution in the industry) since it is a simpler calculation and we are still building out our internal experimentation analysis tools.
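The decision rule is simple enough to sketch directly; the dates below are placeholders rather than a real experiment’s schedule.

```python
# Sketch of the peeking rule above: 95% confidence on the planned check date,
# 99.9% confidence for any earlier (unplanned) peek.
from datetime import date

def is_stat_sig(p_value: float, check_date: date, planned_date: date) -> bool:
    alpha = 0.05 if check_date >= planned_date else 0.001
    return p_value < alpha

planned = date(2020, 7, 15)
print(is_stat_sig(0.03, date(2020, 7, 10), planned))  # early peek -> False
print(is_stat_sig(0.03, planned, planned))            # planned check -> True
```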

After Test (Results and Roll Out)

Picking a Statistical Test to Use

It’s important for our teams to be aligned on which statistical test to use when, so that we read impact in a consistent way. Here are the guidelines that we use at Lime (a short code sketch follows the list):

  • For continuous metrics (e.g. revenue per trip), we are comparing means and can generally use a two-tailed t-test. If there is a small sample size, then we use a bootstrap t-test.
  • For ratios (e.g. WoW retention), we also recommend a t-test. We use a t-test instead of a z-test because the z-test requires larger samples to be valid and is generally only more efficient and powerful when the metric being analyzed is binary.
  • The above two bullet points assume the metrics are normally distributed, but at Lime we often deal with skewed data (e.g. trips per user is left-skewed), in which case we use the Wilcoxon Rank-Sum test, which does not require any assumption about the distribution.
  • For categorical variables, we would propose using a chi-square test. However, categorical metrics are currently rarely used at Lime, so we haven’t built this into our experiment analysis pipelines.
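A minimal sketch of these guidelines using scipy, with simulated metrics standing in for real test data:

```python
# Sketch of the test-selection guidelines above. All data is simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Continuous, roughly normal metric (e.g. revenue per trip) -> two-tailed t-test.
revenue_test = rng.normal(2.6, 1.0, 4000)
revenue_control = rng.normal(2.5, 1.0, 4000)
print(stats.ttest_ind(revenue_test, revenue_control, equal_var=False))

# Skewed metric (e.g. trips per user) -> Wilcoxon rank-sum, no normality assumption.
trips_test = rng.exponential(3.0, 4000)
trips_control = rng.exponential(2.9, 4000)
print(stats.ranksums(trips_test, trips_control))

# Categorical metric -> chi-square test on a contingency table of counts.
contingency = np.array([[1200, 2800],   # test: converted vs. not converted
                        [1100, 2900]])  # control: converted vs. not converted
chi2, p, dof, expected = stats.chi2_contingency(contingency)
print(chi2, p)
```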

How to Accurately Run Multiple Comparisons

The issue with running multiple comparisons (e.g. comparing a metric for each country exposed to the treatment) is that the probability of incorrectly rejecting a true null hypothesis increases significantly as the number of simultaneously tested hypotheses increases. There are many ways to correct for this, and each correction method introduces a trade-off between controlling false positives and reporting too many false negatives. Given that each experiment is nuanced, this trade-off must be made to best suit the needs of the business. At Lime, we use the Benjamini–Hochberg (BH) procedure to adjust the p-values; it is straightforward to implement and less strict than the Bonferroni correction (another method widely used in industry), since it controls the false discovery rate rather than the stricter family-wise error rate.
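A sketch of the BH adjustment using statsmodels, with made-up per-country p-values:

```python
# Benjamini-Hochberg adjustment for multiple comparisons, e.g. one p-value per
# country exposed to the treatment. The p-values below are made up.
from statsmodels.stats.multitest import multipletests

p_values = [0.004, 0.020, 0.035, 0.049, 0.210]  # one test per country
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

for p_raw, p_adj, is_sig in zip(p_values, p_adjusted, reject):
    print(f"raw p={p_raw:.3f}  BH-adjusted p={p_adj:.3f}  significant={is_sig}")
```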

This is a simplified version of how we choose a hypothesis test but shows the main framework. The most common additional test we use is ANOVA for comparing metric impact across multiple variants.
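For completeness, a one-way ANOVA across a control and two treatment variants might look like this, again on simulated data:

```python
# One-way ANOVA comparing a metric across a control and two treatment variants.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(2)
control = rng.normal(2.50, 1.0, 3000)
variant_a = rng.normal(2.55, 1.0, 3000)
variant_b = rng.normal(2.60, 1.0, 3000)

f_stat, p_value = f_oneway(control, variant_a, variant_b)
print(f"F={f_stat:.2f}, p={p_value:.4f}")
```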

Extrapolating to Topline

Very often people look at the impact seen in a test (e.g. +5% retention) and assume that is the lift on the topline; however, this is inaccurate if the test is only exposed to a certain population of riders (e.g. only new riders or only Paris riders). As an example, in San Francisco, Lime Riders need to lock their scooters to a bike rack after ending a ride. If we make a change to the locking process, test the improvement only in San Francisco, and see a +10% increase in trips taken, can we claim that we’ll see +10% trips globally? No, because only San Francisco was exposed to the test treatment, and it affects a feature (lock-to) that is not present in all global markets.

Accurately calculating the topline impact is pretty straightforward: it's generally a matter of multiplying the impact we see by the percentage of users who are eligible for the treatment. Our proposed solution is therefore to determine this percentage when sizing the test (it is often the same as the trigger point, e.g. the proportion of exposed users who open their wallet out of all users who open their app) and then to create a table like this for stakeholders to easily understand the global topline impact.
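The arithmetic behind such a table is just the observed lift scaled by the eligible share of riders; the numbers below are illustrative, not results from the San Francisco example.

```python
# Topline extrapolation: scale the observed lift by the share of riders who are
# actually eligible for the treatment. Numbers are illustrative only.
observed_lift = 0.10    # +10% trips among exposed riders
eligible_share = 0.15   # assumed share of all riders in markets with lock-to

topline_lift = observed_lift * eligible_share
print(f"Expected global topline lift: {topline_lift:.1%}")  # 1.5%, not 10%
```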

Summary

By taking the time to standardize experiment analysis methodologies within the engineering team at Lime, we not only maintain consistent rigor in our analysis across the board, but are also now able to build our first experimentation analysis platform around these principles. The platform already applies BH corrections when measuring significance and will soon include other automated checks such as A/A tests. As a team, we have many more topics to discuss and standardize (e.g. the best way to check for network effects via counter-metrics), so stay tuned for future blog posts!

Acknowledgments

Many members of Lime’s Data Science and Analytics team contributed to our handbook including Tristan Taru, Dounan Tang, Jeh Lokhande, Siyi Luo, and Ben Laufer.
