Data Science in the Trenches: Living w/ Small n
Somewhere, someone’s having this conversation. Right. Now.
Disclosure: If you’re looking for a stats-heavy article, this isn’t it. I leave that to my betters. We’re not talking math here, but pragmatic “how do you deal w/ this situation at work because you need an answer” stuff.
TL;DR: Worry less and ‘Do Science!’
One day, a colleague comes to me with a question every analyst gets at some point: “We’ve got these 2 designs for a page and we want to run an A/B test. BUT it’s a power-user feature on a page that’s rarely visited. How big a sample do we need to get statistical significance? What can we do?”
I normally have a very rough rule of thumb: if you’re running a straight A/B test for a web thing, you’d like at least n ≥ 4 digits, running for a minimum of a week, before you can even dream about having something that’s vaguely representative and resembles a result.
But to be more rigorous, I pulled out a sample size calculator and asked them to guess the effect size, because that’s the big parameter. As expected, they responded “Huh? No idea!” Not that I blame them; it’s totally not their fault.
A bit more poking and we agree that a 5% lift in conversion rate is as good a guess as any, so we soldier on. A couple of other arbitrary guesses about things like variance and means, and we get an answer: n = 1400(ish).
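For the curious, a number like that falls out of the standard two-proportion sample size formula. Here’s a sketch using only Python’s standard library; the 10% baseline and the reading of “5% lift” as an absolute 5-point jump are illustrative assumptions, not the actual inputs from that conversation:

```python
import math
from statistics import NormalDist

def sample_size_per_arm(p_base, p_variant, alpha=0.05, power=0.80):
    """Classic two-proportion sample size formula (two-sided z-test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 when alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # 0.84 when power = 0.80
    variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
    delta = p_variant - p_base
    return math.ceil((z_alpha + z_power) ** 2 * variance / delta ** 2)

# Guessed inputs: 10% baseline conversion, hoping to detect a 5-point jump
print(sample_size_per_arm(0.10, 0.15))  # 683
```

Two arms of ~700 each lands you right around an n = 1400(ish) total, and you can see why the effect-size guess dominates: halve delta and the required n roughly quadruples.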
“At current traffic volume that’ll take years! What can we do?”
Well, there’s a couple of things that you can consider, all of which involve different kinds of trade-offs.
Why this conversation is happening
I know that among the scientific and statistical community there’s a hot debate over the concepts of statistical significance, hypothesis testing, and alternative methodologies. To be frank, many technical aspects of that conversation go over my head because my stats background doesn’t go that deep. A typical layperson doesn’t stand a chance of following all that.
For people whose primary job isn’t designing and working with experiments, the phrase “statistically significant” approximately means “the stats people say this effect is ‘real’, so I can use the result in my life with confidence.”
People higher up the food chain, including executives, also share the above conception. It’s a generally accepted shorthand for having done your homework properly when presenting a decision based on the results of a test, and you can get grilled on it if you DON’T have that significant result when you explain yourself.
This is the backdrop that someone comes to you with this problem.
Idea #1: Stick to the A/B, but relax the statistical power
This is the simplest thing to do, in the sense that you have to change people’s thinking the least. You can play with the parameters of the test to make different trade-offs between statistical power, false-positive risk, and required sample size.
For example, changing α = 0.05 to α = 0.10 lowers the required sample size, hopefully to something reachable in a reasonable amount of time. Check out the more technical ideas of balancing α and β in this great article about things you can do about sample sizes.
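Another way to look at the same trade: hold n fixed at whatever your traffic allows, and ask how much power each choice of α actually buys you. A standard-library sketch, with all numbers hypothetical:

```python
import math
from statistics import NormalDist

def achieved_power(p_a, p_b, n_per_arm, alpha):
    """Approximate power of a two-sided two-proportion z-test at a fixed
    per-arm sample size. (Ignores the negligible far-tail term.)"""
    nd = NormalDist()
    se = math.sqrt((p_a * (1 - p_a) + p_b * (1 - p_b)) / n_per_arm)
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    return nd.cdf(abs(p_b - p_a) / se - z_alpha)

# Hypothetical situation: only ~200 users per arm are realistic
print(round(achieved_power(0.10, 0.15, 200, 0.05), 2))  # 0.33
print(round(achieved_power(0.10, 0.15, 200, 0.10), 2))  # 0.45
```

Relaxing α from 0.05 to 0.10 here lifts power from roughly a coin-flip’s worth to something noticeably better, at the cost of doubling your false-positive risk. That’s the conversation to have out loud.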
If that doesn’t give you a small enough sample, or relaxing constraints isn’t acceptable, we’ll have to explore other alternatives. But it’s always good to have these conversations about what’s expected of the test.
Idea #2: Switch to a Bayesian testing framework
The thing I like most about Bayesian A/B methods is that they express results as a very natural “probability Variant A is the best” that even laypeople grok without much explanation. They also scale to multiple variants without lots of adjustments or correction factors for multiple naive comparisons.
You can also watch the experiment results repeatedly over time (usually a no-no in the traditional setup, because SOMEONE always wants to jump the gun). Effectively, you’re punting the α-versus-β balancing act of Idea #1 into the future: you’ll see the posterior probabilities in front of you as the test runs, and decision makers can decide on their level of comfort with the overlap of the posterior distributions.
The usual downside is that it’s a less common method in the wild, meaning you might not have tooling for it in your environment, and the assumptions built into available calculator tools might not match your specific situation. There’s also the debate about selecting reasonable priors, but in practice for simple A/B tests I haven’t worried about it much (maybe I should start?).
While Bayesian methods may offer a cleaner framework for working with lower sample sizes, they don’t guarantee a result either. If a traditional hypothesis test isn’t going to give you an answer with the same data, neither will this.
Instead, what you get are tools for having a discussion with decision makers. Would management be fine with Variant B winning with 75% probability? How about 60%? I’d argue that these discussions are the most important part of the process. (Yes, you should’ve had those conversations long ago, but the parameters are so baked into traditional A/B testing that it rarely happens in practice.)
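If you want to see where a number like “75% probability” comes from, here’s a minimal Beta-Binomial sketch using only the standard library. The flat Beta(1, 1) priors and the conversion counts are assumptions for illustration, not a recommendation:

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under a Beta-Binomial
    model with flat Beta(1, 1) priors on each conversion rate."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws

# Hypothetical early data: A converted 30/400 visitors, B converted 45/400
print(f"P(B beats A) ~ {prob_b_beats_a(30, 400, 45, 400):.0%}")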
Idea #3: Rethink what you’re doing
Let’s take a step back for a moment.
Recall the whole reason we do A/B testing to begin with: we want to see whether Variant A is “better” (for some definition of better) than Variant B, and we want to be confident that judgement wasn’t just due to random chance. The typical testing framework is but one way of arriving at this. It’s very rigorous and forms the foundation of much of science, but it’s not the only way.
In industry, we’re working under time pressure and uncertainty. Implicitly we accept that our decisions can be incorrect because making no decision can be worse than making the wrong decision and quickly changing course later. Our test isn’t the final word in product development, it’s not usually even a footnote. Anything we make today will be iterated out of existence at some point in the future.
If we were in academia, in search of Truth (with a big T) and contributing to the sum of humanity’s knowledge, then the standards of scientific rigor would be much higher. If our product decision literally means life and death, as in drug trials, then our standards should be high as well.
Whether we use a big green call to action or a carousel of product banners… probably not so much.
Temper your decision making based on your cost of test failure
We’re not testing because it’s fun, we’re trying to make decisions based on evidence that one choice is better than another. That also implies that there is a cost (either in lost sales, bad user experience, boss’s mood, whatever) of choosing the “wrong” variant.
So what’s the actual cost of picking the wrong variant because your test didn’t have enough statistical power to choose the better one? Are lives at stake? Or will you miss out on a 10% lift in sales, off a low base of traffic (which is what brought us to this discussion in the first place)? Remember that if you have a HUGE actual effect, it should shine through the test despite the low sample size.
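One way to ground that discussion is a back-of-the-envelope expected-cost calculation. Every number below is invented for illustration; plug in your own:

```python
# Back-of-the-envelope cost of shipping the "wrong" variant on a
# low-traffic page. All inputs are made-up illustrations.
monthly_visits = 500           # it's a rarely-visited page
conversion_rate = 0.10
revenue_per_conversion = 50.0
relative_lift = 0.05           # the effect we guessed we might be missing
p_pick_wrong = 0.5             # coin-flip odds if we decide with no evidence

monthly_cost_of_wrong_pick = (
    monthly_visits * conversion_rate * revenue_per_conversion
    * relative_lift * p_pick_wrong
)
print(f"${monthly_cost_of_wrong_pick:.2f} per month at risk")
```

If the answer is on the order of a nice lunch, spending months of calendar time chasing significance is the wrong trade. If it’s on the order of someone’s salary, that changes the conversation.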
I’d also argue that if you’re in a startup situation, these are exactly the sorts of changes you’re looking for, not the small 5% optimization off a small base.
Use qualitative methods, live with wide CIs
While DS is ultimately a quantitative endeavor, the end goal of testing is to decide whether one alternative is better than another. User testing, observations, surveys, and interviews can all provide additional data points and insight into what’s going on.
You can even mix some quant in here: say you observe that your 5 user tests have an average task completion time of 8 minutes ± 4 minutes at 95% confidence (you’ve got a giant interval because n = 5). Well, are you OK with most users taking somewhere between 4 and 12 minutes to complete the task?
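The arithmetic behind an interval like that is short enough to sketch. The five task times below are made up, and the t critical value is hardcoded for df = 4 at 95% confidence:

```python
import math
from statistics import mean, stdev

# Hypothetical task-completion times (minutes) from five user tests
times = [4, 6, 8, 10, 12]

m = mean(times)
sem = stdev(times) / math.sqrt(len(times))  # standard error of the mean
t_crit = 2.776  # t-distribution critical value: df = 4, 95% two-sided
margin = t_crit * sem
print(f"{m:.1f} ± {margin:.1f} minutes")  # 8.0 ± 3.9 minutes
```

Note the t value, not the usual 1.96: at n = 5 the normal approximation flatters you, and the wider t-based interval is the honest one to show stakeholders.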
In a perfect world, user testing would have happened before Variants A and B were even designed and you’re using the A/B to essentially see if your user tests generalize. But we all know that this doesn’t happen all the time, so if you haven’t done it yet, now’s your chance.
Is one variant better for the company anyway?
Oftentimes, products change for reasons besides simple optimization: updating an aged site design to fit current trends, doing a major refactor to reduce tech debt or enable future projects, etc. In those instances, saying “the new version doesn’t appear statistically different from the old one” can be an acceptable outcome. The idea of “testing the new thing” usually comes up because it’s become routine to test most things before launch.
On good days, I’ll call these situations “safety checks”.
On bad days, I call them “cover your a@#” tests.
The thing to be firm about here is that you must be VERY clear that you’re doing this. You’re willfully stepping out of data-driven methodology. The decision has largely been made for reasons that have nothing to do with the things measured in the test, and we’ll only back off if the test comes back and tells us we’re having a statistically significant negative effect on the company. If everyone’s willing to own up to that, then in my opinion this is a perfectly rational decision for a company to make.
The danger is when teams do CYA tests and think that they’re still being data-driven, because “we’re running tests”. No, they’re gaming the reward/punishment system that’s in place, and they need to be called out on it. It’s a trap I’ve seen many teams fall into without realizing it, and it can really screw up the data-driven culture you’re trying to foster.
Fail anyway, learn something for the next iteration
Science is a process and a body of work, a web of studies that (assuming there’s an objective reality) all (generally) point in the same direction and replicate/support each other’s findings. As scientists, we should take this to heart — even our α=0.05 test admits a 1 in 20 chance that our test was a fluke.
Just learn as much as you can from this test and incorporate it into the next one. Did the user tests come back showing that people gravitate to faces on the new design? Did you get feedback that things were hard to find? If those point to true facts, they’ll apply just as much next time.
Did the distribution of users seem to shift in the new version? Did a metric look higher in one variant than the other, but wasn’t significant? Those are things to keep an eye on next time, when you might have more traffic. You never know when one of these new hypotheses will “click” and unlock something big.
So have these awkward conversations where the answer is “it depends”. Get people comfortable with the risks involved. Just keep moving forward.
I realize I’m walking a tightrope between scientific rigor and handwaving BS with this piece. This is how I’ve evolved to handle these conversations, and I’m sure others have found different and better ones (which I’d love to hear about).