Is the result really significant?
There seems to be an awful lot of talk amongst software teams about hypothesis testing, experiments and A/B testing, all with the aim of making better-informed decisions and delivering value faster. However, some key concepts seem to get lost in translation, as is common when approaches are newly adopted.
At SEEK a pattern was emerging where an experiment would be run on 5% of users and the next recommendation from the team would be: “we recommend increasing to 10% because the results are positive but we don’t feel we have enough data for the result to be statistically significant”. In this recommendation, the blend of how the team is “feeling” with the misuse of terms like “statistical significance” is what I started labelling “emotional significance”. That is, the result wasn’t enough for the team to feel emotionally that the hypothesis was proven, which is quite different from it being statistically proven. If our goal is to deliver value faster, then the conclusion of an experiment should guide our efforts towards producing value. Thus, the result should give us confidence that either:
- yes, this change has the positive effect we are after and our efforts should focus on rolling out to all the target audience
- or no, there is no evidence that this change makes things better, thus our efforts should focus on why and what other options are available to achieve the goal
We work closely with people who have a high level of expertise in research and statistical analysis, and it was clear that their knowledge needed to spread to the broader team. Thankfully, getting sound hypotheses and experiments didn’t require every team member to become an expert statistician. However, for my team it did involve a journey.
Who remembers their statistics class?
Our journey started by working out where we were. At stand-up I confessed that it had been a long time since I’d studied or used statistical analysis. I asked, “Who can help conceptually explain statistical significance?” No one volunteered. Then I learnt that several Comp. Sci. and Comp. Eng. graduates hadn’t actually studied statistics during their degree! Thus, my rusty explanation had to do for now. The key point was that our team needs to be able to estimate, upfront, the amount of data needed for significance. This tells us how long we would need to run a test, given a certain percentage of transactions.
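To make that upfront estimate concrete, here is a minimal sketch of the kind of simplistic calculation I mean. It is not the tool we used. It assumes the metric is a conversion rate, that the normal approximation for comparing two proportions is good enough, and that the control group is at least as large as the exposed group. The function names, the 10% and 11% rates, and the traffic figures are all made up for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_group(baseline_rate, expected_rate, alpha=0.05, power=0.8):
    """Approximate users needed in each group for a two-sided
    two-proportion z-test (normal approximation, an assumption)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # significance threshold
    z_beta = NormalDist().inv_cdf(power)           # desired power
    variance = (baseline_rate * (1 - baseline_rate)
                + expected_rate * (1 - expected_rate))
    effect = expected_rate - baseline_rate
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


def days_to_run(users_needed, daily_users, exposure):
    """Days until the exposed group reaches users_needed, when only a
    fraction `exposure` of daily users sees the change."""
    return math.ceil(users_needed / (daily_users * exposure))


# Hypothetical numbers: 10% baseline, hoping for 11%, 20,000 users a day,
# with 5% of them exposed to the change.
needed = sample_size_per_group(0.10, 0.11)
print(needed, "users per group")
print(days_to_run(needed, daily_users=20_000, exposure=0.05), "days at 5%")
print(days_to_run(needed, daily_users=20_000, exposure=0.10), "days at 10%")
```

Seen this way, the decision to go from 5% to 10% exposure is simply a trade-off between run time and risk that can be made before the experiment starts, rather than a reaction to how the early numbers feel.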
Our first success was that conversations about emotional significance disappeared.
Our next step was to construct a hypothesis and experiment. Unfortunately our colleagues with a strong statistical background had no capacity to help at this time. However, this was actually beneficial for the team’s learning. The first hypothesis and experiment were flawed in multiple ways. But the mistakes we made were key to our collective learning. One mistake was under-investing in the hypothesis and success criteria. The success criteria we set could be described as ‘hitting a home run’. Once the results came in, which were positive but less than a home run, the debate turned to moving the success criteria.
A well-structured hypothesis with a sound experiment should leave little to no room for debate around success or failure when the results are in. The debate should be around why the result was achieved. This discussion deepens our understanding and focuses our effort towards work that will deliver value to our users.
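One way to leave no room for moving the goalposts is to write the decision rule down before the experiment starts. The sketch below is an illustration rather than what we ran. It assumes a conversion-style metric and a pooled two-proportion z-test, and it assumes each group has at least one conversion. The names `two_sided_p_value`, `decide` and `min_lift`, and the two outcome labels, are hypothetical.

```python
from statistics import NormalDist


def two_sided_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates between
    control A and variant B (pooled z-test, normal approximation)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


def decide(conv_a, n_a, conv_b, n_b, min_lift, alpha=0.05):
    """Apply success criteria agreed before the experiment:
    B must beat A by at least min_lift, and the difference must be
    statistically significant at level alpha."""
    lift = conv_b / n_b - conv_a / n_a
    significant = two_sided_p_value(conv_a, n_a, conv_b, n_b) < alpha
    if lift >= min_lift and significant:
        return "roll out"
    return "investigate why"


# Hypothetical results: 10,000 users per group, criteria fixed upfront.
print(decide(1000, 10_000, 1150, 10_000, min_lift=0.01))
print(decide(1000, 10_000, 1010, 10_000, min_lift=0.01))
```

The two return values map onto the two conclusions an experiment should give us: either focus on rolling out, or focus on why the change did not work and what else might.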
Note that during this first attempt we used simplistic tools for estimating sample size and the like, based on the assumption that we had a normal distribution. For the domain we were working with, this was a naive assumption. However, it was again extremely valuable for the process of learning and lifting our statistical acumen. It was also still better than the “emotional significance” mode we had been operating under. Previously the team would start an experiment which would run “… until we have enough data”, with zero certainty of when that might be.
At this point our stats-capable colleagues had time to work with us. As we increased our understanding of statistical concepts, we became more productive. The software team worked closely with these colleagues to figure out why and how they could deliver value, faster. This cross-collaboration led to a stronger hypothesis and experiment for the next change. The experiment was carefully designed to try to prove the hypothesis wrong.
We could see the benefit of clarity versus ambiguity, which is a topic in itself. Gradually, we saw the impact of knowing in advance what the experiment would do, why, and the actions we would take for different results. It was fantastic to see this and to explain it clearly to our stakeholders.
Discipline is hard
Experiments and statistics are one way to take the guesswork out of what we do and keep our biases in check. Buoyed by the results of our first experiment, we rushed to apply the same change in a different context. The second experiment showed that the change did not work for this use case. It was a facepalm moment. However, once we had the numbers and asked ourselves “why?”, the differences between the two use cases were plain to see. There was a faster and cheaper way to come to the same conclusion. If only we had asked ourselves:
If the results of this experiment came back negative what would be the reason?
This highlights how adopting a critical mindset is more valuable than just ‘doing’ experiments. You need to start the doing and acting to embed the thinking and understanding. You have to start somewhere.
What about your journey?
I’m sure there are people with solid stats training in your organisation hiding somewhere.
My advice is to befriend them today.
Tomorrow, try a disciplined approach to creating sound hypotheses and statistical analysis. Whenever you are unsure about what to do next, assume your hypothesis, experiment or statistical analysis is flawed. It may not be, but unless you have an academic research background it’s where I’d bet my money.
Plus, this experience will give you an appreciation for what the beautiful data-science-economist-stats nerds do and you’ll be primed to team up with them.