Our journey to experiments

Richard Bremner
Koala Digital
6 min read · Oct 9, 2018
experiments can get out of control

In the beginning, Koala created the website and mattress. And Koala saw that it was good.

When you build a single-product company there are a lot of shortcuts you can take and assumptions you can make, and when that product is a raging success you can optimise everything around it. The company became really good at selling mattresses and delivering them in four hours. The mattress itself is our proprietary formula, and with over 13,000 five-star Yotpo reviews, our customers clearly love it. We built backend tools to fill the gaps between warehouses, inventory, transport, and finance. Where we found impediments we crushed them. With such singular focus, it was like the company was on rails.

From one to many

The problem with being on rails is that it can be hard to change direction. A young startup like Koala is a very fast-changing environment. We introduced new products like the Bed Base and the Sofa, and have many more in the pipeline. When I joined in April ’18, OKRs (Objectives and Key Results) were already being introduced. Teams here have an insane amount of autonomy: we have goals, we are left alone to deliver, and we are supported where we need it. That said, there was plenty of opportunity for improvement. We were tracking too many goals, which made it hard to know when and where to focus.

At Koala I’m accountable for the Digital team’s results, which span our e-commerce websites and our backend fulfillment platform. At the time, the team had committed to around 20 key results, and none of them really moved any needles for the company. Velocity was high and a lot of work was being done, but it wasn’t the right work. Whilst the business was still growing, the team wasn’t delivering the net benefit that was its reason for being. We’re not satisfied just to exist; we’re obsessed with delivering value to our customers with a measurable benefit to the business. So we ranked and prioritised our OKRs, found two true “key metrics”, and deprecated the rest. We split the team in two, each half owning one of those metrics. Now every team member was super clear on their goal: everyone had one number to focus on. But there was a catch: we lacked ownership.

Taking ownership

Running experiments (A/B or multivariate tests) was not new to Koala. Whilst we were busy beavering away on ineffectual work, we had outsourced experimentation to an “industry expert”. It’s a big wake-up call to realise that the most important outcome you’re accountable for is not only being controlled by someone else, but that they’ve had no impact over a period of many months. Eyes wide open, we had to take ownership. We had to build the experiments and, more importantly, we had to build the muscle and capability to do so.

We decided to make everything an experiment. We use a visual management board to help run the team. One of the great things about these boards, if not the greatest, is that they expose problems. Ours exposed that we often didn’t know why we were doing a task, or how it was performing after we’d shipped it. That’s not an unfamiliar story to any Engineering Manager. Try asking someone at your next standup, “why are you doing that?” Blank stares are the leaders’ fault.

What’s a visual management board? Without going into detail, it’s a Kanban-like workflow where tasks move through columns from left to right. Our columns looked like this:

Digital team’s Visual Management Board

Each task on the board was informally worded according to the output required of a developer. A fake example:

build a new Add To Cart button

Why are we building a new Add to Cart button? How will we know the effect of doing so?

So we made two small changes that had a big impact.

1. Change “DONE” to “MONITORING”. We now come to standups armed with data on how each experiment is performing.

2. Introduce an experiment hypothesis statement. We looked at the traditional user story template for inspiration. As you probably know, user stories are often worded like this:

As a ___
I want to ___
So that ___

So we came up with our own wording for experiments. It goes something like this:

By doing ___ (what change will we make)
We will ___ (what effect will it cause)
Because ___ (why do we think this)

For example:

By cross selling a related product
We will increase revenue per user
Because customers will buy more items

This forces everyone to think about the purpose of a given task. It also helps the autonomy of the development team: because the outcome is clear (“increase revenue per user”), teams are more able to make tactical decisions on their own.
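
If you want to keep these statements uniform across a board, they’re trivial to template. Here’s a hypothetical Python sketch (our board just uses plain text; the class and field names are mine):

    # Hypothetical helper for formatting hypothesis statements.
    # Our board uses plain text; this only illustrates the template.
    from dataclasses import dataclass

    @dataclass
    class Hypothesis:
        change: str  # By doing ___ (what change will we make)
        effect: str  # We will ___ (what effect will it cause)
        reason: str  # Because ___ (why do we think this)

        def __str__(self) -> str:
            return (f"By {self.change}\n"
                    f"We will {self.effect}\n"
                    f"Because {self.reason}")

    print(Hypothesis(
        change="cross selling a related product",
        effect="increase revenue per user",
        reason="customers will buy more items",
    ))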

Now that we’ve taken ownership of our key result, and everything is an experiment that we think may move us in that direction, everything is good, right? Wrong.

Experiments have a speed limit

For a given amount of traffic and a given conversion rate, there’s a relationship between how many experiments you can run concurrently and how long those experiments take to resolve (reach statistical significance). More experiments means fewer visitors exposed to each one, which means each one needs to run longer. Divide your traffic into too many cohorts and the experiments take too long. We needed to move faster than this law allows, so we had to rethink our approach and be a little more pragmatic than “everything is an experiment”. Nevertheless, swinging too far in that direction was extremely valuable: that discipline instilled the experimentation culture we needed. Now we were free to explore a little.
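
To make that speed limit concrete, here’s a back-of-the-envelope sketch using the standard two-proportion sample-size formula. The traffic and conversion numbers are illustrative, not ours:

    # Back-of-the-envelope: how long an A/B test takes to resolve as
    # concurrent experiments split the traffic. Illustrative numbers only.
    from statistics import NormalDist

    def days_to_resolve(daily_visitors, concurrent_experiments,
                        baseline_cr=0.03, relative_lift=0.10,
                        alpha=0.05, power=0.80, variants=2):
        p1 = baseline_cr
        p2 = baseline_cr * (1 + relative_lift)
        z_a = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
        z_b = NormalDist().inv_cdf(power)
        # Required visitors per variant (two-proportion z-test).
        n = ((z_a + z_b) ** 2
             * (p1 * (1 - p1) + p2 * (1 - p2))
             / (p2 - p1) ** 2)
        # Traffic is split across experiments, then across variants.
        per_variant_per_day = daily_visitors / concurrent_experiments / variants
        return n / per_variant_per_day

    for k in (1, 2, 4, 8):
        print(f"{k} concurrent: ~{days_to_resolve(10_000, k):.0f} days each")

Resolution time grows linearly with the number of concurrent experiments, which is exactly the wall we hit.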

Fortuitously, given the timing, our Product Manager Matt shared this article about Riskiest Assumption Tests (RATs). In summary, a RAT is an approach that very deliberately digs down to a root assumption, then does the smallest possible thing to test that assumption.

What’s the difference between an experiment and a RAT? It boils down to the rigour involved. Our experiments are rooted in the scientific method:

* Observe. See something in our data.
* Hypothesis. Form a hypothesis about a cause and effect.
* Experiment. Run an experiment to test the hypothesis.
* Evaluate. Measure it.
* Accept or reject. Either roll it out or kill it.

Our RATs are much less formal. We’re not looking for statistical significance, just making sure we didn’t make things worse; sometimes it’s OK not to move the needle if we simply like the change. This also saves time in the opportunity-ranking process that experiments go through: if there’s a limited number of concurrent experiments, we’d better choose them wisely.
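
A hypothetical sketch of that “did we make it worse?” guardrail; the tolerance is arbitrary and this is deliberately not a significance test:

    # Hypothetical RAT guardrail: no significance test, just a check that
    # the variant's conversion rate isn't more than `tolerance` (relative)
    # below the control's.
    def not_worse(control_conversions, control_visitors,
                  variant_conversions, variant_visitors,
                  tolerance=0.05):
        control_rate = control_conversions / control_visitors
        variant_rate = variant_conversions / variant_visitors
        return variant_rate >= control_rate * (1 - tolerance)

    # Control 3.0%, variant 2.9%: within a 5% relative margin, so ship it.
    print(not_worse(300, 10_000, 290, 10_000))  # True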

But there’s a third type of task too. We call them Just Do Its. In some cases we think the current metric is either so bad we can’t make it much worse, or the change isn’t valuable enough to go through a RAT or an experiment, yet we want to make it anyway. Leave some space for just getting small stuff done.

So we went from doing misaligned tasks and outsourcing experiments to being laser-focused and owning our result. Then we pulled back a little. Tasks are now categorised into three buckets:

  1. Just do it. Moves super fast. Less monitoring.
  2. Riskiest Assumption Tests. Moves quickly. Medium monitoring.
  3. Experiments. Moves slowly. Full monitoring.

Work is ranked into each bucket according to a combination of factors. Simplifying, it’s a measure of opportunity and risk: who’s going to see the change, the expected impact, the effort to implement it, how long the experiment would have to run, and more.
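
For illustration, a ranking function along those lines could look like the sketch below. The fields and weights are hypothetical, not our actual model:

    # Hypothetical opportunity score for ranking work into buckets.
    # Fields and weights are illustrative, not our actual model.
    def opportunity_score(expected_impact, reach, risk,
                          effort_days, run_days):
        # Favour impact and reach; penalise risk, effort, and runtime.
        return (expected_impact * reach) / (risk + effort_days + run_days)

    tasks = {
        "cross sell related product": opportunity_score(7, 9, 4, 3, 14),
        "fix footer typo":            opportunity_score(1, 9, 1, 0.5, 0),
    }
    for name, score in sorted(tasks.items(), key=lambda kv: -kv[1]):
        print(f"{score:5.1f}  {name}")

Note that cheap, safe changes can legitimately outrank big experiments, which is part of why the Just Do It bucket exists.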

Learn faster

Over the course of only 1–2 months, we went from a team that had rarely shipped its own experiments to an experiment-first mindset where ideas are collected, ranked, and executed. More importantly, every time we do something we ask “what can we learn from this?” and “what’s the smallest thing we can do to learn it?”
