Crafting great experiments
I’ve recently been involved in designing and running a bunch of different experiments under Lean Startup principles, with an emphasis on testing the value proposition and customer traction (rather than product-specific factors).
While on the surface the idea of experiments seems relatively straightforward, my experience has been they are trickier to get right than they first appear, especially early in the development of a product or service.
We’ve had some #flearnings along the way that I thought would be valuable to share (to “fail informatively” to borrow the phrase from Clay Shirky).
Be clear on the type of “question” you’re asking
The nature and type of experiment changes depending on the questions you’re trying to answer. Much of the stuff I’ve come across on experiments is focused on optimisation, rather than discovery (to use the parlance from design thinking).
By “optimisation” I mean experiments like “if I change this label, or add an explainer video, I will get better conversions.” This assumes you have a lot of things already available (and validated), like the value proposition, the product offering, and the like.
“Discovery” experiments are by definition much more open-ended, and in some cases what I would call “generative” — that is, they generate new ideas and concepts. There is no hypothesis to test… in fact you are often trying to determine what the hypotheses might be!
It’s still important to know what you’re looking for and which part of the Lean Canvas you’re inquiring about (i.e. who is the customer, what is their problem, what solutions might respond to those), but it’s less “cut and dried” what “success” looks like. Lean Startup techniques such as an experiment board with a “what is success” or “what is the expected result” field aren’t very helpful here.
Conversely, once you start to home in on a line of inquiry, you can switch gears and move from discovery-style activities to more quantifiable approaches. It’s important not to get into “optimisation” too early, though: wait until you are really starting to refine your messaging and execution (usability etc.).
While quantitative methods (such as surveys) can still be helpful, discovery experiments tend to be more qualitative in nature. Optimisation experiments are far more quantitative: they are all about the numbers (conversion rates, NPS etc.).
In my experience, there are three broad groups that experiments might fall within:
- Discovery: understanding the customer problem (for example, through research that results in personas, jobs to be done, etc.), co-generating ideas with customers, and early testing of potential solutions and product directions. We don’t yet have hypotheses to test; the hypotheses emerge from this process.
- Proposition: testing the broad “offer” (value proposition) via smoke tests and similar tools, with a mix of qualitative and quantitative activities. We can start to test our high level hypotheses around value proposition and offer (potential product directions).
- Optimisation: testing and refining the actual offer and MVP to get scale. This is where a lot of the things I’ve read sit, with an emphasis on ideas like “growth hacking” — how do you get maximum conversions and traffic etc.
As with most things, it’s not a case of strictly working through each of these as a “phase” before moving onto the next one (in a “waterfall” style approach). You very much flit between them. For example, you may find that optimisation reaches a certain limit, and you need to go back and re-imagine alternative offers etc. Or you might run experiments in different groups simultaneously — e.g. you may have a live smoke test while you’re meeting with a new potential customer group doing discovery work.
Be clear on when to “scale” an experiment
It’s important to be cognisant of when to move from face-to-face (F2F), in-person testing to online-only testing. Online-only testing (for example, using Google Adwords to test messaging, or setting up a smoke test or example website) is most useful after you’ve done some initial testing with people in the same room as you (or via remote collaboration tools like Skype).
As a general rule, I’ve found that online-only testing is best employed towards the “optimisation” end of the spectrum. That is, once you have a degree of confidence in what you want to test and you’ve ironed out some of the more obvious issues, then you can start to test at a larger scale. In one sense, this may seem obvious, but it pushes against some of the ethos of Lean Startup, which is to shorten the learning loop and test in the “real world” ASAP. (I should note it’s still important to “get out of the building”; just do it in person first.)
I’ve found that online-only experiments are great for detecting what works in converting, especially when you can get a good volume of traffic driving your experiment. But it’s very hard to extract the “why” behind people’s behaviour in an online-only test, whereas it’s relatively easy to do so in an F2F context.
An example: a team I was in ran one smoke test based on a bunch of learning we had done and got zero sign-ups. We then ran it by some colleagues (skilled in UX and copywriting) in person and quickly identified the issues. But not until after we’d spent some of our budget on Google Adwords and, worse, had run the experiment for 2 weeks, losing valuable learning time.
Keep your experiments “clean”
I now take a lot more care to not conflate multiple experiments or hypotheses into one test. And I’ve found this can be surprisingly hard to do.
Using that same smoke test as an example: we worked out we were testing a variety of different factors — messaging, value proposition, multiple calls to action etc. — in the one page, when we really needed to simplify and test one factor at a time.
Another example: we updated the home page for a product based on a variety of learnings we’d had along the discovery journey. It was important that we did this (the original version was performing poorly and was well out of date) but while we did increase conversions significantly (a three-fold improvement) it is unclear exactly which aspects of the changes had what effect. Perhaps a better approach would have been to more regularly update smaller parts of the home page (over time) so that we could better evaluate the effects of each individual component.
This is where an experiments (or “validation”) board with a clear “hypothesis being tested” and “success means” value can be really useful. It not only forces you to think about your success criteria and anticipated results, it often uncovers conflated goals in a single experiment. Sometimes conflating goals is unavoidable, but it’s important all the same to be aware of it going in, so that you can interpret the results.
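As a rough illustration of how a board entry can surface conflated goals, here is a minimal sketch in Python. The ExperimentCard class, its field names and the is_clean check are hypothetical, invented for this example rather than taken from any Lean Startup tool; the factors listed are the ones from the smoke test described above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExperimentCard:
    """One entry on a hypothetical experiments ("validation") board."""
    hypothesis: str      # the "hypothesis being tested" field
    success_means: str   # the "success means" field
    factors_varied: List[str] = field(default_factory=list)

    def is_clean(self) -> bool:
        # Assumption: "clean" means exactly one factor is varied in the test.
        return len(self.factors_varied) == 1


# Illustrative entry only; the wording of the fields is made up.
smoke_test = ExperimentCard(
    hypothesis="Visitors value the offer enough to sign up",
    success_means="Visitors to the landing page sign up",
    factors_varied=["messaging", "value proposition", "calls to action"],
)

if not smoke_test.is_clean():
    print("Conflated factors:", ", ".join(smoke_test.factors_varied))
```

Writing the factors down explicitly is the point; the check itself is trivial, and a paper board does the same job.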
The “conversion stack”
Related to keeping your experiments “clean” is understanding what in the “conversion stack” you are testing. I’ve created a diagram loosely outlining the different elements we’ve identified that drive conversions and that we might test in a given experiment. The diagram is something of a visual representation of the factors outlined above. I could probably write an entire post on each of these! In any case, we’ve found this a useful starting point when we’re thinking about experiments.
The “stack” builds left to right, i.e. the offer is dependent on the strength of the idea/concept, which is in turn dependent on the scale and intensity of the user problem you’re solving. All of these contribute to “conversion” in one way or another, but it can be useful to build your tests over time in stack order. That is to say, if you jump straight into smoke testing an offer and it fails, is it the underlying concept or the execution that’s failing? Whereas if you have already tested the underlying concept, you can have a greater degree of confidence that the failure is execution-related.
An example in practice: we decided to use Google advertising to generate enough traffic to get meaningful results in things like smoke tests. We set up a number of different ads, each testing a different aspect of the value proposition, to drive traffic to a (brand new) smoke test landing page that we hadn’t run before.
We got relatively low conversions but it was very difficult to discern which part of the experiment was failing. Was the messaging in the ads not meshing with the value proposition presented by the landing page? Was it the underlying value proposition, or just how it was expressed on the landing page (e.g. the copy)? Was the specific offer (signing up to a mailing list) just not presenting sufficient end-user value?
We learnt that we needed to run a series of smaller experiments over time testing each part of the “stack” to determine which part wasn’t working (FWIW, we’ve had some early success focusing on the “Execution > Messaging” part of the stack).
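One simple way to see where in a chain like this people drop off is to record a count at each step and compute the conversion rate between consecutive steps. The sketch below is a minimal illustration in Python; the stage names and all of the numbers are invented for the example and are not results from our experiments.

```python
def stage_rates(counts):
    """Return (from_stage, to_stage, rate) for each consecutive pair of stages."""
    rates = []
    for (name_a, n_a), (name_b, n_b) in zip(counts, counts[1:]):
        # Guard against dividing by zero when a stage saw no traffic.
        rate = n_b / n_a if n_a else 0.0
        rates.append((name_a, name_b, rate))
    return rates


# Hypothetical counts, for illustration only.
funnel = [
    ("ad impressions", 10000),
    ("ad clicks", 200),
    ("landing page sign-ups", 2),
]

for src, dst, rate in stage_rates(funnel):
    print(f"{src} -> {dst}: {rate:.1%}")
```

This only shows where the drop-off happens, not why. Separating the messaging from the underlying value proposition still takes the smaller, single-factor experiments described above, and ideally some in-person testing.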
Experiments aren’t a “one off” or “to the side” proposition
In pulling together the examples in this post, I have recognised a pattern in how some of these project teams have been going about their experiment practice.
In many cases, we’d map out a series of experiments, do them, and then get sucked back into the rest of our daily work. Home page updates were done “when we had time to think about it properly” and in big chunks, rather than little tweaks along the journey testing smaller aspects of our hypotheses.
It’s problematic to think of experiments as something “off to the side.” Finding a blend of activities, doing smaller, more granular and targeted tests, would perhaps be a better approach. Could they be a daily or weekly thing, embedded in the day-to-day? Would this get us a better result? This is something we’re moving towards, but I suspect it will be a constant effort to keep on track.
One challenge to this approach, though, relates to the effort involved in getting some experiments off the ground. The Lean Startup methodology suggests the shortest time for completing one learning loop is best. But what happens when your experiment requires a significant degree of effort to execute? For example, building an alternate sign-up process to improve performance may take up to a week of effort to implement. Or an experiment may require some significant modelling work to get the answers you need. It can be quite a challenge to break these down into even smaller chunks that we can build into daily workflows. And sometimes you just have to bite the bullet and get it done…
A word of caution
The above could easily fall into the category of “over thinking” or “too much detail.” It’s hard enough making an experiment happen; adding a whole other layer of complexity might not be all that helpful.
It’s important not to get too bogged down over-analysing things. In short: doing an experiment, even if it’s a long way from perfect, is better than doing nothing at all and working from assumptions.
However, I’ve found that rather than being an extra layer this is a useful “sanity check” to make sure we’re scoping our experiments appropriately.
In this light, I hope that sharing these experiences and #flearnings is useful in getting the most out of your own practice of experimentation.
Originally published at Zumio.