Becoming Data Driven Level 2: Using Data to Avoid Shitty Decisions

“Using Data to Avoid Shitty Decisions” is just a polite way of saying “mitigating risk”. Level 2 in this series is all about reducing the likelihood of releasing a change to an online product that may not yield a return on investment or is potentially harmful. Whereas Level 1 was a technical exercise of tracking & reporting, Level 2 and beyond are more philosophical.

You Are The “Ass” In “Assumption”

When coming up with an idea to change an existing product or launch a new one, it’s important to understand what your assumptions are. If you know what your assumptions are, especially those in which you have low confidence, you can start figuring out how to convert those assumptions into facts. The question to ask yourself is: what is the best way to do this? Here’s a diagram that is 100% correct and peer reviewed by scientists to help frame the discussion:

If you can launch something and remove it without permanently harming your user experience (“roll back”), then you should try to test it on real users with an experiment, giving you high confidence in the results. Another approach is to do research up front: seek out precedent, gather market data or perform user tests in a controlled environment. While this post focusses on the left-hand side of this diagram, both approaches are completely valid in the right context and even complementary; confidence is simply higher when it is based on real user data. Here are some stories…

New Product MVP

MVP means “minimum viable product”. An MVP is a product or feature that contains the least amount of functionality or effort necessary to test the viability of an idea. When it comes to existing products, the ability to experiment on an existing user base is an advantage. But when no user base yet exists, you need to be a bit more creative.

According to their website, “Rocket Internet builds and invests in Internet companies that take proven online business models to new, fast-growing markets”. This means that they are often mitigating some risk by taking a product they know is successful in one market and trying it in another market. The risk of failure is far from zero, though, and the well-understood assumption of this model is that an idea that works well in, say, America will work well in another country. Therefore tactics need to be developed to test that assumption. For example, when they were assessing the viability of Shopwings, a service to deliver goods from supermarkets to your home, they sent a marauding gang of students to a local store to buy and take a photograph of each item. Each photograph was placed on a website along with a description and a reasonable price. Money was spent on advertising this website and, as far as the users were concerned, this was real and running. Once they had established enough confidence, they could raise capital, hire a full team to build a product and deliver goods. Rocket have become pretty good at this type of technique and have used it to launch a lot of companies.

A more famous example of the same approach is online shoe retailer Zappos. Founder Nick Swinmurn took photos of shoes from a local shoe retailer and built a website to sell them. When orders came in, he went to the store, purchased and shipped the shoes. At a low cost, this addressed the assumption that people were comfortable enough taking the risk of buying shoes online. It’s fun to note that Zappos inspired Rocket Internet to create Zalando, which is now one of Europe’s largest online fashion retailers.

New Feature MVP

An MVP doesn’t have to be a whole new product. It can be a new feature on an existing product, allowing you to run experiments or perform user testing, depending on your ability to roll back. The same principle as a new product MVP applies: we are trying to apply the minimum amount of effort to test assumptions about our new feature.

An example that covers a lot of bases was a small experiment run at LYKE, a shopping app for fashion-conscious Indonesians. A significant differentiator for LYKE is that recommendations form the core of the user experience. This means the contents of the home screen, search results and push notifications are heavily influenced by prior browsing and purchasing behaviour rather than user-entered search criteria or lists curated by staff as is traditional. However, building enough data to generate recommendations requires a user to browse or buy, thus the experience for new users isn’t as great as it is for existing users.

To overcome this, LYKE decided an onboarding flow would be introduced, whereby users are asked questions about their tastes when they first open the app. Research was done, designs were made, users were tested in a lab and there was a good level of confidence. The biggest risk was the assumption that users would want to provide this information if they understood the value of doing so. To test this assumption, an MVP was designed containing questions about shopping preferences but without any other features required for the finished feature. A variant group of 50% of new users was exposed to this, with users arbitrarily split between odd and even device IDs (more about splitting audiences further down). No recommendation models were built around it and the data wasn’t stored. Google Analytics events (so simple…) were fired to track how many people closed the app while onboarding and at what stage, how many didn’t make it to the home screen and how many engaged further and made purchases, compared to a control group who received no such experience. The tracking events fired for the control and variant groups were named differently for easy comparison. The experiment showed most people closed the app. Something valuable was learned at a low cost: most users just wanted to shop and anything in the way of that is an unwanted distraction.

A key point in the example above is that the value of this project to LYKE was divergent from the value to the user. LYKE valued its personalised experience but most users just wanted to shop, unaware of the science steering their behaviour. By experimenting in this way, the team failed cheap, learned a lot about what their users want in their app, understood the real impact of interrupting their flow and gained evidence to help steer future product decisions. Using the data from the first experiment, they could understand the drop-off point and the sweet-spot of questions. With this data, they tried again with much more success… This time with a “Skip” button.

Types of Experiments

If you haven’t noticed, I’ve used the word “experiment” a lot. In the examples above, I’ve covered two different types of experiment:

  • Wizard of Oz testing. This is an MVP that seems completely functional to the user but, under the hood, is something far more manual, like the Rocket or Zappos examples.
  • A/B/n testing or multivariate testing. This means slicing your existing audience into two or more random “buckets”, showing each bucket a slightly different version of your product or feature and comparing the behaviour between the groups, as in the LYKE example.

These are special because, as far as the user is concerned, they are real. This gives you lots of confidence as to whether you should pivot (change your plans), give up or invest further based on the data you get. You can also test the validity of your ideas through surveys, research, concierge MVPs or just straight landing pages explaining what a product would do if it existed with some way for users to express an interest. While I love A/B/n testing & stories of Wizard of Oz testing, there are some hazards that people often run into…

Experimental Hazards

When you bucket users for your A/B/n test, you need to ensure a reasonable distribution of users between buckets. LYKE split users on odd and even device IDs but this is a terrible idea for two reasons. First, if you run two experiments at the same time without re-randomising users for each one, then the same set of users will see the variant of both experiments and the other set will see the control of both. You actually want it to be much more mixed than that so you can determine which experiment caused which impact.

The second problem of failing to randomise users is that if you run your experiments sequentially, you will have one audience that is always being experimented on and another audience that is “clean”. Over time, this may start impacting the underlying behaviour of one group when compared with another. If your experiments often change fundamental parts of your flow then it will become frustrating to those “variant users” if using your product is unpredictable or unstable. Running experiments sequentially will also make it a pain in the ass to schedule work.
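One common way around both problems is to hash the device ID together with the experiment's name, so every experiment shuffles users differently while each user still sees a consistent version. The following is a minimal Python sketch of that idea, not what LYKE actually ran; the function name `assign_bucket` and the experiment names are made up for illustration.

```python
import hashlib


def assign_bucket(device_id, experiment, buckets=("control", "variant")):
    """Deterministically pick a bucket for a device within one experiment.

    Hashing the experiment name together with the device ID means the same
    device always lands in the same bucket for a given experiment, but its
    bucket in one experiment says nothing about its bucket in another.
    """
    key = f"{experiment}:{device_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the hash as a large integer, then wrap it
    # onto the number of buckets.
    index = int.from_bytes(digest[:8], "big") % len(buckets)
    return buckets[index]


# Hypothetical experiment names, purely for illustration.
print(assign_bucket("device-123", "onboarding-questions"))
print(assign_bucket("device-123", "new-checkout-button"))
```

Because the assignment is a pure function of the device ID and the experiment name, nothing needs to be stored to keep a user's experience stable, and adding a third bucket for an A/B/n test is just a longer `buckets` tuple.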

The next common hazard is looking at experiment results and saying “yeah, good enough” ¯\_(ツ)_/¯. If you run an experiment, you need to be confident the results are not random. Imagine you flip a coin 1000 times: presumably it will land on one side slightly more often than the other. If you do the same thing again, assuming the coin’s weight distribution is even, there’s roughly a 50% chance the other side will come out ahead. To become confident your results aren’t random, you can test for “statistical significance”. There is an abundance of statistical significance calculators for you to pick from, like this one, this one, this one or this one, the last being my favourite despite needless page timeouts.
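If you would rather see what such a calculator does under the hood, here is a rough Python sketch of one standard approach, a two-sided two-proportion z-test. I'm assuming this is broadly what simple calculators do; each one may use a different method. The function name and the numbers in the usage lines are invented for illustration.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Compare two conversion rates; return (z score, two-sided p-value).

    conv_a / conv_b: number of users who converted in each group
    n_a / n_b:       total number of users in each group
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the assumption that both groups behave the same.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        # Nobody (or everybody) converted in both groups: no difference to test.
        return 0.0, 1.0
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value from the normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up numbers: 1000 users per group, 100 vs 150 conversions.
z, p = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value (a common but arbitrary threshold is 0.05) suggests the difference is unlikely to be a coin-flip fluke. It does not tell you the change was a good idea, only that it probably did something.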

Statistical significance is a great way to work out whether a change is likely to be the result of random chance but there is a reason to occasionally and temporarily ignore this: some changes take time for users to adapt to. If they are used to having a button in one place and you move it elsewhere, expect them to be somewhat frustrated or surprised before they recover again and adjust their habits. I’m unsure if there’s a good formula to account for this but when I’ve worked on iOS and Android native apps, we often assume a two or three day behavioural change after an update and ignore this period for analysis purposes unless it is catastrophic. We ignore it partly because engagement tends to increase after an app update and partly because users may be going through an adjustment if the latest release contains changes to the core flow of the app.

Before starting an experiment, it is very important to have a clear, measurable goal in mind, ideally one that improves your business. This is different from measuring the uptake of a new feature. It is, of course, important to measure whether your new feature is being used but it is more important to understand if that feature is making your product more effective. Say you run an experiment on your sock sales website for a new Wishlist function. This “Wishlist” allows users to save their favourite socks for later viewing. You can certainly measure how many people use this function, but it is more important to ensure that the people with access to the Wishlist then go on to make a purchase (“convert”). If lots of people use the Wishlist but don’t buy for some reason, then the Wishlist may be clutter.
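To make the difference between uptake and conversion concrete, here is a small Python sketch using the sock shop. All the numbers are invented, and the `GroupStats` class is just a convenient container I made up for the example.

```python
from dataclasses import dataclass


@dataclass
class GroupStats:
    users: int           # users in this experiment group
    wishlist_users: int  # users who added at least one item to the Wishlist
    purchasers: int      # users who went on to buy socks

    @property
    def uptake(self):
        """Share of users who used the feature."""
        return self.wishlist_users / self.users

    @property
    def conversion(self):
        """Share of users who made a purchase: the goal that matters."""
        return self.purchasers / self.users


# Invented figures: the control group has no Wishlist at all.
control = GroupStats(users=5000, wishlist_users=0, purchasers=250)
variant = GroupStats(users=5000, wishlist_users=1500, purchasers=225)

# Relative change in conversion between variant and control.
lift = (variant.conversion - control.conversion) / control.conversion

print(f"Wishlist uptake:    {variant.uptake:.0%}")
print(f"Control conversion: {control.conversion:.1%}")
print(f"Variant conversion: {variant.conversion:.1%}")
print(f"Relative lift:      {lift:+.0%}")
```

In this made-up scenario the uptake looks healthy on its own, but the goal metric moved the wrong way, which is exactly the situation you would miss by only tracking feature usage.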

The final hazard is what I call “emotionally over-investing”. When you build something, it is natural to want it to succeed. Should the data show what you have built isn’t working as you had hoped, then investigate why as you can learn something. But be careful not to look for reasons to justify the success of your change. If the Wishlist example from the previous paragraph results in users Wishlisting a lot but buying fewer socks, that might be unexpected. If you investigate and find that drop to be real, feel free to change the feature and try again but don’t launch this unchanged unless you are comfortable temporarily selling fewer socks.

Wrap Up

We covered how important it is to recognise assumptions and some ways to test them to increase confidence. Then I went on a long rant about what to avoid when running an experiment. The key takeaway from this is that there are reasonably straightforward techniques to establish whether the thing you’re about to release is a good idea and to learn about your users. You can even use experimentation to take real risks with real users, so long as you can roll back in a dignified way. If you cannot, then you should weight your efforts more towards research and user testing.

The next post is about using data (such as data gathered through experimentation) to help guide decision making. See you soon! Bye now! Bye!