7. Testing Hypotheses with Experiments
Remember how a product manager needs to be a kind of scientist? Perhaps a mad scientist at times if you dream big. But always grounded in evidence, taking care to measure things accurately.
Scientists learn everything they can about a subject and wonder about the parts nobody seems to have a clear answer for yet. They develop hypotheses. This is the creative part!
Towards the end of this year Rosenfeld Media will be publishing my book Product Management for UX People: From Designing to Thriving in a Product World (you may sign up there to be notified when it is available for order), the culmination of a multiple-year project that has unfolded with the input and support of the Design in Product community.
During the editorial and production process I am sharing early drafts of chapters, and I welcome any feedback here that will strengthen the material.
Hypotheses are just ideas about why things are the way they are. Once you’ve got a hypothesis, you can work on coming up with ways to test it, to prove it true or disprove it. Whether the hypothesis is right or wrong a good test will teach you something about what is going on.
UX Superpower Alert
Generating hypotheses may sound like something you do in an ivory tower or laboratory but it’s just a fancy way of saying coming up with theories and ideas you can test out. Any design exploration you’ve ever done has been a way of testing hypotheses. User research scripts are based on hypotheses you wish to explore. Don’t let the science-talk scare you. You’ve got this!
Experimentation as a way of life
A lot of product teams segregate experimentation from other software development activities (build and fix, primarily), either focusing a small team of engineers on experiments (possibly working with a growth-focused product manager or even a growth “hacker”) or rotating experiments in on a cycle such as every third sprint.
But this can be a reductive way to think about experimentation, almost always centered on the popular expedient of “bucket testing” (also called split testing, A/B testing, and multivariate testing).
The truth is that product management entails an endless series of “bets” that need to be tested and played out. Some are about launching something new and others about how to make things better (as discussed in Chapter 5, “The Business of Product is Business”), but they all involve developing working hypotheses of what is going on, what is hindering progress, what would unlock better outcomes, what to focus on next.
It is better to recognize that experimentation is a way of life and is threaded through all your decisions.
Build vs fix vs tune
One broad way to break down what a software development team can be working on at any given time is to note that the team is either building something new that does not yet exist, fixing bugs or other perceived deficiencies in some software that does already exist (but may or may not yet be released), or fine-tuning software that exists and has users and has the potential to get better.
Experimentation plays a role in all of these phases of development:
- Building starts with the determination of what to build, for what audiences, addressing which pain points they have in their current ways of doing what they do, and figuring out which “job” they will be willing to “hire” your software to do for them.
- Fixing bugs isn’t in and of itself particularly experimental, but prioritizing which bugs to fix (because bugs are kind of fractal-y and you can’t really ever fix them all for every scenario on every platform and device ever) represents a bet or testing of a hypothesis about what must work seamlessly for your customer and what will be good enough if it’s good enough.
- Tuning for improved outcomes is almost entirely a matter of experimentation, and it’s also the context in which most practitioners are highly aware of the experiments they are running.
A wise man once said “Hypotheses are ideas about why things are the way they are.” Often, for a product manager, hypotheses are more specifically attempts to explain the perplexing or unpredictable results showing up in the data. You’ll hear more about product analytics, metrics, and data analysis in the next chapter, but recall that a PM tends to be obsessed with certain key “north star” metrics, frequently going so far as to arrange for a daily morning email or Slack update, along with standing reports and charts that can be pored over whenever time permits.
One reason you might look at the same key north star metrics every morning is so you notice when they go wonky. Why did sales drop to zero at midnight last night? Why are downloads 4x the usual number today? Why are we trending on Twitter?
As you try to answer those questions, the ideas that suggest themselves to you as explanations are your hypotheses, but of course not all guesses are created equal, so it behooves you to make your hypotheses crisp and testable, and to have colleagues who you can discuss or sketch these ideas with and who can be your “thought partners” in refining them or riffing on them or nailing down the implications.
Because once you have a hypothesis you like, you’ll need to come up with one or more experiments you can do to test the hypothesis.
At 7 Cups, our service relied on volunteers trained to provide active listening to people seeking free emotional support online. We called these volunteers our Listeners. At one point we had a button on our global menu that read “Become a Listener” and we felt it could perform better as a recruiting affordance.
A conversational designer on my team, Heather Cornell, suggested the hypothesis that “New visitors to our website don’t know what a Listener is, let alone why they should want to become one.” This seemed pretty compelling but how could we test it?
Cornell proposed we try a different label for the button, such as “Volunteer as a Listener” or “Become a Volunteer.” Both of those alternatives performed better than “Become a Listener” did, and “Volunteer as a Listener” performed best of all. That option was still in the top menu at 7 Cups the last time I checked (Figure 7.1).
A hypothesis and experiment led us to the realization that there are a lot more people out there looking for volunteer opportunities than looking to take on a specific named role at one particular volunteer organization, as seen in the global navigation at 7cups.com.
What’s interesting is that a modest hypothesis (that people didn’t know what we meant by “Listener”) led to a significant improvement, and the subsequent experimentation revealed a further insight (there are people out there specifically looking for opportunities to volunteer).
Proposing and prioritizing experiments
As you break down the product you work on into components or functional areas, you’ll find it’s quite possible to generate many hypotheses about each piece of functionality or flow in each area. Ideas are a dime a dozen, and the product manager’s job is to provide focus by facilitating prioritization.
For any given hypothesis that you have deemed tackles a serious enough high-priority goal, the next job is to come up with experiments to test this hypothesis. Part of this is a matter of logic. If your hypothesis is that there are too many ads on the inbox screen of your mail app, and that this is leading your users to ignore all the ads, an experiment to test this hypothesis might be to reduce the number of ads displayed. If the hypothesis is correct, this should lead to more ad engagement. If not, then there are likely other factors at play.
But not all hypotheses are so straightforward as to suggest obvious experiments, and a failed experiment does not necessarily disprove your hypothesis (it might also be that the experiment failed to test the hypothesis effectively), so there can also be an element of art or creativity in coming up with experiments that efficiently and elegantly zero in on the “hinge” of the hypothesis to reveal the impact of making the change.
Table 7.1 shows a set of hypotheses paired with experiments that might help prove them one way or another.
Table 7.1: A set of hypotheses and some experiments to test them
Also, hypotheses and experiments are not always one-to-one. You can, and often should, come up with multiple ideas for experiments to test a hypothesis. This may be because you are searching for the angle that most effectively tests and reveals the potential of the idea, and it can also occur when you run a successful experiment and want to go back to the same well to see whether a follow-up experiment will yield further benefits on top of the optimization or success already achieved.
Table 7.2 shows multiple experiments generated as possible ways to test the original hypotheses.
Table 7.2: Two or more experiments for each hypothesis
So after you do this for a while you’re going to have a big pile of hypotheses and an even bigger pile of proposed experiments to test these hypotheses, which brings us back to prioritization.
One thing you learn quickly is you can’t test everything. You can’t research everything. You can’t answer every question or worry you have, and you can’t eliminate risk. But reducing risk (or “de-risking” in the jargon) is one of the primary goals of experimentation, which means that what ultimately should determine which experiments you run and which ones you put off comes down to identifying the riskiest bets you are considering and doing what you can to mitigate those risks in particular.
For example, tweaking the color of the button to try to catch more eyes may be something you can just go ahead and try without testing anything first, because the existing risk of maybe not having the ideal button color is not huge and the downside risk of being wrong is probably also not game-changing.
But now imagine that you have several important goals in tension (which you always do, by the way!). Let’s say you need to increase growth by juicing signups and improving retention around free services, but you also have a significant revenue stream from impulse buyers that you need to maintain and ideally grow as well. In this scenario, it might be easy to make the option of signing up and engaging in the free services much more attractive and prominent than the impulse-buy call to action, but the risk of doing so is tanking a critical revenue stream. In that case, rushing ahead to make the change feels too risky, and an experiment on a small subset of your traffic can be a good way to study the tradeoff before betting next quarter’s payroll on a longer-term growth ideal.
One other factor that complicates your prioritization of experiments is that running more than one experiment at the same time in the same area or flow of your product leads to muddy, hard-to-interpret results. So as long as you are running experiments that don’t conflict, you can have several in flight at the same time, up to your ability to keep track and cope.
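One lightweight guardrail is to record which area of the product each in-flight experiment touches and flag any overlaps before launching. Here is a toy sketch of that idea; the experiment and area names are invented for illustration:

```python
# Flag pairs of in-flight experiments that touch the same area of the
# product, since overlapping tests muddy each other's results.
# All names here are made up for illustration.
from itertools import combinations

in_flight = {
    "new-signup-copy": {"signup"},
    "checkout-button-color": {"checkout"},
    "inbox-ad-load": {"inbox", "checkout"},  # also alters the checkout upsell
}

conflicts = [
    (a, b)
    for (a, areas_a), (b, areas_b) in combinations(in_flight.items(), 2)
    if areas_a & areas_b  # any shared area means the tests may interact
]
print(conflicts)
```

Even a simple check like this makes the "don't conflict" rule concrete enough to enforce in a planning meeting.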
To dig deeper into how to do this at scale, check out “It’s All A/Bout Testing: The Netflix Experimentation Platform,” a post on Netflix’s tech blog.
You’ll want to both maintain your backlog of hypotheses and proposed experiments and constantly be prioritizing the upcoming tests and tracking the tests you have under way. Completed experiments, whether scored as wins or losses, should generate some insight into the validity of the hypothesis (or, at least, the efficacy of the experiment), and these insights — stacked up over time — belong with the rest of your research findings, likely in a repository or tracking tool shared with UX and data teams.
Whichever tool you use (I like to use an Airtable template, as shown in Figure 7.2), you’ll want to develop an agreed-upon rubric for ranking potential experiments and choosing which to prioritize for upcoming sprints. Factors to consider include:
- The potential reach of the experiment (how much traffic flows through the area of the test).
- The potential impact of a successful experiment (somewhat subjective, but are we shooting for a 10% improvement in the metric being tracked? 50%? 2x? 5x?).
- The engineering and other staff effort required to complete the experiment.
- How confident you feel and what evidence you have to support the hypothesis.
Figure 7.2 shows an Airtable template used to prioritize, track, and score experiments that was initially provided to me by product and growth expert Jesse Avshalomov and tweaked over time for various projects and clients. (Airtable is a relational database tool with a fluid user experience that some product managers find extremely flexible and useful for corralling and tracking complex moving systems.)
It helps you score experiments in terms of potential impact, effort, confidence in the hypothesis and a few other factors. (You can overrule the score when deciding what to do next but it helps a lot with the broad strokes of prioritization.)
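As a rough illustration, a rubric like this can be boiled down to a single score per experiment. The sketch below uses RICE-style arithmetic (reach times impact times confidence, divided by effort); the experiment names, numbers, and exact formula are invented assumptions, and your team's agreed-upon rubric will differ:

```python
# A minimal sketch of a RICE-style scoring rubric for ranking proposed
# experiments. All names, numbers, and weights are illustrative.
from dataclasses import dataclass

@dataclass
class Experiment:
    name: str
    reach: int         # weekly users who flow through the area under test
    impact: float      # expected lift if it wins (e.g. 0.10 for +10%)
    confidence: float  # 0.0-1.0, how much evidence supports the hypothesis
    effort: float      # person-weeks of engineering and design work

    def score(self) -> float:
        # Reach x impact x confidence, discounted by effort.
        return (self.reach * self.impact * self.confidence) / self.effort

backlog = [
    Experiment("Relabel 'Become a Listener' button", 50_000, 0.20, 0.7, 1.0),
    Experiment("Reduce inbox ad load", 200_000, 0.05, 0.5, 3.0),
    Experiment("Redesign onboarding flow", 80_000, 0.30, 0.4, 8.0),
]

# Highest score first: a starting point for prioritization, not a verdict.
for exp in sorted(backlog, key=Experiment.score, reverse=True):
    print(f"{exp.score():>8.0f}  {exp.name}")
```

As the text notes, you can always overrule the score, but a shared formula keeps the prioritization conversation honest.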
How to Run an A/B Test
Factors to consider when running bucket tests (I use all these terms interchangeably: split test, bucket test, multivariate test, A/B test) include the following:
- Statistical Significance
As with any development work, including building and fixing, running prioritized tests has to be weighed against other items in the backlog and then prioritized into sprints. Tests can in many circumstances run longer than a single sprint, so starting the test, overseeing the test to make sure it is running properly and no bugs have shown themselves, and then ending the test are all distinct tasks that you need to track individually.
The reason tests take varying amounts of time to resolve themselves is that they don’t yield meaningful results until they have included a statistically significant number of subjects, and depending on the traffic flow through the area where you are testing, getting enough people into each “bucket” can take anything from a day to many weeks.
So how do you know when you have achieved statistical significance? There are mathematical models to help determine this, and software tools that facilitate A/B testing (such as Amplitude) now include measurements of statistical significance as well as the likelihood that the result is correct. But as a very broad rule of thumb, you tend to need at least 2,000 people in each “bucket” before you can trust a result.
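For the curious, the math those tools run is essentially a two-proportion z-test. Here is a minimal, standard-library-only sketch; the conversion counts are invented, and real testing platforms layer refinements and corrections on top of this basic calculation:

```python
# Back-of-the-envelope two-proportion z-test for an A/B result.
# Numbers are made up for illustration; real tools do more than this.
from math import sqrt, erf

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for 'bucket B converts differently than A'."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Normal CDF via erf; doubled for a two-sided tail probability.
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# 2,000 users per bucket: 100 conversions (5.0%) vs. 130 (6.5%).
p = two_proportion_p_value(100, 2000, 130, 2000)
print(f"p-value: {p:.4f}")  # below the conventional 0.05 threshold
```

Notice that even a visible-looking lift (5.0% to 6.5%) only barely clears significance at 2,000 users per bucket, which is why the rule of thumb exists.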
One way to test this, which can be very interesting, is to run what I call an A/A test. Basically, you set up an A/B test and put new users into one bucket or the other, but you serve up exactly the same experience to people in each bucket. What you’ll see early on as the data starts rolling in is that B performs much better than A, or the other way around, no wait now it’s changed again! Just as you might flip a coin four or five times in a row and get Heads every time, if you flip it enough times, the number of Heads and Tails results will end up nearly equal (unless you are in a Tom Stoppard play).
Keep an eye on your test and note when the results approach parity and then stay pretty even from that point on, and don’t be surprised if it’s around when there are about 2000 people in each group.
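If you'd rather not wait for real traffic, you can simulate an A/A test in a few lines. This sketch gives both buckets the identical true conversion rate (5%, an invented figure) and shows how noisy small samples are compared to large ones:

```python
# Simulate an 'A/A test': both buckets get the same true 5% conversion
# rate, yet small samples routinely show one bucket 'beating' the other.
import random

random.seed(42)  # fixed seed so the run is repeatable

def observed_rate(n, true_rate=0.05):
    """Simulate n visitors and return the observed conversion rate."""
    return sum(random.random() < true_rate for _ in range(n)) / n

for n in (50, 500, 5000, 50000):
    a, b = observed_rate(n), observed_rate(n)
    print(f"n={n:>6}: A={a:.3f}  B={b:.3f}  gap={abs(a - b):.3f}")
```

Run it a few times with different seeds and you'll see the early "winner" flip back and forth, just as described above, before the gap settles toward zero.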
It’s extremely important to determine when you will end the test before you start it. Otherwise, the temptation is extreme to “cherry pick” by ending the test when the result you are rooting for is ahead. Similarly, you need to end the test when it has reached significance, instead of “just letting it run a little longer” in hopes that your underdog will still pull it out in overtime.
Once you’ve ended the test you have a measurement of its impact. Did more people in the varied bucket do the thing you wanted them to do? Was it a wash? Or did it actually depress the results? Any outcome is fresh information and welcome, but of course winning is better! Either way, you need to track and record these results.
One of the biggest dangers of A/B testing is “polishing a local maximum,” which means optimizing something to the utmost degree without realizing that there are much more valuable opportunities elsewhere (Figure 7.3).
In addition to tracking the numerical impact on goal metrics achieved by each A/B test, you also need to make a qualitative assessment of the meaning of the results. Did they validate the test, partially or in full? Did they reveal a flaw in the hypothesis (or the test itself)? Do they suggest further hypotheses or additional tests to try?
This is part of your product’s intelligence arsenal and building that up over time is even more important than “winning” the current tests you’re running.
It’s true that you learn things even from tests that fail to improve the results you were shooting for, and some people will even argue that you tend to learn more from failures than you do from success (although I believe this only holds if you take success for granted and refuse to examine and reflect on it as thoroughly as you do your failures).
However, winning is better!
And once you have a win, you can end your test and “lock it in.” Now, instead of just one half of your users (or one half of ten percent of your new users) getting the benefit of the improvement found by the test, everybody can get in on it.
Then it’s time to see if you can stack some more wins on top of that first one. Was the test as aggressive as possible, or did you hedge your bets to avoid disrupting other metrics? Can a further test take things up another notch? If there isn’t any easy variation on the successful test to try, what about the other experiments that came afterward in the prioritization stack-rank? Try one of them! In some cases you can return to the same well multiple times and turn a 20% increase into a 100% increase, or a 5x improvement into a 10x improvement.
The losses teach you something, but those locked-in wins can stay with you forever.
Problems with A/B Tests
A/B tests are the shiny thing that product folks gravitate toward. They seem easy to explain and understand but they can be misleading and fill you with a false sense of confidence.
Beyond the risk you’ve already seen of polishing a local maximum, there are several other major pitfalls with relying on this type of testing to make decisions.
Most of them boil down to two major themes:
- There is no way to know for sure if externalities have affected your test and if running the test again at another time under other circumstances would get the same result.
- At best you know what is happening but an A/B test cannot by itself tell you why.
The first set of problems relates to overinterpretation. In some ways a statistically significant test can disguise the guesswork surrounding it. (At least when you have mere “directional” signals you are forced to be skeptical, to verify patterns, and to investigate potential reasons for the behavior you’re seeing.)
The second set of problems involves projecting subjective qualitative interpretations of the data without validating them. As with so many other metrical signals you might have available to you, such as customer satisfaction or net promoter scores, user feedback, ratings and reviews, complaint volume to customer support, and so on, the real job is to do your research, interview users, and get deeper into why things are happening the way they are and not just what is going on.
From the Trenches
Another issue with A/B tests is that they are not always possible outside of mass-market direct-to-consumer products. As Clement Kao explained when sharing a day in the life of a B2B product manager, not only is the user base of many business products too small to generate statistically significant traffic, but the customers are not anonymous data points but rather specific businesses and people involved in high-touch customer success relationships. “Experimenting” on these customers by showing half of them one interface and half another is a disruptive nonstarter:
“In B2B, you lose a lot of these assumptions of you can’t actually A/B test because someone is trying to use your product to run their business. So if they have to train one cohort of users to use one workflow and another cohort to use a different workflow, you definitely cannot do that. Similarly, it’s not helpful to recruit a random ‘enterprise user’ when you’re trying to go after a specific set of like customer accounts.” [Clement to send cleaned up quotation draft.]
Beyond A/B Tests
Product managers have a bad habit of equating all experimentation with A/B tests (just as some reduce all user experience research to usability testing). Remember that experimentation is woven throughout product work at every level. So, more specifically, what are some other forms of experimentation to consider beyond the ever-popular bucket test?
- Variations on A/B tests
- Concierge and Wizard of Oz tests
- Smokescreen and fake door tests
- Broken Glass test (aka Hard test)
- “Eating your own dogfood”
- Partial rollouts
- Beta programs
- Holdback or Holdover experiments
- Sales experiments
- Process experiments
You can find a nice scheme for organizing these methods, shown in Figure 7.4, in Itamar Gilad’s Testing Product Ideas Handbook (which requires a free newsletter signup to access). It distinguishes experiments from other forms of idea validation, but this is partly a matter of semantics, and you can include most of the methods in the Tests stacks as experiments in the sense meant here.
Variations on A/B tests
Granted that A/B tests have their limitations, if you’re going to use them, you should know the range of related experiments, such as A/B/C tests and multivariate testing. In an A/B test, typically A is the control (the existing experience) and B is the variant tested against it. In an A/B/C test, you split the traffic into three equal-sized buckets and compare the control against two different variations.
A multivariate test is even more complicated. In this scenario, you are trying to test several things at once. To use a trivial example, consider button color, shape, and text. A multivariate test then splits users into sets of buckets. So some users may get the red button with rounded corners that says “Get in there!” and others will get the green button also with rounded corners that says “Start today,” and so on. Statistically, this requires even greater traffic to have each of the crosstabs (combinations of all possibilities) receive significant numbers, and the interpretation of the results is likely challenging and tricky.
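The combinatorial cost is easy to underestimate. Sticking with the trivial button example, a handful of invented variations already multiplies into a dozen cells, each of which needs its own statistically significant sample:

```python
# How combinations multiply in a multivariate test: three colors, two
# shapes, and two labels (all invented) yield twelve distinct cells.
from itertools import product

colors = ["red", "green", "blue"]
shapes = ["rounded", "square"]
labels = ["Get in there!", "Start today"]

cells = list(product(colors, shapes, labels))
print(len(cells))                    # 3 x 2 x 2 = 12 cells
users_per_cell = 2000                # the rough rule-of-thumb bucket size
print(len(cells) * users_per_cell)   # 24000 users just to fill the test
```

Twelve cells at roughly 2,000 users each means you need on the order of 24,000 users flowing through the test area before you can read the crosstabs with any confidence.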
Teams running a lot of A/B tests in flight are often really running disorganized multivariate tests without being aware of it.
Concierge tests
A concierge experiment is one in which the service provided is not being handled by a software algorithm but actually by a human being behind the scenes (sometimes an intern).
Note: This is not to be confused with “concierge features,” such as a chatbot that helps you find and trigger options in a product.
The role of the person can be explicit or disguised, but the point of a concierge test is to learn what customers value and what processes, workflows, language, and other choices work best. Once this is determined, you can get some code written to take the humans out of the service delivery and enable the feature to scale up.
If you’re up front about it being a human-driven service, as Tom Kerwin puts it, you can “get a ludicrous amount of real world data and experience from having the conversations and solving the problems in realtime.”
By the way, a related concept that is likely on its way out as terminology is “mechanical turk” (not to be confused with Amazon’s crowdsourced gigwork service), based on an 18th century chess-playing hoax in which an apparent mechanism disguised a human operator.
Wizard of Oz
Wizard of Oz testing is just about the same as concierge testing but it involves providing a convincing user interface and hiding the fact that the back end is just a person typing and performing tasks for the end user.
A startup called Aardvark used this approach to validate its “social search” value proposition, using cheap labor to experiment with robot response strings and to do the manual work of tracking down people in the network who might be able to answer a question. The intern would pose as a bot while texting with both the person doing the search and the person identified who might be able to provide an answer. (They sold the company to Google in under two years.)
Amazon reportedly also used this approach to initially develop their “people also liked” recommendations. They were manual at first, to prove they generated enough income and interest to be worth investing in a ton of data and algorithm development. Zappos launched its entire business this way.
From the Trenches
Tom Kerwin: “I had an amazing experience last summer doing a Wizard of Oz MVP with a small team. Eight days from team formed to first value delivered to customers. Then we iterated weekly. The first send took the whole team three days to hand-crank. Each week, the engineers fixed the most frustrating part, and after about eight weeks we were down to 15 minutes.”
Coined by Alberto Savoia, the term “pretotype” refers to a “fast, low fidelity version of your concept — be it a product, service or business — that is just complete enough for you to generate real, data-driven validation. It generally differs from a prototype in that it seeks to answer the question of ‘should’ it be built, not ‘can’ it.” (What is a Pretotype?).
Smokescreen and fake door tests
Not to be confused with “smoke testing” (which means a basic set of functional tests run to make sure nothing is broken, from the idea of firing up a machine and seeing if it starts giving off smoke!), a smokescreen is a promotion for a product that does not yet exist to determine the level of demand.
You may recall the story of how Jay Zaveri proved there was a pent-up demand for “Word on the iPad” by running ads claiming to have the solution ready. The signups generated by this ad were the result of a successful smokescreen test.
A fake door is analogous to a smokescreen, but rather than being a landing page or signup form, it is presented as a real feature in the product’s interface. When a customer tries to use the feature, they are instead presented with a promotion for the feature-to-be and sometimes a way to register interest (such as asking to be notified when it’s ready). The traffic to this phantom feature is one way to gauge interest.
There is a risk in this kind of test of frustrating your users!
Broken Glass test
Much like an intentional version of the product-market fit assessment, a Broken Glass or Hard test involves offering a feature but deliberately making it difficult to access or use. This is a way of determining if the demand for the feature is strong enough to invest in developing it further.
“Eating your own dogfood”
Dogfooding, or “eating your own dogfood,” means testing a feature in house on your own employees before rolling it out to customers. Google famously did this with a great deal of success for Gmail and with a great deal of failure for Google Plus. Your employees are not always the best proxies for your customers, but one real advantage of “dogfooding” is that it makes it much harder to ignore usability issues and other frustrations when they are hampering your own ability to get work done.
Partial rollouts
When planning a significant change to an existing product with a substantial user base, a partial rollout allows you to gauge acceptance and adoption and to troubleshoot issues that may not have presented in research, design, or usability testing.
A common approach is to roll out a new feature first to just 10% of the user base and monitor the response closely. If a problem occurs, roll back and fix. If all seems well, then roll the feature out to 20% of users and repeat. At some point you may feel comfortable jumping straight to 50% of users and then eventually to all.
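Under the hood, partial rollouts are often implemented by hashing each user into a fixed slot, so the same user keeps seeing the same experience as the percentage ramps up. Here is a minimal sketch of that idea; it is not any particular feature-flag library's API:

```python
# Deterministic partial-rollout bucketing: hash each user id into one of
# 100 slots so a user's experience is stable as the rollout ramps from
# 10% to 20% to 50% to 100%.
import hashlib

def in_rollout(user_id: str, percent: int) -> bool:
    """True if this user falls inside the first `percent` slots."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    slot = int(digest, 16) % 100
    return slot < percent

users = [f"user-{i}" for i in range(10_000)]
for percent in (10, 20, 50, 100):
    included = sum(in_rollout(u, percent) for u in users)
    print(f"{percent:>3}% rollout -> {included} of {len(users)} users")
```

Because the slot assignment is deterministic, widening the rollout only ever adds users; nobody who already has the feature loses it at the next ramp step.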
Beta programs
A beta program is another way to offer speculative and new features to a core dedicated user group that is willing to test things for you that are not fully baked. Once features are validated, they can be rolled out to non-beta users, and beta users can start playing with even newer feature ideas.
Holdback or Holdover experiments
A Holdover or Holdback test involves transitioning the product to your new feature or change but keeping the old way for a small group of users to keep track of the effects of the change over time. As Ryan Rumsey, founder of Second Wave Dive, puts it, “The holdover is a nice way to look at performance over time. I think many teams assume an initial test result = same results over time. I’ve found many features were used initially because they were new, but then dropped back after 90 days.”
Sales experiments
Outside of the product’s feature set and user experience, you can apply experiments to other aspects of the value chain, such as sales or marketing. One example of such an experiment is called a “pitch provocation,” which involves trying out one or more provocative pitches to determine which best makes the case for your solution. A pitch provocation takes the form “You’ve got a big problem and we can help.”
Tom Kerwin says, “It’s a way to help tease apart your understanding of possible value propositions and problems. We create several extreme-and-likely-wrong versions of each, and then get prospects/participants to react: to tell us what they think these mean etc. From that, we can triangulate a better sense of the space.”
Process experiments
Remember that being agile means constantly evaluating how your team is working and looking for ways to improve it. Beyond passively noting process problems and then looking for solutions, you may also consider experimenting with variations in your processes to determine what works best.
As Ryan Rumsey puts it, you can ask yourself questions such as “What happens with decision-making or velocity when we change our storytelling structure?” and then experiment with those changes to see what impact they have.
If you’re ready to embrace experimentation as a way of life, get ready to look at everything through that lens!
A Day in the Life of a Startup PM
Nicholas Duran, senior product manager, Suvaun, a healthcare benefits startup.
How mature (or how long established) is the organization you work for? Four years old with startup mindset
Share anything else that might help describe the environment in which you practice product management: Young company recently acquired, growing a technology platform for a multi-billion-dollar industry and captive audience that is largely resistant to change.
How do you spend the early morning? Typically review/update/organize the to-do list for the day and week. Hit urgent items first and then on to meetings, planning updates and documentation.
How do you start your workday? AM family routine, coffee, more coffee (maybe a quick peek at online news and feeds for noteworthy industry updates), quick system check for any new tasks, cal invites, or monitoring alarms. Then a healthy morning standup meeting with the team. After that it is off to the races.
How do you spend most of the morning? Clearing roadblocks and doing correspondence.
How does the morning end? The ongoing routine and clock blur into a state of hunger. Then a decision is made whether there is time or not to proceed on to lunch.
When do you take a lunch break, and what do you have for lunch? Midday is a typical lunch range — anywhere between 11AM and 3:00PM. Working remotely, the menu has turned into a quick grab from the fridge in the form of a sandwich, salad, or snack. Maybe a 10–15 min step outside for some sun on the face.
What do you do first in the afternoon? Check email and recover from lunch
How do you handle “firedrills” or other unplanned work? Carefully! Check priorities, assess risk, and schedule accordingly. Depends on how severe.
How do you spend the bulk of the afternoon? Meetings, use cases and operational excellence improvements
What do you do at the end of the workday? Refill my coffee, update my notes for the next day, check feeds and updates online, check LinkedIn, and thank the team for another great day.
Do you work in the evening? When necessary.
- Experimentation is a way of life for product managers.
- Build, fix, and tune are all aspects of development you can experiment with.
- Develop testable hypotheses about how to improve results and fix problems.
- For each hypothesis, generate as many ideas for experiments as you can.
- Prioritize both hypotheses and experiments rigorously.
- Use experimentation to “de-risk” your riskiest bets.
- Don’t reduce all experimentation to A/B tests.
- Make sure your test results are statistically significant.
- Be careful not to focus on relatively trivial improvements.
- Stack up your wins!
- Experiment widely, not just with feature variations.