For nearly two years now, we at Flo have been using controlled online experiments, known as A/B tests, to validate our strongest product hypotheses. A mature process helps us choose the most promising ones.
It’s definitely worth investing in this process. If you learn to correctly formulate and prioritize your hypotheses, you will end up with more successful experiments. As a result, your product will improve faster and your users will be satisfied.
The responsibility for this process is shared by product managers, analysts, UX researchers, and designers. At Flo it is a team effort: decisions are made jointly. That is why it is important that everyone involved knows the basic definitions and principles of working with hypotheses, so that the team speaks the same language.
Hypothesis Formula and Its Components
It is important that no one on the team has a monopoly on formulating hypotheses. Any team member should be able to do this, so everyone should know the basic principles.
A hypothesis is a statement that requires proof and can potentially be verified through experiments. We formulate hypotheses as follows: “if we make <description of changes>, it will improve the experience of <user segment>, and <metric> will increase by <X>%”.
Therefore, a correctly formulated hypothesis has at least four mandatory components.
- Description of changes — what exactly do we want to change?
- User segment — who is our target group?
- The metric — how are we going to measure our success?
- The scale of metric change — what kind of change will be considered successful?
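The four-part template lends itself to a simple structure; here is a minimal sketch in Python (the class, field names, and example values are all hypothetical illustrations, not Flo's actual tooling):

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    """One product hypothesis in the four-part template."""
    change: str               # description of changes
    segment: str              # target user segment
    metric: str               # success metric
    expected_lift_pct: float  # scale of change considered a success

    def statement(self) -> str:
        # Render the hypothesis in the standard sentence form.
        return (f"If we {self.change}, it will improve the experience of "
                f"{self.segment}, and {self.metric} will increase by "
                f"{self.expected_lift_pct}%.")

h = Hypothesis("simplify onboarding", "new users", "D7 retention", 3.0)
print(h.statement())
```

Keeping hypotheses in one structured form makes it easy to spot when a component is missing before an experiment is planned.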
Let’s break it down into components.
Description of changes
The best way to describe changes is to create a small mockup that helps each team member quickly figure out what exactly will change for the users. You can use dedicated tools such as Figma (for UI changes) or Miro (for changes to server-side logic), or simply draw a sketch by hand. People absorb and remember visual information much better than text.
User segment
At this stage, we only need an approximate assessment of the audience segment. It is important to choose a specific segment of users with the same objectives and/or behavioral patterns.
Audience segmentation helps all team members better understand the product's users. There is no need for advanced ML methods here. For instance, we divide users into segments depending on their purpose for using the app — we ask for this information during onboarding. The list of purposes was determined by experts, based on the main objectives app users might have.
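In practice, such segmentation can be as simple as a lookup from the onboarding answer to a segment; a sketch with made-up goal names (not Flo's actual list):

```python
# Rule-based segmentation by the goal chosen during onboarding.
# Goal identifiers and segment names are hypothetical examples.
SEGMENTS = {
    "track_cycle": "cycle tracking",
    "get_pregnant": "conception",
    "pregnancy": "pregnancy",
}

def segment_of(onboarding_goal: str) -> str:
    # Fall back to a catch-all segment for unknown or missing goals.
    return SEGMENTS.get(onboarding_goal, "other")

print(segment_of("get_pregnant"))  # -> conception
```

The point is not the mechanism but the shared vocabulary: every team member can name the segment a hypothesis targets.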
The metric
There are several ways to choose a metric.
In the case of Flo, we are guided by the company's OKRs. The company-level KRs are a mix of business (monetization) and UX metrics.
As a result, we can see from the start which of the indicators we want to improve. This enables us to limit the set of hypotheses and choose only relevant ideas.
A product team cannot always influence top-level business metrics immediately and significantly. Moreover, these metrics are often composite. An obvious example is user LTV, which in subscription apps is a product of several factors.
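A common decomposition for subscription products (a textbook sketch, not necessarily Flo's exact formula) multiplies trial conversion, paid conversion, monthly revenue, and the expected paid lifetime:

```python
def subscription_ltv(conv_to_trial: float, trial_to_paid: float,
                     arpu_monthly: float, churn_monthly: float) -> float:
    """Toy LTV: P(trial) * P(paid | trial) * ARPU * expected paid months.

    With a constant monthly churn rate c, the expected number of paid
    months is the geometric-series sum 1 / c. All inputs are hypothetical.
    """
    expected_months = 1.0 / churn_monthly
    return conv_to_trial * trial_to_paid * arpu_monthly * expected_months

# 10% install->trial, 40% trial->paid, $5/month, 5% monthly churn:
print(round(subscription_ltv(0.10, 0.40, 5.0, 0.05), 2))  # -> 4.0
```

Each factor in such a product is a candidate component for the metric hierarchy described next.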
That is why our company is now developing a unified metric hierarchy, which will help:
- break down a business metric (such as LTV) into its components;
- link the components of business and product metrics;
- create several levels of product metrics depending on their sensitivity.
This kind of systemization makes the choice of metrics in a team much easier. Working from the top down, the participants can see more clearly what exactly they can influence.
It might seem that building a hierarchy is primarily relevant for sufficiently big products. But, in fact, it is advisable to decompose top-level metrics for building a basic hierarchy even for new products. After all, the main purpose of this exercise is for the team to find out which area can be influenced most effectively by each member of the team.
It is also worth mentioning proxy metrics, which can be included in the hierarchy. Proxy metrics let us measure the effect faster, at the cost of some error. For instance, the similarity between two binary metrics can be assessed using the Jaccard index.
In our data, the “retained M1” metric (return in month one after installation, counting months from zero) has a 60% similarity to the “retained up to D7” metric (return on the second to the eighth day after installation).
With the first metric, we have to wait two months to assess the effect; with the second, only eight days.
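A minimal way to compute such a similarity is the Jaccard index over per-user binary outcomes. The per-user flags below are invented, though this toy example happens to come out at 0.6:

```python
def jaccard(a, b):
    """Jaccard index between two binary vectors (per-user 0/1 outcomes):
    |both retained| / |retained by either metric|."""
    both = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    either = sum(1 for x, y in zip(a, b) if x == 1 or y == 1)
    return both / either if either else 1.0

# Hypothetical per-user flags: retained up to D7 vs. retained in M1.
retained_d7 = [1, 1, 0, 1, 0, 1, 0, 0]
retained_m1 = [1, 0, 0, 1, 0, 1, 1, 0]
print(round(jaccard(retained_d7, retained_m1), 2))  # -> 0.6
```

A high index means the fast metric and the slow metric usually agree on which users count as retained, so the fast one can stand in during the experiment.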
It is not always easy to choose the right proxy metric, because metrics differ in sensitivity. Nor is it always obvious how to handle the tradeoff between accuracy and calculation speed. The decision is easier if an internal KPI is set using accuracy (similarity) criteria.
Business vs. UX
In some cases, optimizing a business metric can conflict with the user experience — for instance, through more aggressive monetization. Such cases can be controlled using health metrics, which track the UX specifically. The health metric and the main metric do not, as a rule, move in tandem, and it is very important not to “drop” the health metric. The most popular example of a health metric is retention, but we also use other metrics at various levels of the hierarchy.
Metric measurement scale
And here is one of the trickiest problems: evaluating the metric change potential.
As a rule, at this stage we cannot obtain a highly reliable estimate. So we settle for an approximate assessment, which can also be validated intuitively (more on that later).
We know the audience segment likely to be influenced by the changes. But how do we obtain data about the effect?
1. “Cold calculation”
In some cases, the scale of change can be calculated. For example, before launching a new feature, you can ask users to what extent they are interested in Feature X. Based on their answers, you can estimate an interval for the potential conversion and the planned effect.
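For illustration, a “cold calculation” might turn survey answers into a conversion interval. All numbers below are hypothetical, including the discount range applied to stated interest (people routinely overstate intent):

```python
# "Cold calculation": survey interest -> expected adoption interval.
surveyed = 500
said_interested = 120                        # answered "very interested"
interest_rate = said_interested / surveyed   # 0.24

# Stated interest usually overestimates real behavior; apply a
# discount range (here 20-50% of stated interest, an assumption).
low, high = interest_rate * 0.2, interest_rate * 0.5
print(f"expected adoption: {low:.1%} - {high:.1%}")
```

An interval, not a point estimate, is the honest output here: it feeds the planned-effect component of the hypothesis.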
2. Record of experiments
This alone is a good reason to keep a record of experiments. We look for similar past experiments — for instance, ones launched on a different platform, in a different language, or in another country, or ones similar in the scope of changes (a simple example would be changing the color or shape of a button) — and take the average effect.
Failing that, we can go through all the experiments that involved the metric we need and take the median, the 75th percentile, or the maximum change, depending on our confidence in the hypothesis.
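A sketch of how such estimates can be pulled from a record of past effects (the effect sizes below are invented):

```python
from statistics import median, quantiles

# Effects (in %) of past experiments on the same metric; hypothetical data.
past_effects = [0.5, 1.2, -0.3, 2.0, 0.8, 1.5, 0.2, 3.1]

estimate_median = median(past_effects)         # conservative estimate
p75 = quantiles(past_effects, n=4)[2]          # 75th percentile
optimistic = max(past_effects)                 # ceiling for high confidence

print(estimate_median, p75, optimistic)
```

The more confident the team is in the hypothesis, the higher up this ladder (median, 75th percentile, maximum) the planned effect can sit.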
3. Market research
Sometimes the effect of a change can be inferred from indirect evidence in market research. As a rule, this applies to hypotheses that were in one way or another picked up from competitors or modeled on other products. With well-known services such as AppAnnie, we can see how an update affected a product's top-level metrics.
It is advisable to make this work both ways: when you borrow a hypothesis or feature from the market, check how it influenced the product it was originally implemented in.
We can come up with a breakthrough idea while still being clueless about the effect it might have. We recommend determining the Just Noticeable Difference at every level of the metric hierarchy. The scope of change will then be visible at the required level, and once you start the experiment, you will always know how successful it is.
What Is a Good Hypothesis?
We have learned to formulate hypotheses. But we have lots of them, and we need to understand how to find the most promising ones.
We use a five-point checklist that simplifies selection and helps us find good hypotheses.
So, a hypothesis is good if:
- it solves users’ real problems / tasks;
- it is substantiated by data analysis, UX, or market research;
- it is associated with a long-term product strategy;
- it noticeably increases metrics;
- it is quite easy to test.
It is easy to notice that some points from the checklist may in part contradict each other.
This is where the ICE prioritization method comes in handy; it helps us build a hypothesis backlog.
The checklist is easily converted into ICE elements:
- it solves users’ real problems/tasks [I];
- it is substantiated by data analysis, UX, or market research [C];
- it is associated with a long-term product strategy [I];
- it noticeably increases metrics [I];
- it is quite easy to test [E].
We use a Data-Informed ICE: the analysts inform the team about the projected Impact, but each team member makes their own ICE assessment. This adds an element of product intuition to the prioritization process.
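A Data-Informed ICE ranking can be sketched as averaging each member's scores per dimension and multiplying the averages (hypothesis names and scores below are invented):

```python
def ice_score(votes):
    """votes: list of (impact, confidence, ease) tuples, one per member,
    each dimension scored 1-10. Returns avg(I) * avg(C) * avg(E)."""
    averages = [sum(dim) / len(votes) for dim in zip(*votes)]
    impact, confidence, ease = averages
    return impact * confidence * ease

# Hypothetical backlog with two members' votes per hypothesis.
backlog = {
    "simplify paywall copy": [(7, 8, 9), (6, 7, 8)],
    "rebuild onboarding":    [(9, 5, 3), (8, 6, 4)],
}
ranked = sorted(backlog, key=lambda h: ice_score(backlog[h]), reverse=True)
print(ranked)
```

Averaging individual votes before multiplying keeps any single member (including the analysts) from dominating the final ranking.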
Having obtained the backlog of hypotheses, we continue with the process:
- We launch experiments.
You can learn about the technicalities of our experimentation service from an article written by Kostya Grabar, the product manager of our data platform.
- We collect the necessary data and analyze the results.
Anton Martsen, our senior product analyst, is writing a dedicated article about it.
- We draw conclusions and decide on the rollout.
This is the way Flo product teams work. Internally, there are two interconnected and never-ending processes: working with hypotheses and experimenting.
If we are to work with hypotheses in a systematic manner, they should be recorded and documented.
At the moment, we keep our hypotheses and do our prioritization in Google Sheets. We also use Google Sheets for connected hypotheses: a summary table of planned hypotheses that affect a single Objective. This is not very convenient, so we are looking for a more user-friendly tool that will let us combine OKRs, hypothesis work, and experiment results more efficiently.
We have been using the above process for the past several quarters and can note the following:
- greater involvement and interest on the part of the team members during planning, development, and analysis;
- fewer completed initiatives with low impact;
- acceleration of experiments that have an impact on long-term metrics.
Correctly formulating and prioritizing product hypotheses is not easy. But if the process is set up and launched correctly, the speed and quality of your product development will improve significantly through a conscious approach and efficient teamwork.