Short-term to Long-term Impact
A framework to get to long-term impact from short-term impact
What is it about?
An A/B test is a standard methodology for validating a hypothesis and getting a sense of the impact on the bottom line. Companies run experiments for days, weeks, or in some cases months. The lift we see on metrics during the experiment period (say, two weeks) often doesn’t hold in the long term (say, four weeks or twelve weeks).
Of course, it is rarely practical for a business to run an experiment for a longer duration solely to measure the long-term impact. That’s why we need an analytical way to extrapolate the short-term impact, one that mimics user behavior.
Why do we need long-term impact?
Launching features based on 2–3 weeks of data may optimize the product experience for users who are in a hurry. The final product, after thousands of such optimizations, may not serve the needs of every user persona. Additionally, a consumer-centric product feature that takes a long time to show its benefit will get deprioritized in favor of a feature with little consumer value but quick wins.
The product experience may morph into a sleazy salesperson.
As an extreme analogy, think of the strategy Messenger employed to grow its user base. Looking only at the short-term impact would never have convinced the Messenger team to go ahead with those tactics.
Why is long-term impact different from short-term?
First, let’s develop some theories on why long-term impact might differ from short-term impact. This understanding will help us choose a model to extrapolate the short-term impact.
We need to zoom into three concepts — user mix, fatigue, and education. User mix refers to how the mix of users changes with time. ‘Fatigue’ and ‘education’ are more about user behavior — how users respond to the test, and how that response changes with time.
- Mix: The mix of incremental users getting exposed to the test changes with time. It refers to the profile of the users being exposed. Frequent users visit the product more often, so they get exposed earlier. No wonder, when we plot the distribution of users over time, the user mix skews toward low-frequency users in the latter part of the test.
- Fatigue: Experience fatigue or offer fatigue. The idea is that the response rate or adoption goes down as the novelty of the experience wears off.
- Education: Some product features require training. The more users understand the product, the more they like it. Users don’t go crazy the first time they get exposed to it, but it grows on them.
Let’s get deeper:
First, let’s tackle user mix. It is the easier of the concepts to understand and to model.
For illustration, here is one example of how mix changes over time.
Focus on the blue line first. It represents the absolute number of incremental users getting exposed to the test. Each product or product component may have a different blue line, but most often it will look similar to the curve below. The number of incremental users exposed to the test goes down over time, unless the product has no repeat visitors.
The bar chart tells us that the mix of users being exposed to the experience is changing week by week. As time goes by, a significant chunk of the additional users exposed to the test are new users. If the response rate/lift for ‘first-time visitors’ and ‘repeat visitors’ differ, it is almost certain that the week-over-week response rate will change.
‘First-time visitors’ and ‘repeat visitors’ are just one of many dimensions along which to look at the data. We can break or group users by many dimensions/segments: account age (new, 30 days, 90 days, 90 days+), frequency (daily visitor, weekly visitor, monthly visitor), or a segment like new, retained, and resurrected. We can use a fancy segment if it helps. What we need to keep in mind is: the more the lift and response rate vary by segment values, the more effective the segment is for extrapolating the impact.
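To make the segment idea concrete, here is a minimal sketch that computes the response rate per week per segment from an event log. The segment names and the event-log format are illustrative assumptions, not from the original analysis.

```python
# Hypothetical sketch: week-over-week response rate by visitor segment.
from collections import defaultdict

def response_rate_by_segment(events):
    """events: list of (week, segment, responded) tuples.
    Returns {(week, segment): response_rate}."""
    counts = defaultdict(lambda: [0, 0])  # (week, segment) -> [responses, exposures]
    for week, segment, responded in events:
        counts[(week, segment)][0] += int(responded)
        counts[(week, segment)][1] += 1
    return {key: resp / total for key, (resp, total) in counts.items()}

# Toy data: first-time visitors respond less than repeat visitors,
# so as the mix shifts toward them the overall rate drifts down.
events = [
    (1, "first_time", True), (1, "first_time", False),
    (1, "repeat", True), (1, "repeat", True),
    (2, "first_time", False), (2, "first_time", False),
    (2, "repeat", True), (2, "repeat", False),
]
rates = response_rate_by_segment(events)
```

If the per-segment rates stay stable while the overall rate moves, the movement is a mix effect, which is exactly what the extrapolation later in this post exploits.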
One more important segment we need to look at before we get into the ‘user behavior’ section is the ‘exposure’ segment. The exposure segment splits users into groups based on the number of times they have been exposed to the new experience. As time goes by, a larger share of users has been exposed to the test more times, so the mix changes.
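As a sketch, the exposure segment can be derived from a raw exposure log by counting events per user and bucketing the counts. The bucket boundaries here are illustrative assumptions.

```python
# Hypothetical sketch: bucketing users into exposure segments by the
# number of times each user has seen the new experience.
from collections import Counter

def exposure_segment(n_exposures):
    # Bucket boundaries are illustrative, not prescribed by the post.
    if n_exposures <= 1:
        return "1"
    if n_exposures <= 3:
        return "2-3"
    return "4+"

def exposure_mix(exposure_log):
    """exposure_log: list of user ids, one entry per exposure event.
    Returns the share of users in each exposure segment."""
    per_user = Counter(exposure_log)
    seg_counts = Counter(exposure_segment(n) for n in per_user.values())
    total = sum(seg_counts.values())
    return {seg: count / total for seg, count in seg_counts.items()}

log = ["u1", "u2", "u2", "u3", "u3", "u3", "u3", "u4"]
mix = exposure_mix(log)  # u1 and u4 seen once, u2 twice, u3 four times
```

Rerunning this week over week shows the mix drifting toward the heavier-exposure buckets, which is the effect described above.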
User behavior — ‘Fatigue’ or ‘Education’ or a combination of both
The response rate by exposure segment helps us understand the ‘fatigue’ and ‘education’ pieces. The curve represents how users respond to a new experience. If a new experience requires learning before users start responding to it, we will see the ‘education’ curve. If users’ interest starts wearing off as they get exposed to the new experience more, we’ll see the ‘fatigue’ curve.
The charts below plot the curves for pure ‘education’ and pure ‘fatigue.’ For each experiment, the steepness of the curve will differ. Pure ‘education’ or pure ‘fatigue’ may not exist in the real world.
The actual curve could have a shape like the graph below. First, we may see the lift going up, and then we may see it tapering off. Users may take ’N’ exposures to get used to the experience, and after that we may see the wearing-off effect.
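One simple way to model such a curve is to multiply a rising ‘education’ term by a decaying ‘fatigue’ term. The functional forms and rates below are illustrative assumptions, not fitted to real data.

```python
# Hypothetical sketch: lift as a function of exposure count, combining an
# 'education' ramp-up with a 'fatigue' decay.
import math

def lift(n, max_lift=0.10, learn_rate=0.8, fatigue_rate=0.15):
    education = 1 - math.exp(-learn_rate * n)  # rises toward 1 with exposures
    fatigue = math.exp(-fatigue_rate * n)      # decays from 1 with exposures
    return max_lift * education * fatigue

curve = [lift(n) for n in range(1, 11)]
peak = curve.index(max(curve)) + 1  # the 'N' where lift peaks before wearing off
```

With these rates the curve rises, peaks after a couple of exposures, and then tapers off, matching the combined education-then-fatigue shape described above.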
Getting to the long-term impact
So far, we have broken user behavior down into components — user mix and education/fatigue. We have developed a thought process on how user mix affects the response rate, and how education or fatigue affects the response rate. We also know that ‘education’ and ‘fatigue’ can both exist for an experience: ‘education’ may trump ‘fatigue’ early on, but ‘fatigue’ eventually catches up. This gives us an excellent base to get started with the extrapolation process.
Any forecasting or extrapolation starts with a set of assumptions. The soundness of those assumptions determines the robustness of the model: if the assumptions hold, our model holds. Let’s start with the core assumption.
The behavior of users in segment X with value Xi and exposure segment Ei tomorrow will be the same as the behavior of users in the same segment cell today.
A couple of points on the assumption:
- Segments need to be chosen based on the user behavior of the product. In general, the more the response rate varies by segment values, the better the segment is for extrapolation.
- We could choose multiple segments, or no segment at all, on top of the exposure segment. The choice of the number of segments, and of the specific segments used for modeling, must be made by an analyst who has thought deeply about user behavior. Typically, it is not a good idea to use too many variables; read more about the bias-variance tradeoff.
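A quick way to compare candidate segments is to measure how widely the response rate spreads across each segment’s values. This is a minimal sketch; the segment names and toy data are illustrative assumptions.

```python
# Hypothetical sketch: rank candidate segments by the spread of response
# rates across their values (bigger spread => more useful for extrapolation).
def segment_spread(rows, segment_key):
    """rows: list of dicts with segment values and a 'responded' flag.
    Returns max - min response rate across the segment's values."""
    totals = {}
    for row in rows:
        val = row[segment_key]
        resp, n = totals.get(val, (0, 0))
        totals[val] = (resp + int(row["responded"]), n + 1)
    rates = [resp / n for resp, n in totals.values()]
    return max(rates) - min(rates)

rows = [
    {"account_age": "new", "frequency": "daily", "responded": True},
    {"account_age": "new", "frequency": "daily", "responded": True},
    {"account_age": "new", "frequency": "weekly", "responded": False},
    {"account_age": "old", "frequency": "daily", "responded": True},
    {"account_age": "old", "frequency": "weekly", "responded": False},
    {"account_age": "old", "frequency": "weekly", "responded": True},
]
# Here response rates barely differ by account age but differ a lot by
# visit frequency, so frequency is the more useful segment.
```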
The assumption lays down a way to calculate the response rate at any given time. What it says is that, at any given moment, the overall response rate or lift is nothing but the weighted average of the response rates of all the segment cells.
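The original equation image did not survive extraction; a plausible reconstruction, where $w_{i,j,k}$ is the share of exposed users in segment cell $(i,j,k)$ and $r_{i,j,k}$ is that cell’s response rate, is:

```latex
r \;=\; \sum_{i,j,k} w_{i,j,k}\, r_{i,j,k},
\qquad \text{with} \quad \sum_{i,j,k} w_{i,j,k} = 1 .
```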
Let’s say we want to extrapolate 2-week impact to get to 4-week impact.
For two weeks of data, the equation looks like the one below.
Now, to get to the 4-week response rate or lift, we use the formula below.
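The equation images are missing here; a plausible reconstruction consistent with the weighted-average form, where the superscript denotes the measurement window in weeks and $w^{(4)}_{i,j,k}$ is the projected 4-week user mix, is:

```latex
r^{(2)} \;=\; \sum_{i,j,k} w^{(2)}_{i,j,k}\, r^{(2)}_{i,j,k},
\qquad
r^{(4)} \;\approx\; \sum_{i,j,k} w^{(4)}_{i,j,k}\, r^{(2)}_{i,j,k}.
```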
Please note that we used r(2)(i,j,k) to get to r(4). This is because of our assumption: we expect that r(i,j,k) doesn’t change with time. We can gain more confidence in the assumption by comparing r(1)(i,j,k) and r(2)(i,j,k). If it makes sense, we can combine the two numbers with some heuristics to get an r(i,j,k) that we are more comfortable extrapolating with.
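The whole extrapolation can be sketched in a few lines: hold the per-cell response rates measured in the 2-week test fixed (the core assumption) and re-weight them by the projected 4-week user mix. The segment cells, weights, and rates below are illustrative assumptions.

```python
# Hypothetical sketch of the extrapolation step.
def weighted_rate(r_cell, weights):
    """Weighted average of per-cell response rates.
    r_cell, weights: dicts keyed by segment cell; weights must sum to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9
    return sum(weights[cell] * r_cell[cell] for cell in weights)

# r(2)(i,j,k): per-cell response rates measured over the 2-week test.
r2 = {("first_time", "exp_1"): 0.02,
      ("repeat", "exp_1"): 0.05,
      ("repeat", "exp_2+"): 0.04}

w2 = {("first_time", "exp_1"): 0.5,   # observed 2-week user mix
      ("repeat", "exp_1"): 0.3,
      ("repeat", "exp_2+"): 0.2}
w4 = {("first_time", "exp_1"): 0.3,   # projected 4-week mix: more repeat
      ("repeat", "exp_1"): 0.3,       # visitors, heavier exposure
      ("repeat", "exp_2+"): 0.4}

r2_overall = weighted_rate(r2, w2)  # observed 2-week overall rate
r4_overall = weighted_rate(r2, w4)  # extrapolated 4-week overall rate
```

The hard part in practice is projecting the 4-week weights w(4); the per-cell rates are simply carried over under the stability assumption.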
You may have noticed that we have talked about extrapolating a 2-week impact to a 4-week or 12-week impact. For some products, we can use the same framework to get to a 1-year impact; it applies over longer durations for products that have a long relationship with their consumers.
Special shout-out to Vidhya Kannan and Jing Zhang for hours of brainstorming. Thanks for all the work that was needed to make the concept a reality at realtor.com