Reflecting on a year of experiments

Luca Albertalli
11 min read · Jan 6, 2019


A few weeks ago I left Shopkick. For almost three years I was a Product Manager there, covering many different products. It was a roller coaster ride, and like every good ride, it leaves you thrilled and ready for the next adventure.

In my case, the next adventure is leading the Experimentation Platform team for Sony PlayStation. So I thought it was a good time to go over what I learned in my last year at Shopkick as Director of Insights and Data Products. More specifically, I’ll share some lessons on how to successfully run experiments, remembering that “Success teaches us nothing; only failure teaches.” [Adm. H.G. Rickover]. This is planned as the first of multiple posts on being a data-driven Product Manager.

The power of a single metric

When I started coordinating the experiment meetings at Shopkick, I found myself staring at a long list of experiments, all running with multiple metrics. They were looking at increasing retention, improving activation, increasing the number of activities, increasing the time spent in the app, and so on… All of them together. And guess what? Yes, some of them were successful in moving one metric, but were they really successful as features? Not really.

OK, what’s the problem with that? There are several:

  • The first problem is that these are potentially conflicting objectives. Take retention and activation, for instance: it is relatively easy to shower users with push notifications (PNS) to make them activate. The risk of this strategy is that most users will revoke the permission to send PNS or start ignoring them, leading to a reduction in retention. Another example is profit vs. revenue: I could optimize for revenue by increasing marketing spending, thus growing the pie, or maximize profit by reducing marketing spending. But most of the time I hear executives and PMs say they want to increase both profit and revenue, and my head starts banging hard against the table.
  • The second problem is even worse. When you run an A/B test, you usually accept the result when the p-value is 5% or lower. The problem lies in what a p-value actually means: a 5% threshold implies that, when there is no real effect, you will still see a false positive roughly once every 20 tests. If you have ten metrics (and they are independent of each other), you are effectively running ten independent tests, so you have around a 40% probability of “detecting” an effect on at least one metric even if none exists (see the quick calculation below).
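To make that number concrete, here is a minimal Python sketch using the 5% threshold and ten metrics from the example above (the Bonferroni correction at the end is a standard textbook fix, not necessarily what any particular team uses):

```python
# Probability of at least one false positive when checking several
# independent metrics at the same significance threshold.
alpha = 0.05        # per-metric significance threshold
n_metrics = 10      # number of (assumed independent) metrics

# 1 - P(no false positive on any metric)
family_wise_error = 1 - (1 - alpha) ** n_metrics
print(f"Risk of a spurious 'win' with {n_metrics} metrics: {family_wise_error:.1%}")  # ~40.1%

# A simple, conservative fix: Bonferroni correction, i.e. judge each
# metric at alpha / n_metrics so the overall risk stays near 5%.
bonferroni_alpha = alpha / n_metrics
corrected_risk = 1 - (1 - bonferroni_alpha) ** n_metrics
print(f"Per-metric threshold {bonferroni_alpha:.4f} -> overall risk {corrected_risk:.1%}")  # ~4.9%
```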

This problem usually happens when the PM team doesn’t have a real vision of how to evolve the product or of the value it is trying to deliver. They have a set of vanity metrics they want to optimize and vague objectives, but no real vision of what in their product offers value to the users. Selecting a good single metric is difficult and deserves a discussion of its own, but it is utterly essential.

Collect a lot of data and analyze it

Photo by Carlos Muza on Unsplash

This goes back to the first A/B test I ran at Shopkick, when I moved from the Data Platform Team to the Presence Team. We were experimenting with a new button to allow users to check in. We ran the experiment with two variations (well, initially it was supposed to be a horrible multi-variate hodge-podge that I fortunately cut, but I’m digressing). Of the two variations, the expected loser failed spectacularly, even worse than my worst expectation, but the other variation showed a decent lift over control, something I didn’t expect (yes, I was expecting the experiment to fail). As usual, I started wondering why. With the help of my data analyst, I came up with a few hypotheses that we successfully tested simply by analyzing the data we had already collected.

This suggestion does not contradict my previous point, since there was still a single primary metric. But having additional metrics, multiple dimensions, and a vast amount of data, I was able to understand why my experiment succeeded and which factors were decisive, and that drove my next iteration. Now, a few words of caution are needed here. The secondary metrics that moved could have been just spurious movers (remember the ten-metric example above?). But the result lined up well with an intuition we had developed analyzing other data points (we had more of a perception problem due to slowness than a real functionality problem), and the sub-metrics were consistent with the experimental results. This does not imply we started a multi-month project to fix the issue. Instead, we ran another experiment to confirm the new hypothesis, and then another and another, incrementally working towards creating an awesome experience.

Agility matters

Photo by Marc Sendra martorell on Unsplash

One side effect of proceeding as I’ve explained above is that you will be terrible at maintaining and executing a roadmap. I’ve been in a situation where I was parachuted into a new team with six weeks to improve a feature that was not working. I started working as I explained above: multiple experiments, focus on the results, and analysis of the data to build new features incrementally. A decent part of what we implemented failed. Despite that, over the next six weeks we had to revise our OKRs three times because we kept beating them (they had been set based on previous experience). That’s the power of being experiment-driven. The drawback? We executed only one item on the roadmap and, at the end of the six weeks, my roadmap was mainly an unsorted backlog of ideas that I was prioritizing just in time for sprint planning. Our Project Manager seriously hated me. Furthermore, the leadership was uncomfortable with this approach, so after the six-week crunch they started pulling resources from my team: not because we weren’t effective or because what we were working on wasn’t important, but simply because I wasn’t able to outline a long-term roadmap to justify keeping the resources.

The key learning from that experience was that to support an experimentation culture you need a culture focused on being agile and on framing long-term planning in terms of outcomes, not features. And this approach should be clear at all levels, from the leadership down to each team.

At the leadership level, it requires a great ability to relinquish control, trusting your PMs and Eng Leads to make the right choices for you. In exchange, the leadership should provide support and coaching to the PMs and establish clear objectives that need to be achieved. At the team level, the PM and the Eng Leads are in charge of clearly explaining to the team the objectives, the results of the experiments, and why the direction changes so dramatically every time; this is the most complicated part and will require a blog post of its own. Finally, the Project Management team should move from a rigidly organized structure to a more fluid organization, helping the PM sort out dependencies and coordinate the work with other teams.

Scientist or Businessman, which one?

Photo by Adeolu Eletu on Unsplash

One of the most interesting discussions I had while leading the experimentation practice at Shopkick was with a colleague from the marketing team regarding one of her experiments. She was testing different incentive programs to increase user activation, and she was trying to measure the elasticity of the incentives: only new users, a 50/50 split, impact on activation as the experiment metric, additional engagement metrics, time-to-activation metrics, etc. The problem she ran into was the incremental budget needed to run the experiment. She had relatively little budget to invest, so she decided to use a relatively small bonus, one she suspected wouldn’t be very effective.

The discussion was between me, arguing that since the tested incentive was suboptimal she shouldn’t run the experiment at all and should save the budget for other tests, and her, claiming that the question was still essential to ask: we didn’t know, and maybe the small incentive was enough to cause a substantial move anyway. In doing so, she exposed the fundamental question of running controlled experiments as a Product Manager. Controlled experiments are the gold standard of any scientist, and being the PM of a successful app means we can run experiments on many more people than most of the controlled experiments published in scientific papers.

In most organizations, I usually see two contrasting approaches. When experimentation is not fully understood, I see a businessman-like approach: “Give me some data to support my intuition, and I will decide what the right thing to do is.” When the benefit of experimentation is fully part of the culture, I start seeing a more scientist-like approach: “We must know, we will know.”

Neither extreme is healthy. It is easy to be a HiPPO (Highest Paid Person’s Opinion), especially when faced with the complexity of running an experiment. Guess what? There’s always someone paid more than you; plus, you want to keep your job, so you had better be able to justify your opinion with reliable facts. At the same time, it is easy to experiment on everything. I’ve seen experiments that proved that sending push notifications increases app opens, no joke… I’ve seen papers showing that if you change the color of a button to red, it works better… No wonder: the control version is a gray button on a gray theme, and the experiment is red on the same theme. I don’t need an experiment to tell me that the red button gets more clicks. What I need is an experiment to understand how many more qualified clicks I get, and how many of those clicks convert to a purchase.

So, what was the point of the example I gave? Well, it was definitely an interesting test to run: we had previous data points on a similar incentive program, and that additional experiment would have given us insight into the elasticity of the program. The problem was the expected result: to get the budget to run the incentives we needed a high impact (and the experiment was designed to detect such an effect), but it was improbable, given the knowledge we had, to get that kind of impact. The experiment was great from a scientific point of view, but from a business perspective the knowledge we would gain from the experiment didn’t justify the cost (this is a complex topic, and I’ll cover it in more detail in the future).
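As an aside, “designed to detect such an effect” comes down to statistical power: the smaller the lift you need to detect, the more users (and, for a paid incentive, the more budget) the test requires. The sketch below uses the standard two-proportion sample-size approximation with made-up activation rates, purely to show the shape of the trade-off; it is not the actual calculation we ran.

```python
# Rough sample size per variation needed to detect a given lift in a
# conversion-style metric (e.g. activation rate). All rates here are
# illustrative, not real Shopkick numbers.
from scipy.stats import norm

def sample_size_per_group(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate users per group for a two-sided two-proportion test."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the significance level
    z_power = norm.ppf(power)           # critical value for the desired power
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = abs(p_treatment - p_control)
    return (z_alpha + z_power) ** 2 * variance / effect ** 2

baseline = 0.20  # hypothetical activation rate without the incentive
# A big lift (the effect the business case needed) is cheap to detect...
print(round(sample_size_per_group(baseline, 0.25)))  # ~1,090 users per group
# ...while the small lift a weak incentive might produce needs far more users.
print(round(sample_size_per_group(baseline, 0.21)))  # ~25,600 users per group
```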

I have another excellent example of this dichotomy. At Shopkick we had, in the first-use flow, a page that asked newly signed-up users to invite their friends. This flow is pretty standard, had been there for a while, and we had spent a lot of time over the years optimizing it. Seems pretty uncontroversial, no? I think this is one of the many growth-hacking success stories that get shared around… So, what’s the issue? Well, our UX designers were growing uncomfortable with that page. It was intrusive, and it encouraged spamming your friends before you even knew what Shopkick was about. From a UX point of view, it had to go… From a PM point of view? Well, a quick look at the data showed that less than 1% of the people were dropping off on that page. And we got more than 50% of the invited users from that page. Given that invited users have a much higher probability of being retained, it seemed a no-brainer to keep the page and ignore the designers’ suggestion, right?

Not so fast. I don’t care about invited users; I care about overall retained users and total kicks earned. A closer look at the data showed that invites from this page had a much worse conversion rate than other invites, meaning they were less qualified leads. Also, people who skipped the invite step ended up inviting later, so it was possible that removing the page would not reduce the total number of retained invitees but only lengthen the invite cycle time (usually known as the “viral loop”). Sounds like the right time to run an experiment? Exactly: an experiment with the overall number of week-4 retained users as the objective. The test almost confirmed the hypothesis. I hadn’t actually accounted for the increased time to invite, so when we looked at the experimental data we saw a drop in overall retained users; but looking at cohort results, for older cohorts the difference was not statistically significant even at 80% confidence (a sketch of this kind of cohort-level check follows below). So I had four choices:

  • look at the primary metrics and declare the experiment failed;
  • re-run the experiment considering a longer time to invite;
  • wait to see if all cohorts behaved the same;
  • or, what we did after discussing the results: we removed the page, kept monitoring the cohorts that were involved in the experiment, and moved on to a new set of hypotheses to improve the number of invites in this renewed scenario.

Why did we decide on this approach? In the end, we care about making the best decision possible for our users, and what the data showed us was compelling enough to remove the invite from the first-use flow and to move on to optimizing other parts of the flow to make up for the possible small drop in acquired users (remember, there is always a sensitivity problem). Additionally, from a scientific perspective, the results were in accordance with the theory we had developed by looking at the data; we could have experimented more to gain higher certainty, but we deemed the evidence good enough to move to the next step of experimentation.
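For reference, the cohort-level check mentioned above boils down to a two-proportion z-test on week-4 retention, run cohort by cohort. Here is a minimal sketch; the cohort names and counts are invented purely to show the mechanics, and the generic statsmodels test stands in for whatever internal tooling a team actually uses.

```python
# Per-cohort comparison of week-4 retention: control (invite page kept)
# vs. treatment (invite page removed). All counts are made up.
from statsmodels.stats.proportion import proportions_ztest

cohorts = {
    # cohort: (retained_control, users_control, retained_treatment, users_treatment)
    "newest cohort": (430, 2000, 384, 2000),
    "oldest cohort": (422, 2000, 411, 2000),
}

for name, (ret_c, n_c, ret_t, n_t) in cohorts.items():
    z_stat, p_value = proportions_ztest(count=[ret_c, ret_t], nobs=[n_c, n_t])
    print(f"{name}: control {ret_c / n_c:.1%} vs treatment {ret_t / n_t:.1%}, "
          f"p-value = {p_value:.3f}")
```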

You are wrong

Photo by Andrej Lišakov on Unsplash

Throughout this post, I’ve admitted a few mistakes I made, and I made many more of them. I’ve bet against a few features that proved to be successful. I’ve bet on some features (more than a few, indeed) that were failures. Furthermore, I’ve celebrated outstanding results only to realize they looked so good just because the metric was a bad one.

Admitting when you are wrong is difficult but extremely important, and each and every mistake you make is an opportunity to learn something. The real failure is not making a mistake but failing to learn from it.
Why am I saying this? Because when you are a data-driven, experiment-driven PM, your life will be full of errors. If you are not happy being wrong, look for another job; there is no place for you here. Statistics show that only between 20% and 30% of proposed experiments actually have a significant and positive impact; all the others are either insignificant or cause damage. Is this an issue? Not really, unless you invested a lot of money and effort in building the failed feature. And even then, if it didn’t work, it is better to know it for next time than to be unaware.

Unfortunately, this vision is sometimes hard to accept for many PMs and many business leaders. That does not reduce the importance of admitting you are wrong; it actually makes the case for young Product Leaders to learn the lesson and grow into better Business Leaders.



Luca Albertalli

Product Leader, Experimenter, Entrepreneur, and Startup Advisor. I often play with code but more often with ideas!