A/B Testing for Product Managers

Andrew Hon
Published in Product School · 10 min read · Feb 16, 2022

A/B testing (aka controlled experimentation) is used by top tech companies to build better products. I’ve built A/B testing platforms at Disney Interactive and at Tinder, and the use of A/B testing continues to grow across the tech industry. This article is derived from a talk I gave to developing product managers at Product School in 2021, as a practical introduction. Here’s why you should be A/B testing too!

TL;DR in three points:

  1. A/B testing is a simple idea that can be simple to apply
  2. Useful for more than incremental optimization — A/B tests can yield deep insight
  3. Fail Fast and Just Test It — A/B tests have the highest ROI of any data activity

The phrases “A/B testing” and “controlled experimentation” will be used interchangeably throughout this article.

So, why experiment? Say you were interested in understanding why people get bitten by sharks. In the course of your research, you might come across this striking correlation between shark attacks and ice cream sales:

Ah ha! Clearly ice cream sales are causing shark attacks! We even have photographic evidence!

Joke aside, it’s much more likely that these two trends merely correlate, and that what’s truly driving shark attacks is the same thing that drives ice cream sales: hot weather that peaks in the summer, causing people both to go swimming in the ocean and to eat more ice cream. This illustrates the limitation of simple correlational data analysis when you want to determine why something happens, i.e., causality. As the saying goes:

“Correlation does not necessarily imply causation”

A better way to determine causality is with an A/B test or controlled experiment, like this:

Courtesy of Oregon State University

The plants on the left started out the same as the plants on the right, but then the ones on the left were given a special treatment. The group on the right did not receive the treatment and is held back as a “control” group. We can see differences emerging between the treated group on the left and the control group that suggest the special treatment is having an effect. With enough pots of plants in each group, called “samples”, statisticians can crunch numbers to tell us how confident we can be that the effect we’re seeing is not due to random chance. This is how a controlled experiment allows us to make a strong claim about cause and effect.
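To make the “crunch numbers” step concrete, here is a minimal sketch of the statistics behind such an experiment. The plant heights are made up for illustration; the test itself (Welch’s t-test) is a standard way to ask whether two groups differ by more than chance.

```python
# A sketch of the statistics behind a controlled experiment.
# The height measurements are hypothetical, for illustration only.
from scipy import stats

# Final height (cm) of each pot -- each pot is one "sample".
treatment = [14.1, 15.3, 13.8, 16.0, 14.9, 15.5]  # received the treatment
control = [12.0, 12.8, 11.5, 13.1, 12.4, 12.9]    # held back, no treatment

# Welch's t-test: how surprising would a gap this large be under pure chance?
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# A small p-value (conventionally < 0.05) means the observed difference
# would be very unlikely if the treatment had no effect.
```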

How about a real example in a software product people use every day?

These are three sets of button icon designs we A/B tested at Tinder within our newsfeed feature. The intention was to update our iconography with a newer design language: the middle and right sets are different versions of a more modern outline motif. However, in this experiment the original set on the left measured 5% to 8% higher engagement! A/B tests take effect on the actual product millions of real people are using every day, so we learn directly from where the “rubber meets the road”. The A/B test shows us the potential reality of the feature being rolled out to the entire userbase.
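How do we know a 5% to 8% lift is signal rather than noise? Here is a minimal sketch of one standard check, a two-proportion z-test on engagement rates between two variants. The counts are illustrative, not Tinder’s actual numbers.

```python
# Compare engagement rates between two button-icon variants.
# All counts are made up for illustration.
from statsmodels.stats.proportion import proportions_ztest

engaged = [5300, 5000]    # users who engaged, per variant (A, B)
exposed = [50000, 50000]  # users shown each variant

z_stat, p_value = proportions_ztest(count=engaged, nobs=exposed)
lift = (engaged[0] / exposed[0]) / (engaged[1] / exposed[1]) - 1

print(f"lift = {lift:+.1%}, z = {z_stat:.2f}, p = {p_value:.4f}")
# lift = +6.0% here; a small p-value suggests the gap is not random chance.
```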

A/B testing isn’t limited to simple button changes. The first user session of an app is often a fruitful place for extensive testing. At Tinder, we discovered via an experiment that removing our existing First Time User Experience (FTUE) didn’t hurt the user experience. That means the feature wasn’t adding any value by being there! Then, we tested a new FTUE design and it improved the experience for a strategic market. This gives us two takeaways:

  • What worked in the past may not anymore — question assumptions!
  • Swipe Right® might seem like second nature in the US now, but different cohorts may benefit from more/different education

Questioning assumptions is important especially if you care about growing to a global audience. Our intuitions aren’t perfect. We may not represent our target demographic in terms of gender, age, or locale. Tinder is a popular dating app because it appeals to a wide range of people across the world.

“Fail Fast” is a trendy phrase at tech startups. What gives the phrase meaning is the understanding that it’s not a failure if you learn something. Here’s a fun, visual example of a rocket exploding:

From https://arstechnica.com/science/2020/12/starship-rises-high-performs-a-flawless-flip-but-doesnt-quite-stick-the-landing/. Funnily enough, their article has changed since this image was grabbed, so perhaps they performed their own A/B test and optimized to a different set of images.

The context: SpaceX is in the process of developing an ambitious new rocket (dubbed Starship), and their development process involves flying test rockets. Oftentimes their tests end in explosions, as we see here. And yet, despite the pile of wreckage in the bottom photo, here was CEO Elon Musk’s public reaction on Twitter [emphasis added]:

Not only was Elon Musk not dismayed by the explosion and wreckage (euphemistically termed a Rapid Unscheduled Disassembly), he was actually thrilled with the outcome! In traditional aerospace this would be a disaster resulting in congressional hearings. Instead, it was more important to Elon Musk that they learned something from this test — that they got the data they needed to continue improving the rocket. And they did — a couple of attempts later they succeeded at landing the rocket. Thus the takeaway:

  • “Fail Fast” — It’s Not A Failure If You Learn Something

This lovely YouTube video compiles the many “failures” (explosions) of SpaceX’s earlier Falcon 9 rocket during its development. SpaceX compiled this video themselves; they are not shy about sharing their trials and tribulations. Now, of course, we know the Falcon 9 as the first orbital-class rocket to land its booster stage. The Falcon 9 has been wildly successful in the launch industry: supplying the International Space Station, launching a Tesla into space (on its Falcon Heavy derivative), carrying classified military payloads, and lofting countless commercial satellites.

This Economist graph shows how SpaceX (dark blue bar) has been growing and taking over the global commercial launch market. SpaceX’s path to dominance was only possible because SpaceX was not afraid to repeatedly “fail” — to test lots of their rockets and have a fair number of them explode.

Failing Fast can also be done with controlled experiments. A/B tests are not just about incremental optimization. Entire new products can be conceptually tested in the market. In the game industry, potential art styles and settings have been chosen to good effect using ad-based fake door tests that only cost thousands of dollars to run, compared to millions of dollars for the development of a complete game. The game industry is a multi-billion dollar industry, bigger than movies and sports combined, and some of the most successful game studios are the most rigorous about A/B testing and optimization.

Tinder Bottom Nav UI experiment — notice the buttons

At Tinder, we also learn from failures. Two different initiatives, many months apart, attempted to make fundamental changes to the navigational layout of the Tinder UI. The first version, shown above on the left, didn’t work. Eventually, our next attempt did work! The second time was the charm.

Another Tinder example: a common piece of feedback we receive is “why don’t you show more profile bio text?” Indeed, we know (from an experiment) that it helps to show bio text — when it’s available. The problem is: not enough people fill out their bios. How could we encourage more bios?

We ran an experiment presenting users the option to add a bio during onboarding. It didn’t help. Requiring that users add a bio, even with instructions and explanation, resulted in user dropoff. Finally, we discovered (through another experiment) that showing more lines of bio text on profiles by default inspired members to improve their own bios. Ultimately we learned that there’s a time and place to encourage adding profile bio text.

The greatest number of failures I’ve heard of when testing a feature comes from a major tech company trying to get a new photo display experience working. It took 5 attempts at the feature, each A/B tested, before their new implementation outperformed the legacy photo experience. The fifth time was the charm. You have probably used this feature.

To sum up this section: “If At First You Don’t Succeed, Try, Try Again!” Big swings may not work at first — most experiments aren’t successful. As another saying goes: “Absence of evidence is not evidence of absence” — just because one attempt at an idea didn’t work, doesn’t mean it can’t ever work. Subtleties in design and implementation, or in timing, can make a huge difference. Instead of investing heavily into research, multiple design cycles, and risking analysis paralysis, sometimes you get better returns on investment from collecting data on iterations in the real world, when you Fail Fast and Just Test It.

At Tinder I am the product manager for our in-house experimentation platform we call Phoenix. Our goal is for Phoenix to be a trustworthy platform that makes it fast and easy to set up, manage, and conclude controlled experiments. Key feature areas include:

  • Ideation and design
  • Configuration and management
  • Analysis and conclusion

Screenshot collage from Phoenix highlighting campaign management, feature flag management, and reports
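Phoenix’s internals aren’t public here, but one core mechanic any such platform needs is deterministic variant assignment. Below is a minimal sketch (my illustration, not Phoenix’s actual implementation) of the common approach: hash the user ID together with the experiment name, so each user’s assignment is stable over time yet independent across concurrent experiments.

```python
# Deterministic variant assignment via hashing -- a common technique in
# experimentation platforms. This is an illustrative sketch, not Phoenix code.
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants=("control", "treatment")) -> str:
    """Same user + same experiment -> same variant, uniformly distributed."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Hypothetical experiment names, for illustration:
print(assign_variant("user-42", "newsfeed_icons"))  # stable across calls
print(assign_variant("user-42", "bio_prompt"))      # independent per experiment
```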

We launched the platform in 2019 and usage is growing nicely, roughly doubling every 12 months. The number of concurrent experiments currently sits at about 400, or around 100 if you group similar experiments that test the same feature or hypothesis. Building our own platform pays dividends here: third-party solutions charge by volume, which doesn’t exactly encourage more A/B testing. Our tighter and more responsive integration with other internal Tinder services is another key benefit. We stand in good company, as it appears every major consumer tech company has also built its own in-house A/B testing platform. Some have multiple A/B testing platforms, and some have as many as six!

Speaking of other major tech companies, here’s a collection of quotes about their experimentation cultures I find inspiring:

Jeff Bezos, Amazon: “Our success at Amazon is a function of how many experiments we do per year, per month, per week, per day.”

Google: “… Experimentation is practically a mantra; we evaluate almost every change that potentially affects what our users experience”

Netflix: “… Every product change Netflix considers goes through a rigorous A/B testing process before becoming the default user experience.”

Mark Zuckerberg, Facebook: “… The key is building a company which is focused on learning as quickly as possible… Building a company is like following the scientific method… We invest in this huge testing framework… At any given point in time, there’s not just one version of Facebook running in the world — there’s probably tens of thousands of versions running.”

The fact that trillion-dollar tech companies are all believers in controlled experiments is suggestive. However, it’s difficult to prove the value of A/B testing to the level of analytic rigor we encourage for experimenters. As the meme goes,


Ironic. He could A/B test others, but not himself.

As many scientists have wished: if only we could create parallel universes at will! The best I can offer is this correlational analysis:

Revenue and experiment data compiled from publicly available sources, such as blog posts like this one that mention the number of concurrent experiments running at a company, cross referenced to likewise publicly available revenue figures. Reach out to me if you have another data point to add!

As you can see, there’s a clear trend between # concurrent experiments (x-axis) and revenue in $ billions (y-axis). Tinder falls in the band, and our goal is to follow the trend up and to the right!

A/B testing can be a win/win that drives business performance while simultaneously improving the user experience. At Tinder, we ran an experiment promoting civility with an “Are You Sure?” message: when a member attempted to send a message containing objectionable language, we would intervene with an Undo Message prompt. The results were positive on multiple fronts (see the sketch after this list for how a “no harm” check like the third point can be made rigorous):

  • Fewer messages containing objectionable language sent
  • Fewer reports about harassment
  • No reduction in engagement of members who were given the prompts
  • “Are You Sure” feature well received in the press
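For a guardrail metric like engagement, the question isn’t “did the treatment win?” but “can we rule out a meaningful drop?”. Here is a minimal sketch of that check using a confidence interval on the difference in engagement rates; all counts are illustrative, not Tinder’s actual numbers.

```python
# Guardrail check: did the "Are You Sure?" prompt reduce engagement?
# All counts are made up for illustration.
from statsmodels.stats.proportion import confint_proportions_2indep

engaged_prompted, n_prompted = 41800, 50000  # members shown the prompt
engaged_control, n_control = 41750, 50000    # control group

low, high = confint_proportions_2indep(
    engaged_prompted, n_prompted, engaged_control, n_control, compare="diff"
)
print(f"95% CI for engagement difference: [{low:+.4f}, {high:+.4f}]")
# If the whole interval sits above our tolerable loss (say -0.005),
# we can conclude the prompt did not meaningfully hurt engagement.
```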

It’s not easy to build rigorous and scalable experimentation systems for demanding internal customers, but this kind of measurable success makes the effort meaningful at the end of the day.

What’s that? You want me to give you some insight you can really use, like on your Tinder profile?

Okay! Here are some tips for your Tinder pics:

  1. Show your face
  2. Smile
  3. Not in a group
  4. On a beach (if you must have a pic showing skin)
  5. With a pet

These factors have been experimentally proven on Tinder to help! 🙂😃😄😁😊🥰😍🏖🐶

More serious parting advice for those getting started on their experimentation journey: To maximize ROI of your data-driven product development,

  1. Concentrate analytics on the most important KPIs and user dimensions
  2. Focus experimentation towards the top of the funnel, or where important (revenue-driving) interactions occur
  3. Build MVPs — test early and often

I’ll leave you with these three main takeaways:

  1. A/B testing is a simple idea that can be simple to apply
  2. Useful for more than incremental optimization — A/B tests can yield deep insight
  3. Fail Fast and Just Test It — A/B tests have the highest ROI of any data activity

Hope this helps. Reach out on LinkedIn if you like my work. And if you like what you’ve seen about Tinder’s experimentation culture, check out our job postings — we’re hiring!

https://www.linkedin.com/company/tinder-incorporated/
