Has A/B Testing gone too far? 

Is disadvantaging people a necessary side-effect of making more data-driven decisions? 

Akshita Ganesh
Published in I. M. H. O.
Nov 26, 2013

The Valley likes to A/B test. Yes, they do. Immensely. More than you can imagine, unless you’re a product, marketing or growth manager somewhere.

Sure, you’re probably thinking now — big deal, so what if my “Log In” button is red instead of green. If my Facebook chat bar gets moved from left to right, I don’t mind enough to read a whole article about it.

My answer — Neither do I! I’m O.K. with Facebook trying out button placements or fonts for a couple of days before picking the right one.

So, that’s type 1 testing. It’s binary, it has a clear hypothesis, and it disrupts the experience of only a small percentage of your users. When your friend gets the cool new menu bar before you do — sure, it bothers you, but you know you’ll have it in a couple of days.

What’s type 2 testing? Multivariate, with a less defined goal: comparing a series of different versions to arrive at the one that optimizes a single metric, or a set of metrics, and then flying with that variant.

Gaming companies A/B test ruthlessly. They do multivariate testing just as ruthlessly. That’s their thing. Read a job posting for a product manager at any gaming company — “Data-Driven Decision Maker”. The fateful Triple-D. Data-Driven Decisions. What they mean is — A/B test the life out of a product by showing it to a small group of players, show how it moved every single minuscule metric, and then decide if it was worth building at all.

I get it. Fundamentally, game studios — designers and product managers — have so much they want to build. Most full-blown features are expensive to develop and design. So build a quick-and-dirty version, test it out, see the upside and then go all out.

Zynga founder Mark Pincus was quoted as saying:

We built a data warehouse with a testing platform so we’re running several hundred tests at any given time for every one of our games. And no single user has more than one test.

And that’s exactly what they do.
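How might “no single user has more than one test” actually be enforced? Here is a minimal sketch in Python of one common approach, deterministic hash bucketing, so a player always lands in the same slice; the experiment names, the 100 buckets and the 5% slice per test are all invented for illustration:

import hashlib

# Hypothetical sketch: experiment names, bucket count and slice size are invented.
EXPERIMENTS = ["coin_multiplier", "new_menu_bar", "daily_bonus"]

def assign_experiment(player_id):
    """Hash the player id once; the resulting bucket maps to at most one test."""
    digest = int(hashlib.sha256(player_id.encode()).hexdigest(), 16)
    bucket = digest % 100                      # 100 equal slices of the player base
    if bucket < 5 * len(EXPERIMENTS):          # 5% of players per experiment
        return EXPERIMENTS[bucket // 5]        # disjoint slices, so one test each
    return None                                # everyone else sees the default game

print(assign_experiment("player_42"))

Because the slices are disjoint, every player sees at most one experiment, which is roughly the property Pincus describes.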

From a business angle, it makes perfect sense. It’s cost-effective: you test a hypothesis cheaply and let the ROI decide whether it’s worth investing more resources. A Triple-D, through and through.

But. Is it fair?

In games where comparative player progress is the main metric on a leaderboard and leaderboard progress gives the player some in-game benefits, A/B testing can become a vicious cycle over time.

I’m going to take a very simple example to illustrate this. Imagine a simple game where you click a button → you get X coins → you buy another button with X coins → you click 2 buttons → you get 2X coins → you buy 2 more buttons, and so on.
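In code, that compounding loop might look something like this rough sketch (the function and its names are mine, and I’m assuming, as the description above implies, that a button costs exactly X coins):

def buttons_after(rounds, coins_per_click, button_cost=1.0):
    """Start with one button; each round, click every button once and spend all coins on new buttons."""
    buttons = 1.0
    for _ in range(rounds):
        coins = buttons * coins_per_click   # every button pays out once per round
        buttons += coins / button_cost      # reinvest everything in more buttons
    return buttons

# With the payout equal to the button cost (the designer's X), buttons double every round.
print([buttons_after(n, coins_per_click=1.0) for n in range(4)])   # [1.0, 2.0, 4.0, 8.0]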

Now, the product managers at this gaming company decide they want to figure out the ideal value of X on each button click, the one that will optimize fun AND revenue. What do they do? They say — let’s do a test! Fireworks! OK. PM 1 says — I think 0.8X is the magic number. PM 2 says, naaah, it’s 1.1X. The game designer says — no, you idiots, it’s X. I’m the designer.

So PM 3 decides to test this.

There are now 3 players:

A gets 0.8X — so he can buy 0.8 buttons after his first click (1.8 buttons in total), earns 1.44X coins the next round, and has 3.24 buttons after 2 rounds of clicking.

B gets X — so he can buy 1 button after his first click (2 buttons in total), earns 2X coins the next round, and has 4 buttons after 2 rounds.

C gets 1.1X — so he can buy 1.1 buttons after his first click (2.1 buttons in total), earns 2.31X coins the next round, and has 4.41 buttons after 2 rounds.

Now imagine a leaderboard in descending order of buttons — the player with the most buttons on a given day wins the daily leaderboard! The winner gets 5X coins, second place gets 3X coins and third place gets X coins, which, at X per button, convert straight into 5, 3 and 1 extra buttons respectively.

At the end of the day, where are the 3 players now?

A is 3rd place — has 3.24 buttons and X coins = 4.24 buttons

B is 2nd place — has 4 buttons and 3X coins = 7 buttons

C is 1st place — has 4.41 buttons and 5X coins = 9.41 buttons

Imagine 10 such days. Then the PMs conclude that the designer was right after all and make everybody get X coins on each click, since that variant optimizes revenue and retention (a proxy for “fun”).

I have chosen not to create an Excel model of the 10 days, but if the reader wishes to see what happens after day 1, he is welcome to.
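For readers who would rather skip the spreadsheet, here is my own rough simulation of those 10 days. The assumptions are mine, not the author’s: each day is two clicking rounds followed by the daily payout, a button costs X coins, fractional buttons are allowed, and the payout is spent on buttons immediately.

def simulate(days=10, rounds_per_day=2):
    # Sketch of the example above; multipliers 0.8 / 1.0 / 1.1 are the three test variants.
    players = {"A": {"mult": 0.8, "buttons": 1.0},
               "B": {"mult": 1.0, "buttons": 1.0},
               "C": {"mult": 1.1, "buttons": 1.0}}
    payouts = [5.0, 3.0, 1.0]   # 1st, 2nd, 3rd place, in buttons' worth of coins
    for day in range(1, days + 1):
        for p in players.values():
            for _ in range(rounds_per_day):
                p["buttons"] += p["buttons"] * p["mult"]   # click every button, buy new ones
        ranked = sorted(players, key=lambda name: players[name]["buttons"], reverse=True)
        for place, name in enumerate(ranked):
            players[name]["buttons"] += payouts[place]     # daily leaderboard reward
        print(day, {name: round(p["buttons"], 2) for name, p in players.items()})

simulate()

Under these assumptions, player A ends every one of the 10 days in third place, and the gap to B and C only widens.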

What does he find? Player A has, by virtue of chance, been set up for ultimate failure. Even if he plays harder than all the other players, squeezes in an extra click here or there, he will be 3rd on the leaderboard on day 1 and hence be 3rd on the leaderboard on day 2 and hence be 3rd on the leaderboard on day 3 ad infinitum.

Ask yourself again, is this fair?

Testing is an amazing tool: it can enable an optimal player experience and let businesses try out ideas without investing significant resources, building the complete version only after validation.

Equally important is balancing this with fairness. Maybe by day 15, player A would have grown sick of losing despite his best efforts and quit playing. Product, marketing and growth managers need to understand the impact on the system not just during the period of the experiment but over the lifetime of a user, weighing the benefits of the test against the distortion of the user experience.

You can follow me on Twitter at @akshitag.
