Threats of A/B tests and UX research: adoption time and incrementalism

Adam Plona (@adlon)

Marissa Mayer

You probably know the story about the forty shades of blue. It was an experiment run in 2009 by Marissa Mayer, then product manager at Google, today the CEO of Yahoo. Earlier, Google had been using two shades of blue for links: one for the search engine and a different one for email. Unable to pick just one, Mayer decided to test 41 shades of the color. So a 2.5% sample of Google search engine users was divided into groups, and Google showed each group the links in a different shade. After several weeks, the winning color was picked. The most clickable color, Mayer said, was slightly closer to purple than green.
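For readers who have never set up such a split, here is a minimal sketch of how a 2.5% sample could be divided into 41 groups. This is not Google's actual method, which the story does not describe. The hashing scheme, the function name `assign_variant` and the experiment label `"link-blue"` are my own assumptions, used only for illustration.

```python
import hashlib

SAMPLE_RATE = 0.025   # share of users in the experiment (figure from the story)
NUM_VARIANTS = 41     # number of shades tested (figure from the story)


def assign_variant(user_id: str, experiment: str = "link-blue"):
    """Return a variant index in 0..40, or None if the user is outside the sample.

    Hypothetical scheme: hash the user id together with the experiment name,
    so the same user always lands in the same group.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # The first 8 bytes decide sample membership, the next 8 decide the variant.
    sample_draw = int.from_bytes(digest[:8], "big") / 2**64
    if sample_draw >= SAMPLE_RATE:
        return None
    return int.from_bytes(digest[8:16], "big") % NUM_VARIANTS
```

Because the assignment depends only on the user id and the experiment name, a returning user keeps seeing the same shade for the whole duration of the test.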

For the next several years web designers and product managers used this example as proof of Google taking its love of data too far. After all, it does seem excessive to test 40 shades of blue. But in early 2014, Dan Cobley, the managing director of Google’s UK division, revealed some more figures, promptly shutting down all the scoffing critics. It turned out that this small experiment yielded for the Mountain View giant… an additional 200 million dollars per year.

Again. 200 million.
What’s not to love about tests like that?

Today, Google runs more than 7000 A/B tests per year. Every UX designer knows that conducts a thousand concurrent tests. A/B tests are used by the world’s leading politicians. For example, we know that in Barack Obama’s email campaign, which was fully based on A/B tests, the most frequently opened message was the one titled “Hey”. Qualitative studies, too, are no longer a costly extravagance but a normal step in the process. As you can see on the graph, interest in UX tests continues to grow.

Interest in UX research over time

I don’t know about you, but I can vividly remember times when the development of internet products was based on something very different. On intuition, convictions, textbook examples and common practice. Nobody asked whether the designed change had been tested prior to implementation. Today that would be unprofessional, especially in large organizations. Today, to change something, we need to back it up with numbers, or at least with a qualitative study report. If the test turns out well, you implement. If not, tough luck.

To put it briefly — instead of wandering in the fog, now we rely on hard data. But there’s a catch.

We’re all neophytes of the data-driven philosophy. And neophytes are always uncritical and tend to disregard the weaknesses of their religion. And that’s what I wanted to discuss today. Let’s begin.


Habit. An automated activity acquired in the course of frequent repetition. This simple word is fundamental to every UX designer or product manager. That’s because every interface is a nest of habits. Every website, app, internet product is an incubator of habits.

When you go on Facebook, you no longer think about where to check for notifications or how to write a new post. You see something interesting and you want to take a photo with your phone and then upload it to Instagram? Sounds banal, but in fact it’s quite a complicated series of activities. But those who use Instagram do it almost automatically.

Every UX designer knows that there are also global habits, which our industry refers to as conventions or design patterns. It’s thanks to them, among other things, that on a new website you always look for the search bar in the upper right corner of the page, and for the logo on the left.

Habits can be really helpful in our daily lives. Sometimes, they can also make life so much more colorful. If you’re like me and you ever wanted to send a hot text to your girlfriend, but you habitually picked the first person from the recent recipients list and… it turned out to be your mom, then you know what I’m talking about ☺

But let’s go back to the crux of the matter.

Habits can be considered loops. This is the visualization proposed by Charles Duhigg in his very good book “The Power of Habit”. Duhigg wrote that a habit is a constantly working loop: cue, routine, reward. I want to have a pleasant evening: that’s my cue. I text my girl: that’s my routine. I send the text and catch myself smiling: that’s my reward.

Habits are established when we repeat an activity often enough. So if you often use an internet website, read the news on your favorite portal, check the game results, or just look at the weather for tomorrow — sooner or later you’ll form at least a dozen habits.

Simple. Nothing controversial or innovative.

But the fun starts when a forced change occurs in the habit. And here’s where we come in, the people who create internet products. Every one of us, through design decisions, has the power to influence the habits of our users, specifically the main element of the habit loop: the routine.

Let me use an example that I gave at the UX Poland 2015 conference. Two months ago, on the homepage of Wirtualna Polska (the biggest portal in Poland), we conducted an A/B test of nine variants of the navigation bar. I’ll show you three of them. The most frequently used navigation element, on practically every Polish portal, is the mail link. It collects tens of thousands of clicks each day. In the current header, the mail icon had always been next to the search bar. But in these three variants we changed its placement.

Multivariant test of navigation bar

And what happened? The test showed that the icon that replaced the mail icon next to the search bar generated the most clicks in each variant. It always worked best. What does that mean? It means that users used that element from memory. They clicked a familiar place, not noticing that the icon and its name had changed.

You probably already know what we were dealing with. A habit. And we influenced it, quite inadvertently, by changing the standard placement of the mail icon.

And this is the crux.

Every time we want to test any change in any system that will influence the user’s habit, we can be sure that tests will not tell us whether what we want to change is better or worse than the previous solution. The test results will probably be surprising, but they won’t give us an answer, or the answer will be negative.

This was our case — we suddenly found that in the tested variants, the number of clicks on the mail icon dropped dramatically. We were wondering where those clicks had gone when we noticed the unusual behavior of the icon next to the search bar. We then understood that we had stepped on a habit’s toe.

You could say: “you should just test the interface change long enough”. Long enough for users to change their habit and only then check whether the change is better or worse than the current solution.
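To make "long enough" a bit more concrete, here is a minimal sketch of one possible check: compute the daily relative difference between the variant and the control, and see whether it is still drifting. The function names, the 7-day window and the 0.02 tolerance are my own assumptions, not something we used in the Wirtualna Polska test.

```python
def daily_relative_lift(control_rates, variant_rates):
    """Relative difference between variant and control for each day of the test."""
    return [(v - c) / c for c, v in zip(control_rates, variant_rates)]


def has_stabilized(lifts, window=7, tolerance=0.02):
    """True if the last `window` daily lifts all stay within `tolerance` of their mean.

    A lift that is still climbing or falling suggests users are still
    adapting, so the test result cannot be read as final.
    """
    if len(lifts) < window:
        return False
    recent = lifts[-window:]
    mean = sum(recent) / window
    return all(abs(x - mean) <= tolerance for x in recent)


# Hypothetical click-through rates, invented purely for illustration:
# the variant starts far below the control and slowly catches up.
control = [0.10] * 10
variant = [0.060, 0.065, 0.070, 0.075, 0.080, 0.085, 0.090, 0.095, 0.100, 0.105]
print(has_stabilized(daily_relative_lift(control, variant)))
```

A check like this only tells you that the numbers have stopped moving over the chosen window. It cannot tell you that every user has finished changing their habit.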

Unfortunately, the business reality is not that perfect. To explain why, we need to go back in time a little.

I’m sure you’ve heard that 21 days is all you need to shape a new habit or change an existing one. This myth is a misinterpretation of studies conducted by plastic surgeon Maxwell Maltz. In 1960 he published a book entitled “Psycho-Cybernetics”, which sold 30 million copies and became an almost instant bestseller. In the book Maltz described how patients on whom he performed serious plastic surgery, e.g. nose modifications, needed around 21 days to get used to their new appearance.

Maltz wrote: “Observed phenomena tend to show that it requires a minimum of about 21 days for an old mental image to dissolve and a new one to jell”.

This was followed by an outpouring of all sorts of guides on how to change any habit within 21 days. However, it is easy to miss one important part of Maltz’s writing — the words “a minimum of”.

In 2009, Dr. Phillippa Lally, a health psychologist at University College London, published the results of another experiment. She studied 96 people who were tasked with forming a new habit in their lives over 12 weeks. Results?

Indeed, those who wanted to make a habit of drinking a glass of water with lunch were able to form that habit within around 20 days. However, those who chose a more difficult habit, eating fruit for lunch, needed as many as 40 days. To shape the habit of taking a 10-minute walk after breakfast, the subjects needed 50 days, and getting used to doing squats after morning coffee required 84 days.

Dr. Lally’s subjects needed anywhere from 18 to as many as 254 days to form a habit. On average, it took 66 days. But Dr. Lally’s main conclusion was that habit formation is highly individual and that it is impossible to predict how long the process will take.

But let us go back to UX.

It is true that by performing a sufficiently long test we would be able to remove the influence of habit change on results. But in business reality this is a pipe dream. Who has the time to test a single element for three weeks? Not to mention 66 days, which is over two months. What about time-to-market, ASAPs and deadlines? And even if we ran tests that long, we would not be sure that enough time had passed.

OK. Let’s leave habits, because they’re only the beginning of our problems with tests and studies.

Any change of an important part of an interface carries one more risk.

If you’ve ever been involved in a large internet service, one with millions of users, you surely know how people react to major change.

Angry first reactions from users to recent redesigns of major Polish portals

What you see are user reactions to recent redesigns of several major Polish portals, including our Wirtualna Polska. Why all the rejection and anger?

The culprit is a phenomenon described by psychiatrist Elizabeth Kübler-Ross. It can be simplified and visualized with the so-called Change Curve.

Change Curve.

It illustrates how people react to major breakthroughs, revolutions and change. This psychological mechanism works every time, although with varying intensity. For this reason the model is widely used in business change management: when the CEO is replaced, when the team is downsized, etc.

Simplified Change Curve

When we show users the new layout of their favorite service or when we significantly change the way it works, we place them on the same psychological rollercoaster. Users, especially heavy users, gradually go from denial and anger to acceptance and involvement, and their attitude to the change shifts over time from extremely negative to positive.

So very human.

Loyal users invest a lot of time in learning the current interface. They use it very efficiently and can quickly achieve their goals within it. Check their mail, share photos, send texts or simply read articles in their favorite section. When change occurs, their whole investment goes down the drain. They need time to understand and appreciate the improvement we implemented. They need to move to the next step on the change curve.

And here’s the catch.

When we design a major change and want to test it, our subjects are in one of the first two phases.

Tests and studies take place in these phases

Their service is about to change. Their investment is about to be lost. This makes their opinions and behaviors unfit to serve as the foundation for a decisive evaluation of whether the thing we want to do is a change for the better or worse. Most subjects, if they have used our product before, will be dissatisfied.

A few examples.

1) Facebook. Look, the same process is repeated every few years. Every subsequent change in the layout or the way the news feed works causes a very similar reaction.

Reactions to changes on Facebook

Protest. Petitions. Demands to revert to the previous version. Threats to delete accounts. Users who protested in 2006 gradually got used to the new news feed. When Zuckerberg changed it again, they protested again in 2009. Around 18 months later the change curve worked once more. The users entered the acceptance phase. But Facebook changed yet again, and we had a rerun of the same old show in 2011. And the same will happen with the next change.

2) Remember when Google introduced the new search mechanism, the Instant Search, where results appear as you type your query in the search field? Users protested. Here’s one of them. Today we don’t even notice the mechanism.

Instant Search by Google and an example of first reactions

3) And Microsoft. I remember well the wave of hate that swept through the Internet when the Redmond giant implemented the so-called ribbon navigation in Office. Everyone protested, myself included. I was used to the old interface, which I knew inside out. That change killed most of my habits. Today I don’t have any problems with the new interface. I can even say I like it.

Old vs new Office interface

The transformation of habits and the reaction to change are the two elements that contribute to the biggest problem in tests and studies:

adoption & incrementalism.

Don’t get me wrong. A/B tests are a great tool for testing interface changes that do not modify user habits. Changes that do not invalidate users’ current investment in learning the system. It is thanks to these tests that we know that the red button works 34% better than the green one. It’s thanks to them that we know which of the 40 shades of blue is the most effective color for links.

The red button is 34% better than green
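For a change like this, where habits are not involved, reading the result is mostly arithmetic. Below is a minimal sketch of how a relative lift and a two-proportion z-test could be computed. The click and visitor counts are hypothetical, chosen only so that the lift comes out at 34%; they are not data from any real test, and the function names are my own.

```python
import math


def relative_lift(conv_a, n_a, conv_b, n_b):
    """Relative improvement of variant B's conversion rate over variant A's."""
    return (conv_b / n_b) / (conv_a / n_a) - 1


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical numbers: green button 1000 clicks out of 20000 views,
# red button 1340 clicks out of 20000 views.
lift = relative_lift(1000, 20000, 1340, 20000)
z, p_value = two_proportion_z_test(1000, 20000, 1340, 20000)
print(f"lift={lift:.2%}, z={z:.2f}, p={p_value:.2g}")
```

The pooled z-test is just one common choice for comparing two conversion rates. It assumes independent visitors and samples large enough for the normal approximation to hold.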

Unfortunately, the most interesting changes cannot be researched in depth.

In those cases, tests and studies are just a tool to predict short-term risk. They let us know what the first reaction of the users will be, how many hits we will lose, how many people will be disappointed, and by how much our revenue will fall in the first weeks after the change. However, they won’t tell us much about long-term change. About what will happen when users move on to the next stages on the change curve: acceptance and engagement. Will it be better or worse? This the tests will not show. What they will usually show is just the depth of the first wave of dissatisfaction.

Tests and studies only tell part of the truth. They don’t say what’s better or worse, they won’t tell you to implement or not implement. They serve as early warning signs, but they don’t provide final answers. We are alone in our assessment of long-term effects.

Intuition, knowledge, experience, vision and strategy. These elements should not be underestimated in this data-driven world.

If we’re certain that the change we’re introducing will improve our product and business, we should not rely on tests, because they will lead us astray into what’s referred to as incrementalism. They will encourage us to implement hundreds, even thousands, of small changes, none of which is radical; each one just optimizes our interface a little. And then we’ll reach the local maximum and won’t be able to squeeze more out of our product, service or application. We’ll need radical change. And to implement that, we need to look differently at the results of tests and studies.
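The local maximum is easy to show with a toy example. Below is a minimal sketch of greedy "hill climbing", which is roughly what a long series of small, test-approved changes amounts to. The score function is a made-up landscape with two peaks, not a model of any real product.

```python
def hill_climb(score, start, step=1, max_steps=1000):
    """Greedy incremental optimization: accept a small change only if it scores higher."""
    current = start
    for _ in range(max_steps):
        best_neighbor = max((current - step, current + step), key=score)
        if score(best_neighbor) <= score(current):
            return current  # no small change helps any more
        current = best_neighbor
    return current


def toy_score(x):
    """Hypothetical landscape: a low peak at x=2 (score 5) and a higher peak at x=10 (score 20)."""
    return max(5 - (x - 2) ** 2, 20 - (x - 10) ** 2)


stuck_at = hill_climb(toy_score, start=0)    # small steps from the current design
radical = hill_climb(toy_score, start=7)     # small steps after a big jump
print(stuck_at, toy_score(stuck_at), radical, toy_score(radical))
```

Starting from 0, every small step that passes the "is it better?" test leads to the lower peak, and from there every further small step looks like a loss. Reaching the higher peak requires a jump across the valley, where scores are temporarily worse, and that is exactly the kind of move an incremental test will advise against.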

“If I’d asked my customers what they wanted, they’d have said a faster horse”, Henry Ford once said when asked why he didn’t do focus groups.

Steve Jobs was of a similar opinion. Today even companies famous for their data focus realize that intuition and vision are as important as incremental optimizations.

Scott Huffman, engineering director at Google, said: “One thing we spend a lot of time talking about is how we can guard against incrementalism when bigger changes are needed”. He then paraphrased Ford: “If you rely too much on the data, you never branch out. You just keep making better buggy whips.”

It’s almost certain that your work will at one point or another require you to make a serious, big, revolutionary change. It will be necessary to keep your product growing. Such change always involves risk. It cannot be studied, it has to be carried. You have to face the problem of adoption.

No risk, no fun.

This is a transcript of my speech at the Product Camp 2015 conference. A Polish version is available on my blog. If you like it, feel free to share it on Facebook or Twitter. And BTW, I’m @adlon on Twitter.