Tested on people, or how and why to start A/B testing?

Published in JustAnswer · 16 min read · Jun 23, 2020

Hello, my name is Pavlo, and I am an engineering manager at JustAnswer. Before that I was a lead, and before that a developer: UMC, then MTS, then Conscensia and JustAnswer. Overall, almost 15 years in IT. But I have a rather unusual background for a developer: I graduated from the sociology faculty, and it helps tremendously when working on a product, in marketing, consumer behavior, and understanding business needs. For the last four years I have worked in an environment where business decisions are made only after a hypothesis has been proposed and verified with A/B testing. I joined the company at the stage when the complexity of the hypotheses the business wanted to test had reached the limits of the platform in use at the time, and the company had to choose where to go next.

If you have thought about A/B testing on your resource, then most likely you have already formed the basic indicators for your project that you would like to improve: conversion, financial metrics, behavioral factors, whatever is relevant to your product or service. If there are no such metrics yet, the right first step is to define them.

The article will be useful, first of all, for developers, as it describes the main components of A/B testing and the basic principles of building tools to implement them. It does not, however, cover such important components as preparing a test plan and analyzing the data: that is a different topic, worthy of a separate post.

A little bit of conspiracy theory

Has it ever happened to you that you saw a very attractive offer on a carrier’s website, and when you decided to book the tickets, the offer had evaporated? Or you got a very good deal on a trial subscription to a music service, yet on the same day your friends did not see the same option? Or maybe you have just noticed that the Amazon site on your colleague’s computer looks different from yours? You might assume that you are being watched by Big Brother (as in Orwell’s book), that there is a big conspiracy against ordinary Internet users and you have become its victim.

But most likely, you just became a participant in A/B testing. And yes, you are being watched!

A/B testing. What is it

The concept of “A/B testing” comes from traditional marketing and is essentially a research method in which the control elements (variation A) of an object (product packaging, advertising, web pages, etc.) are compared with similar ones (variation B) in which one or more elements have been changed, to find out which of the changes improve the target indicator (Wikipedia).

Let’s make it simple

To put it in the simplest way: in order to find out whether a new red button will actually be pressed more often than the old blue one, one half of the visitors must be shown the red button while the other half sees the blue button, and both groups have to be observed. In this case, the control group consists of the people who saw the old blue button (variation A). The purpose of the test is to compare their behavior with the behavior of those who saw the new red button (variation B). Hence the name: A/B testing. If you want to check more options at the same time, there may be more variations, and the test can be labeled A/B/C/D/…/N, although the most widespread variant is still plain A/B. How long to observe, and which variations to show to whom, are not chosen out of the blue either; more on that later.

Why so complicated

At the heart of any change is a hypothesis — for example, people are more likely to notice and press the red buttons. I don’t think anyone will argue that changes in the look or functionality of a product affect consumer behavior. And, of course, we would like this impact to be as positive as possible, but the fact that someone from the “superiors” thinks the hypothesis is correct does not mean that service users have the same thoughts.

Developing a piece of functionality in full, taking into account all possible cases and complying with relevant standards, can be quite expensive. And given that the feature may not only bring no additional benefit to the business but also greatly harm it, such a “leap of faith” ends up being unjustified and risky.

In contrast to the traditional approach, A/B testing allows:

  1. Minimize the risks, because the hypothetical improvement will be available to the minimum number of visitors sufficient to decide on the usefulness of the hypothesis. Here mathematical statistics come in handy, allowing us to calculate the required sample size. More than one dissertation has been defended on these topics, but formulas scare readers away, so I will just say that there are many online calculators for this; you can start with a simple A/B Test Sample Size Calculator (a minimal calculation is also sketched right after this list).
  2. Save dozens of man-hours, because it is possible to test the hypothesis while omitting certain cases or not fully implementing the feature. For example, for the test, something can be hardcoded on the frontend, without wiring up the backend calls, without extracting any data from the database, or without supporting some exotic browsers. Example: to test the hypothesis that sales will increase if you add support for a new payment system, before investing in the development of such an option you can add a “Pay via X” button and count how many people click it to see whether it will be in demand. Those who click it are shown a message with an apology. This approach is also called a Demand Test.
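For a rough sense of what such a calculator does under the hood, here is a minimal sketch of the standard two-proportion sample size approximation (the 10% and 12% conversion rates below are made-up numbers):

```javascript
// Approximate sample size per variation for comparing two conversion rates.
// Uses the normal approximation with 95% confidence (z = 1.96) and 80% power
// (z = 0.84), which are the defaults of most online calculators.
function sampleSizePerVariation(baselineRate, expectedRate) {
  const zAlpha = 1.96; // two-sided alpha = 0.05
  const zBeta = 0.84;  // power = 0.8
  const variance =
    baselineRate * (1 - baselineRate) + expectedRate * (1 - expectedRate);
  const effect = expectedRate - baselineRate;
  return Math.ceil(((zAlpha + zBeta) ** 2 * variance) / (effect * effect));
}

// Example: baseline conversion 10%, we hope the red button lifts it to 12%.
console.log(sampleSizePerVariation(0.10, 0.12)); // ≈ 3834 visitors per variation
```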

A, B and their friends

Before moving on to practice and in order to avoid confusion, it is worth focusing on basic terminology.

The concepts of A/B testing and experiment are interchangeable in the context of this article.

The appearance of the page before applying the changes is called the control, or normal, variation. It will be compared with the modified, test view.

Both control and test are variations, which are usually denoted by a letter index: the control is variation A, the test is variation B.

If the hypothesis is confirmed and we want to completely replace the original look of the page with the variation look that won, this process is called normalization.

Gathering information about user interaction with page elements is called tracking. For example, we want to record if a user clicks a button. We need to register a unique identifier associated with this event. In simple words, we need to add tracking.

Prepare the A/B test yourself

But enough of bare theory. Let’s start figuring out the process.

There are many decent commercial solutions on the market, with their own advantages, disadvantages and pricing policies for different cases: Optimizely, VWO, Google Optimize, Adobe Target, etc. However, if A/B testing becomes an integral part of a company’s business development and decision-making process, its needs will certainly outgrow any off-the-shelf product and require its own approaches. Therefore, in this article we will try to find out how it works and how you can develop a prototype of an A/B testing platform, so that you can make an informed decision when choosing one.

Ingredients

Let’s analyze the components of the process of testing and preparation of the experiment.

Traffic

We need users. Without users, this whole idea does not make sense. We need a lot of users. But we do not need all of them; we need specific ones. Therefore, it is desirable to gather not all of the traffic but the targeted part of it. For example, if the hypothesis is about business expansion in the United States, we are not very interested in how visitors from Europe will react to the changes. Or we may wonder how people who came from a particular ad, or who viewed our pages from a mobile device, behave. That is, the traffic must be filtered.

Randomness

It is important to divide traffic evenly and randomly between variations: who will get experience A, and who will get B (or C, or D, etc., depending on the number of variations)? A minimal sketch of such an assignment follows.
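As an illustration only, a random split with configurable traffic weights might look like this (the variation names and weights are arbitrary):

```javascript
// Pick a variation according to traffic weights that sum to 1.
// Example: 50/50 split; could just as well be 90/10 for a cautious rollout.
function pickVariation(weights = { A: 0.5, B: 0.5 }) {
  const roll = Math.random();
  let cumulative = 0;
  for (const [name, weight] of Object.entries(weights)) {
    cumulative += weight;
    if (roll < cumulative) return name;
  }
  return 'A'; // fallback for floating-point edge cases
}
```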

Transformation

After dividing users into variations, it is necessary to provide each user with the correct experience that corresponds to their variation.

Reproducibility

Anyone who has received variation A should not receive variation B after reloading the page or returning to it later. Otherwise users will find it strange, and we will not be able to determine unambiguously which change affected their behavior. One common way to achieve this is sketched below.
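A minimal sketch, assuming we store the assignment in a cookie on the first visit and reuse it afterwards (the cookie name and the 30-day lifetime are arbitrary):

```javascript
// Return the stored variation for this browser, or assign one and remember it.
function getOrAssignVariation(testId) {
  const cookieName = `ab_${testId}`;
  const match = document.cookie.match(new RegExp(`(?:^|; )${cookieName}=([^;]+)`));
  if (match) return match[1]; // already in the test, keep the same experience

  const variation = Math.random() < 0.5 ? 'A' : 'B';
  // Remember the assignment for 30 days so reloads and return visits are consistent.
  document.cookie = `${cookieName}=${variation}; path=/; max-age=${30 * 24 * 3600}`;
  return variation;
}
```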

Activity tracking

Data collection and monitoring must be ensured. Let us go back to the example with the colored buttons. To confirm or refute the hypothesis about the color of the button, the required minimum is to record the fact that the user was assigned to a variation and to track the “button pressed” event. However, the more different data points we collect, the better we can understand user behavior. For example, if we additionally track the size of the screen or window, we can find interesting patterns: owners of small screens may be less likely to press the button. The explanation for this behavior may be that one needs to scroll the page to get down to the button, which not everyone does.

Flexibility

Among the advantages of A/B testing are flexibility and swiftness. It should be possible to start or stop a test independently of product releases, preferably even without the participation of engineers. In the course of the experiment, there may be circumstances in which the test should be temporarily paused or the audience settings need to be adjusted.

Sometimes early monitoring of the experiment may indicate that the test is failing and performance on the test variation has deteriorated greatly. At such moments it may be tempting to stop the test immediately so as not to lose money, and to conclude prematurely that the hypothesis was completely wrong and there is no sense in testing it on the whole sample. But this may simply be noise: statistical significance is achieved only on the full sample, and the potential loss is minimized by design anyway; it is the price paid for information and conclusions. On the other hand, such a drop may indicate a bug in the code or logic of the experiment, so it is worth checking whether users see exactly what the hypothesis intended and nothing is broken. In case of an error, you need to stop the test, fix the bug and start again.

What’s under the hood

There are two fundamentally different technical approaches to developing A/B tests on web pages, and obviously each of them has its advantages and disadvantages.

Here we will talk about techniques that can be applied to pages with any architecture. Obviously, if you know everything in advance, you can try to “tweak” the page or implement a new page architecture for a specific type of A/B test that takes the advantages of both approaches into account. But such tuning is expensive, and you cannot take everything into account in advance.

These approaches can be conditionally divided into Client Side (or Postload) and Server Side (or Preload).

Client Side

Within this approach, the user first gets the original page, and the JavaScript code of our A/B test determines the variation and applies the changes as quickly as possible, so that the user does not have time to notice the substitution. That is, we have the original page and code whose job is to make targeted changes, the transformations.

Architecture

There is a target page. The page always loads and tries to execute a JavaScript file that is hosted separately and may contain A/B test code. If we do not test anything now, the file will simply be empty.

Our goal is to execute this code as early as possible during the page load cycle and make the transformations before the original content is rendered. You can subscribe to the DOMContentLoaded or load events, for example (https://codepen.io/LukeAskew/pen/LnJsE), depending on the structure of the original page and the complexity of the test.
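For instance, a minimal way to run the experiment code as soon as the DOM is ready (and immediately, if it already is):

```javascript
// Run the experiment as early as the DOM allows; if the script is loaded
// after DOMContentLoaded has already fired, run immediately.
function runExperiment() {
  // variation lookup and transformations go here
}

if (document.readyState === 'loading') {
  document.addEventListener('DOMContentLoaded', runExperiment);
} else {
  runExperiment();
}
```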

The A/B test code does the following:

  1. Filters traffic (optional): decides whether the current visitor belongs to our target audience. Device, browser, window size, user country, whether they are logged in, shopping history, whether there is already something in the cart, which tariff plan they are on, which pages they have seen before, etc. In practice, it is just an if condition that performs a series of checks and decides whether to apply the transformations. You can even send an additional backend request on behalf of the user to get the details needed to make that decision; this can significantly delay the execution of the transformations, but sometimes it is the only way to test the hypothesis.
  2. Determines whether the user has already been in the test and should see the corresponding changes, or whether it is a new user who should be assigned to a variation. We do this through cookies and randomization.
  3. Performs the necessary transformations. That is, repaints, changes the size and position of the elements, removes them or replaces them with others.
  4. Adds tracking (optional). It is worth emphasizing that if before the experiment we were not interested in user interaction with the target elements, we will have to add tracking of these elements to the original variation as well, to be able to compare the variations. A sketch of steps 3 and 4 follows this list.
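Assuming the variation has already been determined as described in step 2, a minimal sketch of steps 3 and 4 might look like this; the selector and event name are hypothetical:

```javascript
// Apply the variation and wire up tracking (sketch; selectors are made up).
function applyVariation(variation) {
  const button = document.querySelector('#buy-button');
  if (!button) return;

  if (variation === 'B') {
    button.style.backgroundColor = 'red'; // the transformation under test
  }

  // Track clicks for BOTH variations, otherwise there is nothing to compare.
  button.addEventListener('click', () => {
    sendEvent('buy_button_click', { variation });
  });
}

// Placeholder for whatever tracking system is in use (see the tracking section).
function sendEvent(name, data) {
  console.log('track', name, data);
}
```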

Transformations can be small and subtle, but sometimes the hypothesis requires a radical overhaul of the page, or the transformations have to wait for a certain element to load. In these cases the user can observe the transformation process, and such special effects do not increase the user’s trust in the site.

There is a compromise solution for such cases. It’s not perfect, but it can help. Let’s call it the “white blanket” technique.

As soon as the page has loaded enough to execute our code, we throw on top of everything a white opaque element that covers whatever happens underneath: the loading of the original content and the transformation process. When everything has finished, the mask is removed. From the user’s point of view, this looks like a slightly delayed page load, after which the page appears all at once.
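A minimal sketch of the technique, assuming the transformations themselves happen elsewhere and we only manage the mask (the timeout value is arbitrary):

```javascript
// Cover the page with an opaque overlay while the transformations run.
// Appending to documentElement works even before <body> exists.
function showBlanket() {
  const blanket = document.createElement('div');
  blanket.id = 'ab-blanket';
  blanket.style.cssText = 'position:fixed;inset:0;background:#fff;z-index:99999';
  document.documentElement.appendChild(blanket);
}

function hideBlanket() {
  const blanket = document.getElementById('ab-blanket');
  if (blanket) blanket.remove();
}

// Usage: show the blanket as early as possible, remove it once the
// transformations are done (or after a safety timeout, so a bug in the
// experiment code never leaves the page permanently blank).
showBlanket();
setTimeout(hideBlanket, 3000);
```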

On JsFiddle you can play with an example of an A/B test built using this approach.

In this example, if you are assigned to variation B, then pressing the Run button again will clearly demonstrate the side effect of redrawing elements with complex changes. To change the variation, you must clear the cookie.

To start or stop the test, or change the distribution of traffic, we simply redeploy our JavaScript to the place where it is hosted, by whatever means are available to us.

Client Side Results

Pros:

  • Once the page has loaded, we have much more user information that we can use for more accurate targeting or as part of the modified experience.
  • In some cases it is easier to write the experiment code: a button can be repainted, for example, with a single line of code.

Cons:

  • Depending on the complexity of the variation changes, there may be side effects: the user may have time to see the original page and notice a flicker when everything is redrawn. There are techniques to deal with this, but in general it negatively affects the performance of the page; for users on the test variation the page will feel slower, which will affect their behavior and, consequently, the results of the A/B test.
  • Slow pages are worse for SEO.
  • If the hypothesis is confirmed and we are ready to apply the changes of the winning variation to the entire audience, the test code cannot simply be reused; the code of the original page has to be edited.

Server Side

Here, in contrast to the Client Side, you need to prepare all the final versions of the page (two, in the case of A/B) as variations, each accessible through a separate link. The variation is determined while the HTTP request is being processed, and the user immediately gets the desired version of the page, without changes being applied on top of the original page. The whole architecture of the A/B test is then reduced to routing and setting cookies. Optionally, but preferably for performance, the pages can be cached.

Of course, this can be adapted to the realities of each specific infrastructure, but here is an example using the Cloudflare CDN and their Workers (built on the Service Worker API). A Worker is essentially JavaScript that runs on Cloudflare’s servers and allows you to intercept the original HTTP request, make changes to it, or redirect it to another address. And, importantly, this code can be changed dynamically through the admin panel or the API.

What does the A/B test architecture look like in this case?

  1. The user requests the URL and gets to the CDN Edge server, where our Service Worker is running.
  2. Having determined the variation, the Worker issues a request for the appropriate route. If the page is in the cache, the CDN returns it from there; if not, the request is sent to the origin (our web server) and the response is cached on the way back. Let me remind you that a page must be prepared upfront for each variation.
  3. Before giving the response to the client, we add a Set-Cookie header with the ID of the variation, so that subsequent requests land on the same variation.
  4. We return the response, marked up for the variation and carrying the variation cookie.

Example code and Cloudflare sandbox:
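The original post embeds the example code and a link to a Cloudflare sandbox here. As a rough illustration of the idea, a minimal Worker might look like this, assuming the experimental page lives at a hypothetical /landing route with pre-built variations at /landing-a and /landing-b:

```javascript
// Cloudflare Worker (Service Worker syntax): route each visitor to one of two
// pre-built variation pages and make the assignment sticky with a cookie.
addEventListener('fetch', (event) => {
  event.respondWith(handleRequest(event.request));
});

async function handleRequest(request) {
  const url = new URL(request.url);

  // Only the experimental page participates; everything else passes through.
  if (url.pathname !== '/landing') {
    return fetch(request);
  }

  // Sticky assignment: reuse the cookie if present, otherwise flip a coin.
  const cookie = request.headers.get('Cookie') || '';
  const match = cookie.match(/ab_variation=([AB])/);
  const variation = match ? match[1] : Math.random() < 0.5 ? 'A' : 'B';

  // Fetch the pre-built page for this variation (served from cache when possible).
  url.pathname = variation === 'A' ? '/landing-a' : '/landing-b';
  const response = await fetch(new Request(url.toString(), request));

  // Copy the response so its headers are mutable, then set the variation cookie.
  const tagged = new Response(response.body, response);
  tagged.headers.append(
    'Set-Cookie',
    `ab_variation=${variation}; Path=/; Max-Age=${30 * 24 * 3600}`
  );
  return tagged;
}
```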

In this way we achieve the ultimate goal — users can “randomly” get different content under the same link. The only remaining thing is to trace their reaction.

To start or stop the test, or change the traffic distribution, we simply redeploy our JavaScript to the Worker by the means available to us: through the admin UI or through the API.

As we remember, A/B test traffic needs to be targeted (filtered). In the case of Server Side, we can only use the information contained in the original request. For example:

  • User Agent will help identify the browser or device;
  • Cookie;
  • IP allows filtering by geolocation;
  • Host. An A/B test is usually developed for one or a few pages of the site; for the rest, the experimental routing is not needed.

Results for Server-Side

Pros:

  • Because the desired variation is loaded immediately and is not modified afterwards, this approach preserves the original page performance.
  • For pages with static content, each variation can be added to the cache.
  • If the hypothesis is confirmed, it is enough for us to switch all traffic to the winning variation.
  • Ideal for landing pages.

Cons:

  • Because we can only use data from the original http-request to target the A/B test, we don’t have enough flexibility in determining the test audience.
  • For simple A/B tests, the development overhead is higher, because even the slightest change requires duplicating the entire page just to introduce that minor change.
  • This technique is well suited for landing pages, because they are the same for everyone. If the page is customized for a specific user or it contains a lot of dynamic content, the benefits are offset.

Activity tracking

The main reason all of this is built is to collect data for further analysis. And the more different segments we collect, the better. However, we should not grab everything in sight, but thoughtfully decide what is needed and limit ourselves to what concerns the key metrics of the experiment.

If we go back to our blue and red buttons, the key metric will be the conversion (pressing the button). That is, we need to track “pressed” and “not pressed” as a basic minimum. This is enough to compare the effect of the button color change in general, but not enough to understand our audience. As additional data points (segments), we can track the user context: browser, device, screen size, location, currency, payment method, time of day, everything that will allow us to segment the audience and analyze behavior within these segments or at their intersections. No personal information is required, and in many jurisdictions it is illegal to collect it without consent; aggregated data is enough to detect behavioral patterns or anomalies.

With such data on hand, we can conclude, for example, that the red button works better for users of 4k screens from Texas at night.

And during normalization, we can try to customize the page so that everyone sees the blue button while this particular segment of users sees the red one, if that brings additional value to users and benefits the business.

Architecture

How can this be organized technically? Again, there are big players in this market, such as Google Analytics or Facebook Analytics, each with its own API, set of rules and analytical tools, and hardly anyone will be able to compete with them in the field of data and analytics. However, the underlying organization is approximately the same everywhere.

Everything is based on events generated by the user: the user came to the site, went to another page, clicked, paid, and so on. Events take place in some context: a session, a device, a region.

Events can be generated by default, such as a page load or following a link. Or they can be custom, in which case they need to be added separately: pressing a button, hovering the cursor over an element, starting to interact with a form, scrolling the page, anything that can be reached in the DOM and matters for the test.

Based on a set of anonymous and unique data available on the client (IP, session, agent), a GUID is generated to which all these events will be linked so that the data can be properly aggregated.

That is, a typical payload, generated and sent to the analytics platform when an event is triggered, has approximately the following form:
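The exact fields differ from platform to platform; a hypothetical payload might look roughly like this:

```javascript
// Hypothetical payload sent to the analytics platform when the test button is clicked.
const payload = {
  userId: 'a2f4c1de-6b7e-4f3a-9c1d-8e5b2a7f0c91', // anonymous GUID, not personal data
  event: 'buy_button_click',                      // custom event name
  experiment: 'button_color',
  variation: 'B',
  context: {
    url: window.location.href,
    userAgent: navigator.userAgent,
    screen: `${window.screen.width}x${window.screen.height}`,
    timestamp: new Date().toISOString(),
  },
};
```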

In practice, it all comes down to placing on your page a JavaScript snippet of the selected tracking system, initialized with your account in that system. The default events will be sent automatically, and custom ones will need to be added to the page or experiment code manually.

The limitations of such tracking include browser plug-ins that block known tracking systems to preserve privacy.

Browsers may also impose some restrictions on sending data to third-party sites. Such users will not be included in the experiment because we cannot obtain data about them. This is not a critical issue as long as the percentage of these users on our site is small.

If for some reason there is a need to build your own internal tracking system, the approach remains the same: we generate events, send them to the backend, store them in a database, and think about how analysts will work with the data. The task is far from trivial, but it is well understood, and no browser will block such requests.
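As a sketch of the sending side, assuming a hypothetical /track endpoint on our own backend (navigator.sendBeacon delivers the data even if the user is navigating away):

```javascript
// Send an event to our own backend; fall back to fetch if sendBeacon is unavailable.
function track(event, data) {
  const body = JSON.stringify({ event, ...data, ts: Date.now() });
  if (navigator.sendBeacon) {
    navigator.sendBeacon('/track', body);
  } else {
    fetch('/track', { method: 'POST', body, keepalive: true });
  }
}

track('buy_button_click', { experiment: 'button_color', variation: 'B' });
```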

Instead of conclusions

A/B testing is a powerful marketing tool in general, and for eCommerce in particular. It helps to understand the behavior of the audience, make unbiased decisions based on hypotheses confirmed by data, and, as a result, increase the conversion of landing pages, select the optimal user experience, improve SEO metrics and more.

We live in a time of a new fever — information fever. Data is the new gold, and those who have it in sufficient quantity and quality have a great advantage over their competitors.

Pavlo Kyreyto,

Engineering Manager at JustAnswer
