AB testing by Google. Udacity. Summary

My synopsis of the course

Yurywallet
14 min read · Aug 8, 2022

Quick notes

This is my synopsis of the AB testing course at Udacity by Google.

Big thanks to the team of instructors — Carrie Grimes, Caroline Buckey, Diane Tang — for a structured explanation of all the stages and for nailing down the main pain points of the process.

People are often not good at estimating the quality of decisions, and when this is combined with deferring to the opinions of more senior or more powerful people, it may lead to poor future performance. That is why AB testing is so popular now: it supports better, more data-driven decision making and creates a framework to test all the important moves.

Table of contents

PART 1. Overview of A/B testing

What is AB testing for?

AB testing is a technique that facilitates data-driven decisions based on the analysis of performance differences between two groups of users that get a different experience (e.g. design changes, UI, UX, etc.) or treatment (incentives, etc.).

The following questions can be answered as a result:

  • is any impact from the change observed?
  • what is the magnitude of the impact?
  • what is the direction of the impact of the change?
  • is the result statistically significant? (repeatable)
  • is the result practically significant? (brings business value vs costs)

All the calculations for this part in Python can be found in a GitHub Notebook

PART 2. Policy and Ethics for Experiments

This part is mainly about data collection and how the need to collect and track data has to be communicated to users. Topics covered:

  • risks
  • benefits
  • choice/alternatives
  • privacy (sensitive data, protection, handling)

PART 3. Choosing and Characterizing metrics

An AB test helps to track causality and creates clarity in understanding the impact.
It is a great tool not only from the perspective of continuous (iterative) improvement (Agile), but it also helps to track the impact of each separate step.

Metric definition

When defining the performance metric, one has to answer the following questions:

  • What will be measured as a performance indicator?
  • How will it be measured?
  • What will it be used for?

How to sanity check the metric?

After the metric is defined and can be measured (calculated), the following checks have to be performed to ensure that there are no problems with the tracking or calculations:

  • AA test — a test in which no change is applied and the metric is compared across the splits. Results should be similar (eg. no change in the metric).
  • How good the random split is — check stratification by age, gender, region, language, etc.
  • How the numbers compare to other open or third-party data or, for example, to a “common value” known for the industry

What are “high level” user metrics?

  • Business objective (market share, engagement, etc.)
  • Financial sustainability (revenue, ROI, GMV, etc.)

Funnel-based user metrics

These metrics describe how the user moves towards the desired outcome.

Funnel example (web product):

  • Main page
    - # users visited main page
  • Products exploration:
    - # users viewed the list of products
    - # users viewed product details
    - # products viewed (count)
  • Actions
    - clicks (CTR)
    - create account
    - completed steps: step 1 (like add product to the basket), step 2…
  • Repeated actions (second purchase, order, etc.)
    - #returning users
    - for subscription based businesses it can be users that continue their subscription for the next period (sequence of recurring orders)
  • Product specific actions (if a specific product or product feature is of interest):
    - purchases of some brands, sku, etc
    - clicks (order) for a particular product (probability of progression)
  • Optimizers (other elements that are added to the funnel)
    - up-sells: bigger or more expensive variants
    - cross-sells: different products (could be based on recommendation engine)

As a general metric for funnel “success”, a count or a rate of progression (eg. percentage relative to the previous step) can be used.

Detailed metrics. Examples:

  • views (total)
  • unique views (NB! Pageviews can be skewed, as the page can be cached by the browser and the counter will not be updated)
  • returning users (#sessions in different days, #repeated (recurring) purchases)
  • business metrics (revenue per 1000 queries)
  • time at page (in sec)
  • scroll point achieved (50% page, 100% page read)
  • bounce rate (users who leave without any actions)
  • click rate (NB! double-click filtering should be implemented based on the latency between clicks, since a click can be unintentional, e.g. on mobile while scrolling)
  • active users (it needs to be defined what counts as “active”)
  • check-out (basket)
  • orders
  • log-in (number of consecutive log-ins of user)

Click through rate: # of clicks on one button or link / # of page views containing the button or link
Click through probability: # of unique visitors who click a button or link at least once / # of unique visitors who view the page containing the button or link
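
As a quick illustration of the difference between the two definitions, here is a minimal sketch (not from the course notebook) that computes both from a hypothetical event log with made-up columns visitor_id and event:

```python
import pandas as pd

# hypothetical event log for one page with a single button
events = pd.DataFrame({
    "visitor_id": [1, 1, 1, 2, 2, 3, 3, 3, 4],
    "event":      ["view", "click", "click", "view", "click",
                   "view", "view", "click", "view"],
})

views = (events["event"] == "view").sum()
clicks = (events["event"] == "click").sum()
ctr = clicks / views  # click-through rate: clicks per page view

# click-through probability: unique visitors who clicked at least once,
# divided by unique visitors who viewed the page
viewers = events.loc[events["event"] == "view", "visitor_id"].nunique()
clickers = events.loc[events["event"] == "click", "visitor_id"].nunique()
ctp = clickers / viewers

print(f"CTR = {ctr:.2f}, CTP = {ctp:.2f}")  # CTR = 0.80, CTP = 0.75
```

Visitor 1 clicks twice, which inflates CTR but counts only once for CTP, which is why the two numbers differ.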

NB! As in all cases, intuition behind user actions should be built, so questions like “Why do people start or stop doing some action, for example, in the funnel?” have to be answered.

How to find ideas for good metric?

  • User experience research (UER)
    - good for brainstorming
    - can be in depth
    - special equipment can be used (eg. eye-tracking camera)
    - needs small number of users
    - results need to be validated with, for example, retrospective analysis
  • Focus groups
    - less in depth analysis (compared to UER)
    - can include more users
    - can get feedback
    - can ask hypothetical questions (e.g. launching new product or changing inventory)
    - need to be aware of “group think” (somebody in the group can shift the group’s way of thinking)
  • Survey
    - inexpensive to run
    - can get information about metric that you cannot measure or collect information about (happiness, satisfaction, job offers, etc.)
    - need to be aware of how questions are formulated; also, answers may not be truthful
  • Retrospective analysis (old data)
    - be aware of seasonality (weekly, daily, monthly)
    - check time between actions (latency)
    - other aspects of past performance
  • Academic articles
  • Colleagues
  • Human evaluators — people who use the product and provide feedback

Why may an effect be difficult to measure?

  • There are cases when the effect takes longer than the experiment (buying a new house, getting a new job, etc.). The effect may need 6 months, a year, etc. to mature, which is much longer than the experiment duration (a couple of weeks)
  • Hard to collect (no access to the data) or extract information (we can run a survey, but the response rate can be low or biased)
  • Nebulous metrics (things that cannot be clearly defined)
    - how much have skills improved? how can we define and measure this?

There should be a clear, consistent and specific definition of the terms for the metric and of how its elements are tracked and calculated.

For example, how do we define an “active user”? There are multiple parameters to be defined before a user can be called “active”, for example:

  • active during the last hours, days, weeks…
  • what actions (types) matter and how they can be measured?
  • how these “actions” are summarized into a single metric: sum, percentage, average per day, median over a week, etc.

With this defined you can now:

  • explore what is the current state
  • sanity check the distribution
  • observe if the metric is actually moving as a reaction to some change
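
For instance, here is a minimal sketch (with made-up column names and an assumed definition) of one such “active user” metric: users with at least one qualifying action during the last 7 days.

```python
import pandas as pd

# hypothetical action log
log = pd.DataFrame({
    "user_id":   [1, 1, 2, 3, 3, 4],
    "action":    ["view", "purchase", "view", "login", "view", "purchase"],
    "timestamp": pd.to_datetime([
        "2022-08-01", "2022-08-05", "2022-07-01",
        "2022-08-06", "2022-08-07", "2022-06-15",
    ]),
})

as_of = pd.Timestamp("2022-08-08")
qualifying = {"view", "purchase"}   # which action types "matter" (an assumption)
window = pd.Timedelta(days=7)       # "active during the last 7 days"

recent = log[log["action"].isin(qualifying) & (log["timestamp"] > as_of - window)]
print(recent["user_id"].nunique())  # 2 active users (users 1 and 3)
```

Changing the window, the qualifying actions, or the aggregation gives a different “active user” metric, which is exactly why the definition has to be fixed before the experiment.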

How do we compare groups as a result of the experiment?

It can be one single metric, a set of important metrics or an aggregated metric which combines, for example, engagement and business impact.

There is the famous “Jam Experiment” that demonstrates the paradox of choice, where better engagement results in worse revenue. For such cases it is important to prioritize what matters more from the business perspective.

So, a good metric should be in line with the business value or expected business results. As a solution, it can be a composite metric (OEC — overall evaluation criteria), which in the above-mentioned case could be, for instance:

oec = w1*(C_exp - C_cont > C_min) + w2*(R_exp >= R_cont)

(C — engagement (# clicks, CTR, # views), R — revenue, w — weight)

This metric is a weighted sum and incorporates:
  • an engagement change only if that change exceeds the minimum practical uplift
  • a business result that is at least at the control level.

For the business metric, some epsilon range around 0 can be defined as an accepted (allowed) loss. A minimal practical effect (delta) can also be defined and added to the formula.
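
A minimal sketch of the OEC formula above; the weights and the practical threshold are illustrative assumptions, not values from the course:

```python
def oec(c_exp, c_cont, r_exp, r_cont, c_min=0.002, w1=0.7, w2=0.3):
    """Weighted sum of two indicators:
    - engagement counts only if its uplift exceeds the practical minimum c_min
    - revenue counts only if the experiment is not below control
    """
    engagement_ok = (c_exp - c_cont) > c_min
    revenue_ok = r_exp >= r_cont
    return w1 * engagement_ok + w2 * revenue_ok

# CTR up by 0.005 (above the 0.002 practical minimum), revenue slightly down
print(oec(c_exp=0.105, c_cont=0.100, r_exp=9.8, r_cont=10.0))  # 0.7
```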

What is the benefit of having an “overall evaluation criteria” (OEC)?

It is always easier to make a decision based on one number than on a set of numbers. But it is better to provide all the components of the weighted metric as well as the overall result, so that clear causality and an explanation of the result can be developed.

With an OEC it can be tricky to define the components and their weights.

One needs to decide:

  • What components to include?
  • How to aggregate components to a single metric (what weights to use, is it a sum or multiplication, etc.)?

From a business perspective, using an OEC:

  • May lead to over-optimization of the OEC while missing changes in its components
  • Can make it hard to answer why we observe a change without looking into the components

Normally, it is better to use more general (and reusable for other tests) metrics rather than one specifically designed for a test.

Building intuition behind the metric

Some common issues to be aware of:

  • Edge cases: end of the day (or month), when an action starts on one day and ends on another; the attribution of such cases needs to be decided
  • different browsers may treat the same actions differently
  • there can be differences in capturing data on mobile, tablets and desktops
  • touch screen devices have other unique events and miss some desktop event types (like “on hover”) and vice versa.

Filtration for the experiment

Before the analysis of the experiment, data cleaning is needed:

  • internal users (eg by IP)
  • some other IP addresses (with “strange” traffic)
  • spam, fraud, malicious
  • outliers (long sessions, too many clicks)

Slicing and segmenting

Common slicing categories are country, device, day of the week, etc.
This helps to check for anomalies or similarities. It is also good for evaluating and building intuition about the metric.

For example, with time-based slicing it can be beneficial to plot week-over-week or year-over-year ratios, where new data is compared (divided) to previous data over the same timeframe.
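
A minimal sketch (with simulated daily data) of the week-over-week ratio, i.e. each day divided by the same weekday one week earlier:

```python
import numpy as np
import pandas as pd

days = pd.date_range("2022-07-01", periods=28, freq="D")
daily = pd.Series(1000 + 50 * np.sin(np.arange(28)), index=days, name="views")

wow = daily / daily.shift(7)      # week-over-week ratio; first 7 days are NaN
print(wow.dropna().round(2).head())
# values close to 1.0 mean "same as last week"; spikes point to anomalies
```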

Summary metric examples

  1. sums or counts — how many, totals
  2. distributional — average per week (or mean, median, mode, percentiles)
  3. probabilities and rates (eg. CTR, CTP)
  4. ratios

Sensitivity and robustness of the metric

  • Sensitivity — the metric should react to the change
  • Robustness — if there should be no change, the metric shows no visible change

It could be a good idea to start the analysis by looking at a histogram and checking the shape of the distribution.
Based on the intuition and the distribution, some descriptive statistics describe the data better than others.
For example, for a normal distribution the mean could be a good metric, but for an exponential distribution the median or another percentile-based statistic can make more sense.

Robustness example

Let’s assume we need to measure video load time or customer service call duration. In these cases there can be outliers that shift the mean to the right, and the median will be a better metric than the mean.
But in cases when we need to evaluate a change that impacts only a fraction of users, the median will probably not capture the effect. In such cases another percentile-based metric can work better (like the 90th, 95th or 99th percentile).
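
A minimal sketch with simulated load times: a few extreme outliers pull the mean while the median barely moves, and a change that affects only the slowest ~5% of sessions is invisible to the median but visible to a high percentile:

```python
import numpy as np

rng = np.random.default_rng(0)
latency = rng.exponential(scale=2.0, size=1_000)   # baseline load times, sec

# outliers shift the mean but barely move the median
with_outliers = np.concatenate([latency, [60, 90, 120]])
print(latency.mean(), with_outliers.mean())           # mean is pulled up
print(np.median(latency), np.median(with_outliers))   # median is almost unchanged

# a change that slows down only the slowest ~5% of sessions
changed = latency.copy()
changed[np.argsort(changed)[-50:]] *= 1.5
print(np.median(latency), np.median(changed))                   # median misses the change
print(np.percentile(latency, 99), np.percentile(changed, 99))   # p99 captures it
```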

How to test sensitivity and robustness?

  • Run a small test
  • Run an AA test. This can show whether the metric is too sensitive and what its “natural” variability is
  • Retrospective analysis (compare against results of similar changes or periods in the past)

Example

Let’s assume we measure the latency of video download time. There can be two groups of people, which will be reflected as two modes in the distribution — those with a fast and those with a slow internet connection. These two groups will demonstrate different performance and may react to a change very differently (for a slow connection the size of the video matters a lot, while for a fast one it is not a big issue).

Analysis of robustness

  • choose some similar objects (like videos of the same size, length, quality, encoding, file type)
  • plot the distribution of data per object and compare across objects (they should be similar)
  • plot a set of metrics (eg. percentiles) for these objects - if the metric is not robust there will be fluctuations in its value (zig-zagging)

Analysis of sensitivity

  • Test the change. With the change implemented, the metric should move in the expected direction (expectation vs reality)
  • Plot a set of metrics (eg. percentiles) for these objects - if there is no visible effect (compared to what was expected), the metric may not be sensitive enough

As a result a metric that is both robust and sensitive can be found.

Analysis of variability

  • AA test (sanity check). Compare with what you were expecting to get
  • Bootstrap — run one bigger experiment and then randomly split this large sample of users into smaller groups, calculate the metric for each and compare the results. This helps to discover the variability of the metric between samples (see the sketch below).
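
A minimal sketch (with simulated click data) of the bootstrap idea: resample smaller groups from one large sample and look at how much the metric fluctuates between them:

```python
import numpy as np

rng = np.random.default_rng(42)
clicks = rng.binomial(1, 0.10, size=100_000)    # 1 = user clicked, true rate 10%

group_size = 5_000
rates = np.array([
    rng.choice(clicks, size=group_size, replace=True).mean()   # one resampled group
    for _ in range(1_000)
])

print(rates.mean(), rates.std())          # ~0.10 and its "natural" variability (~0.004)
print(np.percentile(rates, [2.5, 97.5]))  # the spread an AA-like comparison should expect
```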

PART 4. Designing experiment

How to assign users to Control vs Treatment (unit of diversion)?

There are the following options for tracking:

  • event based (query, load)
  • person based
    - user id (login, email, user_name)
    - cookies (anonymous; specific to a browser or device; can be cleared)
  • device_id (mobile only)
  • IP address (not very useful, but can be the only possible choice in some cases; multiple users can be aggregated behind one IP)

There is no ideal tracking. The same person may end up in a different group after some change (was in control on desktop, but after switching to mobile happened to be in treatment).
NB! User consent may be needed for cookie- or user_id-based tracking. If at some point, for example, an email is stored against a cookie, then it is no longer an anonymous identifier and may require additional consent.

The choice depends on the consistency of results needed.

Person-based identifiers enable tracking the “user journey” and provide better consistency (compared to the other options):

  • user_id (if signed in, across all devices)
  • cookie (as long as it is not cleared and the same device is used)

Other common sampling methods are:

  • random sampling — random generator with predefined group size expectations
  • stratified sampling (pre-select features and make sure that new groups have similar distribution for those features — like age, country, etc.)
  • mod N — for example, by user_id mod N. Some residues are assigned to treatment and some to control, which allows tuning the size of each group (see the sketch below).
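
A minimal sketch of both deterministic assignment styles; the bucket counts, experiment name, and treatment share are illustrative assumptions:

```python
import hashlib

def assign_mod(user_id: int, n: int = 10, treatment_buckets=(0, 1)) -> str:
    """user_id mod N: buckets listed in treatment_buckets get the treatment,
    the rest stay in control; group sizes are tuned by the number of buckets."""
    return "treatment" if user_id % n in treatment_buckets else "control"

def assign_hash(user_id: int, experiment: str, treatment_share: float = 0.2) -> str:
    """Salted-hash variant: hash of (experiment name, user_id) mapped to [0, 1).
    Stable per user within an experiment, reshuffled for a new experiment."""
    h = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return "treatment" if int(h, 16) / 16**32 < treatment_share else "control"

print(assign_mod(12345))                    # 12345 % 10 = 5 -> control
print(assign_hash(12345, "new_checkout"))   # deterministic for this user + experiment
```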

When is consistency key?

Some changes are visible to users (like the color of a button) and should stay consistent for the user, at least within the same session or on the same device, so user-based tracking is preferred.
For changes that are not visible to users (changes in ranking or latency), event-based tracking may be used.

Unit of diversion vs Unit of analysis (can be denominator in a metric)

  • If event-based diversion is applied, then events can be treated as independent (a random draw)
  • With other identifiers we analyze “groups of events” (which are no longer independent), and this increases variability, as the sketch below illustrates.
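
A minimal sketch (with simulated data) of why this matters: the unit of diversion is the cookie, the metric is a per-event CTR, and because events within a cookie are correlated, the naive "independent events" standard error understates the variability seen when resampling whole cookies:

```python
import numpy as np

rng = np.random.default_rng(7)
n_cookies, events_per_cookie = 2_000, 20
p_cookie = rng.beta(2, 18, size=n_cookies)            # each cookie has its own click propensity
clicks = rng.binomial(events_per_cookie, p_cookie)    # clicks per cookie

# standard error assuming every event is an independent draw
n_events = n_cookies * events_per_cookie
p = clicks.sum() / n_events
analytic_se = np.sqrt(p * (1 - p) / n_events)

# empirical variability: resample cookies, the actual unit of diversion
boot = [
    rng.choice(clicks, size=n_cookies, replace=True).sum() / n_events
    for _ in range(500)
]
print(analytic_se, np.std(boot))   # the empirical SE is noticeably larger
```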

Cohorts

Cohorts (slices) are subsamples of the population, for instance based on time, browser, or a combination of factors (like location and age).
Cohorts are a valid sampling method when one needs to study problems like retention (new vs old users) or some other “future quality” effect. To run the experiment, the users from a cohort need to be split into control and experiment sub-cohorts.

There are multiple benefits to running experiments on a smaller portion of users:

  • safety, as there could be some unknown issues
  • with a smaller number of users, the duration of the experiment can be longer (daily and weekly seasonality can be covered)

Learning effect

A learning effect is a positive or negative effect from a change that becomes visible only after a certain time has passed. The effect might gradually increase or decrease with time (linearly, exponentially, etc.), or there might be a more drastic level shift after a certain time.

A change is a novelty, and users need time to adjust to the new interface or feature.

To study and track learning effect:

  • it is better to use a user-based unit of diversion
  • it is better to use cohorts
  • pre-period (an AA test on the cohorts) - there should be no difference
  • post-period — again an AA test to check whether there is still no difference

AB testing is an iterative process, and the test environment may be improved and adjusted:

  • cohorts and population may change (due to seasonality or any other external conditions or change of definition)
  • the choice of metrics or their calculation can be improved (tooling)
  • better knowledge of the data helps to filter out what is relevant to the business
  • better tools and monitoring help to run experiments more smoothly

PART 5. Analyzing results

Sanity checks

  • population sizing — Check that the experiment and control populations are comparable
  • invariance — Check that metrics that should not change had indeed not changed
    - Examples: # signed-in users, # cookies, some other events (download time)

What if a sanity check fails?
If the results are not consistent with expectations, then:

  • check day by day
  • slice by some parameter, like geo (country, language)
  • check tech infrastructure (some bugs may be discovered)
  • do not proceed with the experiment before there is clear understanding

If results are not statistically significant:

  • break down by platform, day of the week
  • try to find bugs in the setup
  • think about other hypotheses
  • cross-check results with other methods (for example, a sign test)

It can be that within each subgroup (e.g. new and experienced users) the experiment shows an improvement over control, but the total numbers show the opposite as a result of an uneven split and different subgroup sizes. This is known as Simpson’s paradox: there are different subgroups in the data, and within each subgroup the results are consistent, but when the subgroups are mixed together the conclusion flips.
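
A toy numeric illustration (made-up counts) of Simpson’s paradox: the experiment wins inside each segment but loses on the totals, because the segments are split unevenly between the two groups:

```python
segments = {
    #             control: (conversions, users)  experiment: (conversions, users)
    "new":          ((30, 1_000),                 (200, 5_000)),
    "experienced":  ((450, 5_000),                (100, 1_000)),
}

for name, ((cc, cn), (ec, en)) in segments.items():
    print(f"{name:12s} control {cc/cn:.1%}  experiment {ec/en:.1%}")  # experiment wins

cc = sum(c for (c, _), _ in segments.values())
cn = sum(n for (_, n), _ in segments.values())
ec = sum(c for _, (c, _) in segments.values())
en = sum(n for _, (_, n) in segments.values())
print(f"{'overall':12s} control {cc/cn:.1%}  experiment {ec/en:.1%}")  # control wins
```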

What if multiple metrics are checked for the experiment?

In this case, a multiple-comparisons correction has to be applied. Typically, the problem is addressed by requiring a stricter significance threshold for individual comparisons to compensate for the number of inferences being made.

1. Assume independence and use formula
alpha_all=1-(1-alpha)**num
Solve for alpha:
alpha=1-(1-alpha_all)**(1/num)

2. Bonferroni correction (could be too conservative)
alpha=alpha_all/num

Based on the new (stricter) alpha we can:

  • recalculate z-score
  • recalculate margin of error
  • update confidence interval
  • check whether this interval contains zero; if not, the result is still significant
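
A minimal sketch of both corrections and a corrected confidence interval for one metric; the counts are made up for illustration:

```python
import numpy as np
from scipy import stats

alpha_all, num = 0.05, 4                        # overall alpha, number of metrics tested
alpha_sidak = 1 - (1 - alpha_all) ** (1 / num)  # assumes independent metrics, ~0.0127
alpha_bonf = alpha_all / num                    # Bonferroni, 0.0125 (slightly stricter)

# two-proportion confidence interval for one metric with the corrected alpha
x_cont, n_cont, x_exp, n_exp = 974, 10_072, 1_105, 9_886
p_cont, p_exp = x_cont / n_cont, x_exp / n_exp
d_hat = p_exp - p_cont
se = np.sqrt(p_cont * (1 - p_cont) / n_cont + p_exp * (1 - p_exp) / n_exp)
z = stats.norm.ppf(1 - alpha_sidak / 2)         # stricter z-score
ci = (d_hat - z * se, d_hat + z * se)

print(alpha_sidak, alpha_bonf)
print(d_hat, ci)   # the metric stays significant only if the interval excludes zero
```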

The Bonferroni correction is a very simple method, but there are many other methods, including the closed testing procedure, the Boole-Bonferroni bound, and the Holm-Bonferroni method. There are also methods that control the false discovery rate (FDR) instead of the family-wise error rate (FWER).

If metrics move in different directions — one shows positive and the other negative results — then there is a need to understand what is the intuition behind that and what it means for business.
Multiple metrics help to understand behavior from different angles. Different metrics can be aggregated into a single one (OEC), and the best such metric is a combination that is valid for the company and in line with long-term decisions.

Decide what the results tell you:

  • Do we understand the change (what, why, how)?
  • Do we want to launch?
    - do we have significant and practical results?
    - what will the change actually do?
    - is it worth it?
    - what are the costs vs the revenue?

How to launch?

  • Ramp-up — launch on small percentage and gradually increase it until 100%
  • Remove filters — apply the change for one language first, then start testing for others

Gotchas
The effect may not be repeatable on the whole user base for many reasons, such as:

  • seasonality (vacations, holidays, etc.)
  • novelty effect
  • behavior can change

NB! A good practice is to keep a hold-out (system control group) that continues to receive the “unchanged” experience. This makes it possible to compare behavior over time.
Another approach is to apply pre-change vs post-change analysis.

What is the general recommendation on AB testing?

Check, check, double-check!

Additional ideas

If you need to run 2 experiments simultaneously, then split users into four groups:

Group A — does not see any change (system control)
Group B — does not see change 1, but sees change 2
Group C — sees both changes
Group D — sees change 1, but does not see change 2

As long as users are distributed uniformly, the effect of the “other” change is averaged out when comparing A+B vs C+D, so this comparison can detect the effect of change 1 (and, symmetrically, A+D vs B+C detects change 2).
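
A minimal sketch (simulated data, made-up effect sizes) showing that the A+B vs C+D comparison isolates change 1, because change 2 appears equally on both sides:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40_000
sees_1 = rng.integers(0, 2, n).astype(bool)   # change 1 on/off, uniform split
sees_2 = rng.integers(0, 2, n).astype(bool)   # change 2 on/off, independent split

base, effect_1, effect_2 = 0.10, 0.02, 0.05   # assumed conversion effects
converted = rng.random(n) < base + effect_1 * sees_1 + effect_2 * sees_2

# A+B = does not see change 1 (A: neither, B: only change 2); C+D = sees change 1
print(converted[sees_1].mean() - converted[~sees_1].mean())  # ~0.02 = effect of change 1
print(converted[sees_2].mean() - converted[~sees_2].mean())  # ~0.05 = effect of change 2
```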

