Concepts in Experiment Design

Published in The Startup · 8 min read · Oct 5, 2020

Image source: https://www.topbots.com/best-multivariate-ab-testing-tools-for-conversion-rate-optimization/

Introduction

Correlation does not imply causation: just because two variables are related to one another does not mean that a change in one will cause a change in the other. However, there may be cases where you want to say that one variable causes another to change. For example, when designing a website change, you might want to say that the change results in visitors making more purchases. To test your hypothesis and understand the scope of the conclusions that can be drawn from the data, you should run an experiment.

Types of Study

There are many ways in which data can be collected in order to test or understand the relationship between two variables of interest. These methods can be put into three main types, based on the amount of control that you hold over the variables in play:

If you have a lot of control over features, then you have an experiment.
If you have no control over the features, then you have an observational study.
If you have some control, then you have a quasi-experiment.

For experiments, there are two main designs:

Between Subjects: In a between-subjects experiment, each unit only participates in, or sees, one of the conditions being used in the experiment. Participants are generally divided into two groups. The group that receives no manipulation is known as the control group. The other group receives the manipulation we wish to test, such as a new drug or a new website layout, and is known as the experimental group. We can compare the outcomes between groups in order to make a judgment about the effect of our manipulation. For web-based experiments, this kind of basic experiment design is called an A/B test: the “A” group represents the old control, and the “B” group represents the new experimental change.

Within Subjects: If an individual completes all conditions, rather than just one, this is known as a within-subjects design. Within-subjects designs are also known as repeated measures designs. By measuring an individual’s output in all conditions, we know that the distribution of features in the groups will be equivalent. For example, if an individual rates three different color palettes for a product, we can know if a high rating for one palette is particularly good compared to the others (e.g. 10 vs. 5, 6) or if it’s not a major distinction (e.g. 10 vs. 8, 9).
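The random split into a control group and an experimental group used in a between-subjects design can be sketched in a few lines. This is only a minimal illustration, not a method the course material prescribes: the function name `assign_groups`, the fixed seed, and the even 50/50 split are all assumptions made for the example.

```python
import random

def assign_groups(unit_ids, seed=0):
    """Randomly split units into two groups of (near) equal size.

    Hypothetical helper: each unit lands in exactly one group,
    as a between-subjects design requires.
    """
    rng = random.Random(seed)   # seeded only so the example is repeatable
    shuffled = list(unit_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"control": shuffled[:half], "experiment": shuffled[half:]}

groups = assign_groups(range(10))
print(groups)
```

Because the assignment is random, any feature of the participants (age, device, prior purchases) should be spread similarly across both groups on average, which is what lets a difference in outcomes be attributed to the manipulation.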

Types of Sampling

During research, it is often unreasonable in both time and money to collect responses from every single person in the population. This is where sampling comes in. The goal of sampling is to take only a subset of the population and then use the responses from that subset to make an inference about the whole population. Two basic sampling methods that are frequently used are:

Simple Random: Each individual in the population has an equal chance of being selected. Random selection is made until the desired sample size is obtained. Since everyone has an equal chance of being drawn, we can expect the feature distribution of selected units to be similar to the distribution of the population as a whole.

Stratified Random: You first divide the entire population into disjoint groups, or strata. That is, each individual must be a part of one and only one group. Based on each group’s share of the overall population, you calculate how many people should be sampled from that group. Then you use simple random sampling to select that many individuals from each group.
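Both sampling methods can be sketched with Python’s standard library. The population below is made up for illustration (100 users, 70 on mobile and 30 on desktop), and the helper names `simple_random_sample` and `stratified_sample` are hypothetical. The stratified version uses proportional allocation, which is the approach described above.

```python
import random
from collections import defaultdict

def simple_random_sample(population, n, seed=0):
    """Every unit has the same chance of being drawn."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def stratified_sample(population, stratum_of, n, seed=0):
    """Sample from each stratum in proportion to its share of the population.

    Rounding means the total can differ slightly from n when the
    proportions do not divide evenly.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for unit in population:
        strata[stratum_of(unit)].append(unit)   # each unit joins exactly one stratum
    sample = []
    for members in strata.values():
        k = round(n * len(members) / len(population))
        sample.extend(rng.sample(members, k))
    return sample

# Made-up population: 70 mobile users and 30 desktop users.
population = [(f"user{i}", "mobile" if i < 70 else "desktop") for i in range(100)]

srs = simple_random_sample(population, 10)
strat = stratified_sample(population, stratum_of=lambda u: u[1], n=10)
print(sum(1 for u in strat if u[1] == "mobile"), "mobile users in stratified sample")
```

A simple random sample of 10 will contain roughly 7 mobile users on average but can vary from draw to draw, while the stratified sample fixes the split at 7 and 3.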

Measuring Outcomes

The goals of the study may not be the same as the way you evaluate the study’s success. The objective features by which you evaluate performance are known as evaluation metrics. As a rule of thumb, it’s a good idea to consider the goals of a study separate from the evaluation metrics. This provides a couple of useful benefits.

First, this makes it clear that the metric isn’t the main point of a study: it’s the implications of the metric relative to the goal that matter. This is especially important if a metric isn’t directly attached to the goal.
Second, keeping the metric separate from the goal can clarify the purpose of conducting the study or experiment. It makes sure we can answer the question of why we want to run it.

Additional Concepts

There are additional concepts and terms that are commonly used for designing experiments, especially for web-based studies.

1. Funnel

In a web experiment, we need to create a user funnel. A funnel is the flow of steps you expect a user to take. Typically, the funnel ends at the place where your main evaluation metric is recorded, and includes a step where your experimental manipulation can be performed. For example, we might think of the following steps for someone to purchase a product in an online store:

Visit the site homepage
Search for a desired product or click on a product category
Click on a product image
Add the product to the cart
Check out and finalize purchase

One property of user funnels is that there will typically be some dropoff in the number of users moving from step to step, much like how an actual funnel narrows from a large opening to a small exit. Outside of an experiment, funnels can be used to analyze user flows. Observations from these flows can then motivate experiments that try to reduce dropoff.
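As a rough sketch of how dropoff might be computed for the purchase funnel above, the snippet below takes a count of users at each step and reports the share lost between consecutive steps. The counts are invented purely for illustration, and `dropoff_rates` is a hypothetical helper name.

```python
# Made-up user counts for each funnel step (illustrative only).
funnel_counts = [
    ("Visit the site homepage", 10000),
    ("Search or click a product category", 6000),
    ("Click on a product image", 3000),
    ("Add the product to the cart", 900),
    ("Check out and finalize purchase", 450),
]

def dropoff_rates(steps):
    """Return (from_step, to_step, share of users lost) for each consecutive pair."""
    rates = []
    for (prev_name, prev_n), (name, n) in zip(steps, steps[1:]):
        rates.append((prev_name, name, 1 - n / prev_n))
    return rates

for prev_name, name, rate in dropoff_rates(funnel_counts):
    print(f"{prev_name} -> {name}: {rate:.0%} dropoff")
```

The step with the largest dropoff is often a natural candidate for an experimental manipulation.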

2. Unit of Diversion

Once you have a funnel, you need to figure out a way to assign users to either a control group or an experimental group. The basis on which you make this assignment is known as the unit of diversion. The different options for diversion are as follows:

Event-based diversion (e.g. pageview): Each time a user loads up the page of interest, the experimental condition is randomly rolled. Since this ignores previous visits, it can create an inconsistent experience if the condition causes a user-visible change.
Cookie-based diversion: A cookie is stored on the user’s device, which determines their experimental condition as long as the cookie remains on the device. Cookies don’t require a user to have an account or be logged in, but can be subverted through anonymous browsing or a user just clearing out cookies.
Account-based diversion (e.g. User ID): User IDs are randomly divided into conditions. Account-based diversion is reliable, but requires users to have accounts and be logged in.
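One common way to keep cookie-based or account-based diversion consistent across visits is to hash the identifier instead of rolling a fresh random number each time. The article does not prescribe this technique, so treat the sketch below as one possible approach. The function name `assign_condition`, the experiment name, and the 50/50 split are assumptions for the example.

```python
import hashlib

def assign_condition(diversion_id, experiment_name, treatment_share=0.5):
    """Deterministically map a cookie ID or user ID to a condition.

    The same ID always hashes to the same bucket, so a returning user
    sees the same condition without any assignment being stored.
    """
    key = f"{experiment_name}:{diversion_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000   # value in [0, 1)
    return "experiment" if bucket < treatment_share else "control"

# Hypothetical IDs: repeated calls with the same ID give the same answer.
print(assign_condition("user-42", "new-checkout-layout"))
print(assign_condition("user-42", "new-checkout-layout"))
```

Including the experiment name in the hashed key means the same user can fall into different groups in different experiments, which avoids the same people always ending up in the experimental group.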

3. Types of Metrics

As already discussed, we need to define the features by which we evaluate how well the experiment performed against its goal.

There are two major categories of metrics that can be considered:

Evaluation metrics: Metrics for which we hope to see a difference between the control and experimental groups, telling us whether our manipulation was a success. For example, we might want to see an increased click-through rate from search results to products, or an increase in overall revenue.

Invariant metrics: Metrics that we hope will not be different between groups. Metrics in this category serve to check that the experiment is running as expected. For example, in an experiment with cookie-based diversion, the number of cookies generated for each group would be a good invariant metric.
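The two kinds of metrics are typically checked with different expectations: an invariant metric should show no significant difference between groups, while an evaluation metric hopefully does. The sketch below uses normal-approximation z-tests, which is one reasonable choice for large samples rather than the only one. The function names and all counts are made up for illustration.

```python
import math

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))

def invariant_check(n_control, n_experiment, expected_share=0.5):
    """Test whether group sizes match the intended split (e.g. cookie counts)."""
    n = n_control + n_experiment
    sd = math.sqrt(n * expected_share * (1 - expected_share))
    z = (n_control - n * expected_share) / sd
    return z, two_sided_p(z)

def evaluation_check(successes_c, n_c, successes_e, n_e):
    """Two-proportion z-test on a rate metric such as click-through rate."""
    p_c, p_e = successes_c / n_c, successes_e / n_e
    p_pool = (successes_c + successes_e) / (n_c + n_e)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_e))
    z = (p_e - p_c) / se
    return z, two_sided_p(z)

# Made-up numbers: 5000 vs 5100 cookies, 500 vs 600 click-throughs.
print(invariant_check(5000, 5100))
print(evaluation_check(500, 5000, 600, 5100))
```

If the invariant check comes back significant, that points to a problem in how the experiment is running (for example, in the diversion), and the evaluation metric should not be trusted until it is resolved.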

Validity of the Experiment

When designing an experiment, it’s important to keep in mind validity, which defines how well our conclusions can be supported. There are three major conceptual dimensions upon which validity can be assessed:

Construct Validity: It is tied to how well our goals are aligned to the evaluation metrics. Poor construct validity can come about when an evaluation metric does not actually measure something related to the desired outcome concept.
Internal Validity: This refers to the degree to which a causal relationship can be derived from an experiment’s results. Controlling for and accounting for other variables is key to maintaining good internal validity.
External Validity: It is concerned with the ability of an experimental outcome to be generalized to a broader population. This is most relevant for experiments that involve sampling: how representative is the sample of the whole population?

Bias in the Experiment

The different types of bias that can be encountered in the experiment are:

Sampling biases are those that cause our observations to not be representative of the population. Studies that use surveys to collect data often have to deal with self-selection bias: the types of people who respond to a survey might be qualitatively very different from those who do not. One type of sampling bias related to missing data is survivor bias, which arises when losses or dropout of observed units are not accounted for in an analysis.
Novelty bias causes users to change their behavior simply because they’re seeing something new. We might not be able to gauge the true effect of a manipulation until after the novelty wears off and population metrics return to a level that actually reflects the changes made.
Order bias appears when running a within-subjects experiment. The order in which conditions are completed could have an effect on participant responses. A primacy effect is one that affects early conditions, perhaps biasing them to be recalled better or to serve as anchor values for later conditions. A recency effect is one that affects later conditions, perhaps causing bias due to being fresher in memory or due to task fatigue.
Experimenter bias occurs when the presence of the experimenter affects participants’ behavior or performance, especially in face-to-face experiments. If an experimenter knows what condition a participant is in, they might subtly nudge the participant towards the expected result through their interactions. In addition, participants may act differently in the presence of an experimenter to try to act in the ‘right’ way, regardless of whether they actually know what the experimenter is looking for.
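For order bias in particular, a common mitigation is counterbalancing: varying the order of conditions across participants so that no condition is always first or always last. The article does not cover this technique, so the sketch below is an assumption about how one might handle it. The helper name `counterbalanced_orders` and the participant IDs are hypothetical, and the conditions reuse the three color palettes from the within-subjects example.

```python
import itertools

def counterbalanced_orders(conditions, participant_ids):
    """Cycle participants through every possible ordering of the conditions."""
    orders = list(itertools.permutations(conditions))
    return {pid: orders[i % len(orders)] for i, pid in enumerate(participant_ids)}

conditions = ["palette A", "palette B", "palette C"]
participants = [f"p{i}" for i in range(6)]   # hypothetical participant IDs

orders = counterbalanced_orders(conditions, participants)
for pid, order in orders.items():
    print(pid, order)
```

With three conditions there are six possible orderings, so six participants (or any multiple of six) see each condition in each position equally often. With many conditions the number of orderings grows quickly, and randomizing the order per participant is a simpler alternative.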

Ethics in Experimentation

Before you run an experiment, it’s important to consider the ethics of how you will treat your participants. While different fields have developed different standards, they still have a number of major points in common:

Minimize participant risk
Have clear benefits for risks taken
Provide informed consent
Handle sensitive data appropriately

SMART for Experiment Design

There’s a mnemonic called SMART that teams use to plan out projects, and it also applies well to designing experiments. The letters of SMART stand for:

Specific: Make sure the goals of your experiment are specific.
Measurable: Outcomes must be measurable using objective metrics.
Achievable: The steps taken for the experiment and the goals must be realistic.
Relevant: The experiment needs to have purpose behind it.
Timely: Results must be obtainable in a reasonable time frame.

Conclusion

In this post we covered the following concepts in experiment design:

What an experiment is, the different types of study, and sampling techniques
Goals and Metrics used to evaluate an experiment
Pitfalls of validity and bias
Ethical experiment design

Are you ready to design an experiment?

Reference: Udacity Data Scientist Nanodegree