Two-Ply Or Not Two-Ply: Understanding A/B Experiment Concepts using Toilet Paper

Sawyer X
6 min read · May 8, 2023


Toilet paper roll with various equations on it

Having set the groundwork for A/B testing in a previous post, it’s time to start building our understanding of A/B testing in practice.

How do we create an experiment? What do we need to think about? Let’s flush out the components of controlled experiments and see how we can apply them to finding the correct position for a toilet paper roll.

Key Concepts

Let’s cover each concept and how it relates to an experiment in which we determine which toilet paper position (“over” or “under”) is the correct toilet paper position.

Hypothesis

Every experiment starts with a hypothesis. It is a guess or a prediction we make before testing to see if it’s true.

In this case, our hypothesis is that the “over” position (with the toilet paper facing you) is the superior position for toilet paper because it will be farther from the wall, which wipes away the anxiety of people needing to… well, wipe away.

Eligibility condition

The eligibility condition is the minimum bar someone must meet to participate in an experiment. Every experiment has a target population, and often the minimum bar is simply belonging to that population.

If I were to experiment with a new type of running shoe, I would probably want to limit my population to people who run. It wouldn’t be beneficial if I tested it on people who only wear sandals (we all know that one person!).

In our case, the minimal eligibility would be those with toilet paper in the “under” position because our experiment tests changing it to the “over” position.

Two boxes showing 200 people. One box is for 100 people, saying “over” position starting point, the other box is for 100 people, saying “under” starting point. Both boxes are the entire population. “over” starting point are excluded and “under” starting point are those we will experiment on.

Had we hypothesized that the “over” toilet paper roll position is better because it’s easier for cats to unroll and destroy, we would also need to limit the population to people with cats that inevitably find their way to the toilet.

This means we first find only people with cats, and then we need to experiment on those that use the “under” position so we can experiment with changing it to “over.” The more conditions, the smaller the sample size.
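To see how each extra condition shrinks the pool, here is a minimal sketch. The `Household` record, its field names, and the made-up population of 200 are all assumptions for illustration, not real data.

```python
from dataclasses import dataclass


@dataclass
class Household:
    # Hypothetical record; the field names are illustrative assumptions.
    household_id: int
    roll_position: str  # "over" or "under"
    has_cat: bool


def eligible(h: Household, require_cat: bool = False) -> bool:
    """Eligible if the roll currently hangs "under" (and, optionally, a cat lives there)."""
    if h.roll_position != "under":
        return False
    return h.has_cat or not require_cat


# Made-up population of 200: even ids hang "under", every fourth id has a cat.
population = [
    Household(i, "under" if i % 2 == 0 else "over", has_cat=(i % 4 == 0))
    for i in range(200)
]

under_only = [h for h in population if eligible(h)]
under_with_cat = [h for h in population if eligible(h, require_cat=True)]

print(len(population), len(under_only), len(under_with_cat))  # 200 100 50
```

One condition halves this invented population, and the cat condition halves it again.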

Fortunately — and unfortunately — cats are not part of this experiment.

Treatment

Treatment is the “B” in the A/B test (or the number 2 of it, if you will). The treatment is the change we make whose effect we want to observe. In our case, it is setting the toilet paper to the “over” position, which we hypothesize is superior.

Control group

To test our hypothesis, we need to establish a control group that provides the baseline for comparison.

Our control group will be half of our experiment population and will keep their toilet paper in the “under” position, so we can see how much difference the change (to the “over” position) made in the test group.

Test group

Our test group will receive the treatment: their toilet paper is changed to the “over” position, and their results are measured against the control group.

If we start with 100 people in the “under” position, 50 of them keep it (the control group), and the other 50, whom we switch to the “over” position, are our test group.

Two boxes (of 50 people each), both with the “under” starting point. One is the control group (of 50 people) that will keep it “under,” while the other half is the test group, which will be switching to “over.”

Random assignment

We can’t assign people to the test or control group using any deterministic mechanism, as people may have preexisting biases. For example, certain cultures might have a default position, so if we assign based on culture (or country, which can have a similar bias), we might group too many people with a specific preference into one of the groups. (And who knows, maybe the toilet paper drains clockwise in the Southern Hemisphere…)

Instead, we need to assign people to either of the groups randomly. With a large enough sample, this balances out differences in culture, geographical location, gender, age, etc. between the groups, so they cannot systematically skew the result.

After all, we don’t want to make a mess of it!
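A minimal sketch of random assignment might look like this. The function name `assign_groups` and the participant ids are assumptions for illustration. The seed is there only so the example gives the same split every time you run it.

```python
import random


def assign_groups(participant_ids, seed=None):
    """Shuffle participants and split them 50/50 into control and test."""
    rng = random.Random(seed)  # seeded only to make the example repeatable
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"control": shuffled[:half], "test": shuffled[half:]}


# 100 hypothetical participants, identified by the numbers 0..99.
groups = assign_groups(range(100), seed=42)
print(len(groups["control"]), len(groups["test"]))  # 50 50
```

Because the shuffle ignores everything about the participants, no trait (culture, country, hemisphere) can decide which group someone lands in.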

Dependent and independent variables

The independent variable (the variable we deliberately change in the experiment) is the orientation of the toilet paper (“over” or “under”). In contrast, the dependent variable (the outcome we measure to see whether it responds to that change) is the ease of use and the cleanliness of the bathroom.

Sample size

Despite what you think I mean by “sample size” in this experiment, the sample size is actually how many people will be in the experiment. The larger the sample size, the more reliably we can detect a real difference, and the more precise our estimate of it will be.

In other words, we generally need more people for greater certainty in our results. (Of course, they also need to be the right people, which is why we target them.)

In our example, we started with 200 people, of whom only 100 were eligible, but we probably need a much larger sample size (in other words, many more people) for the results to be trustworthy.

Beware the small sample size!
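How many people is enough? One common back-of-the-envelope approach is the normal-approximation formula for comparing two proportions. The sketch below assumes a yes/no metric (say, the share of people who rate the roll easy to use). That metric, the rates, and the usual defaults of a 5% significance level and 80% power are all assumptions for illustration.

```python
import math
from statistics import NormalDist


def sample_size_per_group(p_control, p_test, alpha=0.05, power=0.80):
    """Approximate people needed in EACH group to detect p_control vs p_test.

    Uses the standard normal-approximation formula for two proportions
    with a two-sided test and equally sized groups.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_test * (1 - p_test)
    n = (z_alpha + z_power) ** 2 * variance / (p_control - p_test) ** 2
    return math.ceil(n)


# Hypothetical rates: 60% satisfied with "under", hoping for 70% or 65% with "over".
print(sample_size_per_group(0.60, 0.70))
print(sample_size_per_group(0.60, 0.65))
```

Under these assumptions, even a ten-point improvement needs a few hundred people per group, and halving the difference we hope to detect roughly quadruples that number. Our 50 per group would not cut it.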

Statistical significance

We will need statistical tests to determine whether any differences between the test and control groups are statistically significant.

Researchers commonly use statistical tests (fancy math) to determine that any effect we see (like people using toilet paper with more gusto and zeal) is not a fluke.

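Statistical significance is typically defined as a p-value less than 0.05. The p-value is the probability of seeing a difference at least as large as the one we observed if the toilet paper position actually made no difference. A p-value below 0.05 therefore means such a result would turn up less than 5% of the time by chance alone.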
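As a rough sketch, here is one common way to get a p-value for two groups: a two-proportion z-test. It assumes a yes/no outcome per person. The counts below are invented, and the function name is mine, not from any library.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(success_control, n_control, success_test, n_test):
    """Two-sided p-value for the difference between two proportions (z-test)."""
    p_control = success_control / n_control
    p_test = success_test / n_test
    pooled = (success_control + success_test) / (n_control + n_test)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_test))
    if se == 0:  # everyone (or no one) succeeded in both groups: no difference to test
        return 1.0
    z = (p_test - p_control) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Made-up results: 30 of 50 happy with "under", 38 of 50 happy with "over".
p_value = two_proportion_p_value(30, 50, 38, 50)
print(f"p-value = {p_value:.3f}")
print("significant" if p_value < 0.05 else "not significant")
```

With these invented counts, a 16-point gap across 50 people per group still does not clear the 0.05 bar, which is another reason to beware the small sample size.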

Effect size

The effect size helps us determine the practical significance of our results — whether the difference between the two groups is large enough to be meaningful.

In simpler terms, once we confirm that the effect of changing the toilet paper position is real, we need to determine whether that change was helpful enough that we switch to it. It’s the “but do we care?” question.

Effect size also helps us to compare the results of different experiments or interventions. If two interventions show statistically significant results, we can use effect size to determine which intervention had a more substantial impact.
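For a yes/no metric, a few simple effect-size measures can be sketched like this. The rates are hypothetical. Cohen's h is one standard way to compare two proportions, with 0.2, 0.5, and 0.8 often used as rough markers for small, medium, and large effects.

```python
import math


def effect_sizes(p_control, p_test):
    """Absolute difference, relative lift, and Cohen's h for two proportions."""
    absolute = p_test - p_control
    relative = absolute / p_control
    cohens_h = 2 * math.asin(math.sqrt(p_test)) - 2 * math.asin(math.sqrt(p_control))
    return {"absolute": absolute, "relative": relative, "cohens_h": cohens_h}


# Hypothetical rates: 60% happy with "under", 76% happy with "over".
sizes = effect_sizes(0.60, 0.76)
for name, value in sizes.items():
    print(f"{name}: {value:.3f}")
```

Whether a given effect is big enough to act on is still a judgment call. The numbers only make the “but do we care?” question concrete.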

Steps for conducting an experiment

How would we go about creating this experiment? Let’s unroll the steps.

  1. Create a hypothesis: What do we think or assume is correct, and what experiment will support or refute it?
  2. Define the targeted population: Who are we testing this treatment on?
  3. Define the control and treatment: What is the default state, and what change do we intend to make? Make sure the sample size is adequate.
  4. Assign people randomly to each of these groups.
  5. Provide the treatment to the test group and collect data on their behavior.
  6. Run a statistical test on the results. If the difference is statistically significant, check the effect size to decide if we want to make this change.

Anticipating some questions

Is this the best example you could come up with?

Probably not, but even experiments from minor everyday disagreements can yield knowledge. Except in this case. In this case, we know the answer. It’s presented below.

Must I use a 50/50 split between the control and test groups?

No. However, an uneven split requires more math and tooling support (or a good grasp of statistics), so it’s easier to keep it at 50/50.
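Part of that math is that, for a fixed total number of people, an uneven split makes the comparison noisier. The sketch below assumes both groups have the same variability, so the standard error of the difference is proportional to sqrt(1/n_control + 1/n_test). The total of 1,000 people is made up.

```python
import math


def se_scale(total, test_share):
    """Quantity proportional to the standard error of the difference between groups."""
    n_test = total * test_share
    n_control = total - n_test
    return math.sqrt(1 / n_control + 1 / n_test)


baseline = se_scale(1000, 0.5)  # the 50/50 split
for share in (0.5, 0.3, 0.1):
    ratio = se_scale(1000, share) / baseline
    print(f"{share:.0%} in test: standard error x{ratio:.2f}")
```

Under this assumption, a 70/30 split costs about 9% more noise than 50/50, and a 90/10 split about 67% more, so you would need more people overall to reach the same certainty.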

Can I have more than one test group, testing multiple options?

You can. Not in this experiment because the toilet paper can only be in two positions, but you can in other scenarios.

It’s more complicated to have multiple test groups, but don’t fret. I’ll go over that in the future.

What happens if my population includes non-targeted groups?

Take the running shoe example. Would the new running shoe test well with that one person we know who only wears sandals? Would they switch to shoes? Probably not. Their results would likely just pollute the data we get.

This is what we call noise. It is additional data (like that one person’s experience) that doesn’t add useful information and forces us to collect even more samples to get valuable data.

The more targeted we are, the more useful our data is. The downside is needing to find enough people in this targeted group: if you slice the population too many times, you might end up with too small a sample size. Beware the small sample size!

Who was right about the toilet paper position?

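Seth Wheeler, whose 1891 patent for the toilet paper roll is the one illustrated below. (Joseph Gayetty introduced commercial toilet paper in 1857, but he sold it as flat sheets rather than rolls.)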

The toilet paper patent submission with illustration clearly showing the toilet paper in the “over” position

If you thought the “under” was correct, you crapped out. It is decidedly “over,” much like this discussion.
