A/B testing & Internshala’s product culture

Venkatesh Gupta · Published in Internshala Tech · Jul 18, 2021 · 11 min read

Recently, we ran an A/B experiment to improve the processing rate of applications on our Application Tracking System (ATS).

The experiment moved the desired local metrics in the right direction but hurt our North Star metric. It was a really interesting conundrum.

As you dive into this blog post, you will learn about A/B testing in depth and about our tryst with the design revamp. I am sure you will love it.

What will you learn?

  1. What is A/B testing?
  2. How long should you run an A/B test?
  3. What do we mean by statistically significant results?
  4. What is p-value?
  5. What is null hypothesis?
  6. What are the pitfalls of A/B test?

1. What is A/B testing?

At face value, A/B testing sounds extremely simple!

Suppose you have created a personal branding course and plan to sell it for ₹99. You are excited and have built a registration page for it, with a green sign-up button. Somehow, you came across an article that said: Bro! Blue evokes trust & brings more conversions.

You want to try blue! But you aren't sure either. What if it makes fewer people sign up now?

You want to play it safe! So, you launch two versions of the same registration page: one with a green button and the other with a blue button.

You show one half of your users the green sign-up button and the other half the blue sign-up button. Boom! That's an A/B test.

The button colour that gets more sign-ups is the winner. That's the button colour you would want to go with eventually!

1.1 What’s next?

Note: When we run the test, the group of users who see the old green sign-up button as usual is called the control group, whereas the group of users who see the new blue sign-up button is called the treatment group.

But what if the registration page with the blue sign-up button is bringing more registrations simply because it happens to be shown to more users who are genuinely interested in learning personal branding? Quite possible, isn't it?

Therefore, it’s important to split your traffic randomly to avoid sampling bias.

Avoiding sampling bias is simple! You just need to ensure that any user visiting your registration page has an equal chance of seeing either the blue or the green sign-up button.

1.2 How to split your traffic randomly in an A/B test?

  • If you are conducting an A/B test for your logged-in users, you can split the traffic on the basis of user_id: if the user_id is an even number, show them the existing design; otherwise, show the new design.
  • If you are conducting an A/B test for new users (i.e. visitors to your registration page who aren't logged in), your developer can set a cookie in the user's browser, and the cookie can randomly decide which variant to show to that user. A sketch of both approaches follows this list.
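A minimal sketch of both splits in Python; the function names and cookie name are hypothetical, and in practice the cookie handling would go through your web framework:

```python
import random

def variant_for_logged_in_user(user_id: int) -> str:
    """Deterministic 50/50 split for logged-in users based on user_id parity."""
    return "control" if user_id % 2 == 0 else "treatment"

def variant_for_new_visitor(cookies: dict) -> str:
    """Random 50/50 split for anonymous visitors, remembered via a cookie
    so the same visitor keeps seeing the same variant on every visit."""
    if "ab_signup_button" not in cookies:          # hypothetical cookie name
        cookies["ab_signup_button"] = random.choice(["control", "treatment"])
    return cookies["ab_signup_button"]

# Usage:
print(variant_for_logged_in_user(1042))   # "control" (even user_id)
print(variant_for_new_visitor({}))        # "control" or "treatment", at random
```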

Now, let’s move ahead to the next big question — How long should you run an A/B test?

2. How long should you run an A/B test?

This is quite an irritating question for data scientists. The product team always wants the results fast and they keep on asking — how long should we run the A/B experiment?

We always run an A/B test until we get a statistically significant result.

Wait! What on earth is a statistically significant result?

Does it mean we should run the test until we see a large improvement? (e.g. the blue button gets 10% more sign-ups than the green button)

After all, the literal meaning of significant is big/large/huge.

2.1 What do we mean by statistically significant results?

Reading the Wikipedia definition of statistical significance can give you nightmares :p (maybe you haven't read anything more complex in a while!)

Let’s make it simple!

At Internshala, we revamped the design of our Application Tracking System (ATS)! If you don't know what an ATS is, think of it as the place where all job seekers' resumes are collected and managed.

I won't go into the details of the design changes we made, but the idea was to improve the processing rate (i.e. more applications should be hired, rejected, or shortlisted out of the total applications received).

* Processing rate: if 10 applications were hired, rejected, or shortlisted out of a total of 100 applications, the processing rate is 10%.

Let's suppose we ran three different experiments simultaneously. Can you identify which of the test results are statistically significant?

Guess, please!

  • Doesn't common wisdom say that Experiment C definitely has a statistically significant result? A 205.56% improvement over the base rate!
  • Experiment A can also have a statistically significant result, as the change over the base rate isn't bad. It's 6.64%. Right?
  • Experiment B can never have a statistically significant result, as the change is just 3% more than the old design. Right?

After all, significant means big/large/huge.

2.2 What statistically significant difference isn’t?

Damn! Experiment C (as expected) and Experiment B (what 😳) have shown statistically significant results.

It's so important to remember that:

Statistically significant doesn’t mean large/big/significant change.

We can’t say anything about statistical significance by just looking at the absolute change or change over the base rate!

2.3 Then what do statistically significant results mean?

This may sound absolutely bizarre to understand and comprehend at this point in time but please bear with me!

A change/result is said to be statistically significant when the p-value of the test (experiment) is less than or equal to the significance level.

This definition uses two terms which we are completely unaware of at the moment:

  1. p-value
  2. significance level

Let’s understand both, to eventually get hold of the meaning of the term statistically significant results.

3. What is p-value?

To understand the p-value, let's take the example of the typing zone of an iPad: some areas of the screen are far more likely to be tapped than others.

Looking at this analogy, the p-value seems like a probability, right?

What is my probability of tapping at a particular area?

What is the probability of an occurrence of an event?

Cutting it short, statisticians define the p-value as the probability of seeing a result at least as extreme as the one observed, assuming the null hypothesis is true.

But what is the null hypothesis?

4. What is null hypothesis?

The null hypothesis says: everything is the same, everything is constant.

But isn't that different from what we product managers think? We think every new feature is a #masterstroke, and hence the null hypothesis seems really disappointing to us.

We would always want to cancel/reject the null hypothesis.

  • PM’s hypothesis:- New design of ATS will impact the processing rate.
  • Null hypothesis:- New design of ATS will have no impact on the processing rate. (Everything will be the same. Everything will be constant)

Let us remember our old definition of p-value now.

The p-value is the probability of seeing a result at least this extreme if the null hypothesis were true.

Remember this!

4.1 How do we find p-values?

There are various statistical tests to compute p-values. (Feel free to skim this if it feels like information overload at this point; a small worked sketch follows the list.)

Some of these tests are:-

  • T-test (comparison of mean)
  • F-test (comparison of variance)
  • ANOVA (analysis of variance)
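Since our ATS metric is a proportion (applications processed out of applications received), one common option, used here purely as an illustration, is a two-proportion z-test. A minimal sketch with made-up counts (all numbers are hypothetical):

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_z_test(processed_a, total_a, processed_b, total_b):
    """Two-sided z-test comparing two processing rates (proportions)."""
    p_a = processed_a / total_a                       # control processing rate
    p_b = processed_b / total_b                       # treatment processing rate
    p_pool = (processed_a + processed_b) / (total_a + total_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / se                              # standardised difference
    p_value = 2 * norm.sf(abs(z))                     # two-sided p-value
    return z, p_value

# Hypothetical data: old ATS processed 400 of 4000 applications (10%),
# the redesigned ATS processed 460 of 4000 (11.5%)
z, p = two_proportion_z_test(400, 4000, 460, 4000)
print(f"z = {z:.2f}, p-value = {p:.3f}")   # p comes out around 0.03
```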

4.2 What does p-value mean?

Remember our definition of the null hypothesis -

  • The new design of the ATS will have no impact on the processing rate.

Note: If the p-value is 0.3, it means that if the null hypothesis were true and we repeated this experiment 100 times, roughly 30 out of 100 times we would see a change in the processing rate at least this large purely by chance.

In other words, a result like ours is quite easy to get even when the new design is doing nothing.

Now, we fully understand what p-value means:-

The p-value is the probability of seeing a result at least this extreme if the null hypothesis were true.

Remember:- If the p-value is high, our data looks very much like what the null hypothesis predicts, i.e. our change isn't causing any impact; everything is the same, everything is constant.

We would always want a low p-value for our A/B experiments.

5. Is the test result significant?

We are back to our original question. How do we know whether our experiment result is significant or not?

Let’s recap:-

  1. We wanted the ATS with the new design to have a better processing rate.
  2. The null hypothesis said that there would be no change in the processing rate, no matter what the design is.
  3. Then we found a p-value: how likely is a change this large if the null hypothesis were true?
  4. Say the p-value is 0.03 (found through one of the statistical tests above); it means that if the null hypothesis were true and we repeated this experiment 100 times, only about 3 times would we see a change this large purely by chance.

Remember, p-value = 0.03

Let’s introduce a new term now — significance level (α).

The significance level (α) is the percentage of risk we are willing to take while rejecting the null hypothesis.

Normally, we take α = 0.05. i.e. we are comfortable with a 5% risk of being wrong while rejecting the null hypothesis.

5% risk of being wrong means we are 95% confident.

We call this a 95% confidence level.

p-value (0.03) < significance level (0.05)

If p-value <= significance level, we call the result statistically significant.
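As a tiny sketch of this decision rule (the p-value here would come from a test like the one sketched earlier; 0.05 is just the conventional significance level):

```python
ALPHA = 0.05      # significance level: the risk we accept of wrongly rejecting the null

p_value = 0.03    # e.g. the output of the two-proportion z-test above

if p_value <= ALPHA:
    print("Statistically significant: reject the null hypothesis.")
else:
    print("Not statistically significant: we cannot reject the null hypothesis.")
```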

While Product Managers rarely need to crunch the statistics behind A/B testing themselves, curiosity should never die.

You can play around with A/B test calculations (How long should you run the test? When will the results be significant?) here: https://abtestguide.com/calc/

Let’s see the pitfalls of the A/B test now.

6. Pitfalls of A/B test

1. A/B test can take forever

For an A/B test, you need a lot of users. And even when you have a lot of users, it takes a long time for the North Star metric to move (if your experiment is a major one that affects the North Star metric).

You then have to spend some time finding correlated metrics. And sadly, if you can't find correlated metrics, the A/B test can take forever, or you may never be able to measure the impact on the North Star metric.

Think of the North Star metric as the single most important metric for a product. Looking at this one metric, you can gauge the overall health of the product in a second.

  • Spotify = ‘Time spent listening’
  • Airbnb = ‘Number of nights booked’

2. A/B test involves too much cognitive work

An A/B test only tells you what happened, not why!

An A/B test isn't a panacea for all product development issues. PMs have to spend a lot of time getting the hypothesis (the whys) right. If you don't define a hypothesis about why a certain change should work, an A/B test won't come to your rescue. It only tells you what the users did.

2.1 Correlation does not imply causation

A famous study found that when ice-cream sales increase in New York, the homicide rate (people killing one another) increases too. Strange! 😮

Although ice-cream sales are positively correlated with homicide rates in New York, eating ice cream can't make people kill one another. It isn't a sci-fi world, yet!

Later, it was found that in summer more people step out of their homes, so both ice-cream sales and violent crime rise compared to winter in New York.

We can see that two variables, ice-cream sales and the homicide rate, seem related, but it's a third variable (the weather) that is driving both.

A/B tests are notorious for this:-

  • The experiment can cause a positive change in the immediate metrics.
  • It can also cause a net improvement in the correlated metrics but adversely impact the North Star metric.

At Internshala, we ran an experiment for 4 months with exactly this dilemma but could never figure out what was causing the adverse impact on the North Star metric, even though the immediate and correlated metrics were positively impacted. Sigh!

3. Marginal improvements never portray the right picture

Even if you get statistically significant results, a marginal (1–5%) improvement over the base rate can be a fluke. There can be many other factors, like the novelty effect, causing the change rather than what the experiment intended.

Novelty effect

Users react dramatically whenever a change is made. Since humans love the status quo and resist change, they may shun the feature for the time being, or they may over-react (exploring everything that's new), and both can cause abrupt swings in the metrics.

A rule of thumb is to discount the first week of data whenever the treatment experience differs massively from the control experience.
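A minimal sketch of that rule of thumb, assuming the experiment events live in a pandas DataFrame with a date column (the column names, dates, and data are hypothetical):

```python
import pandas as pd

# Hypothetical event log: one row per application processed during the experiment
events = pd.DataFrame({
    "date": pd.to_datetime(["2021-05-02", "2021-05-04", "2021-05-15", "2021-05-20"]),
    "variant": ["treatment", "control", "treatment", "control"],
    "processed": [1, 0, 1, 1],
})

experiment_start = pd.Timestamp("2021-05-01")

# Discount the novelty effect: drop the first 7 days of the experiment
steady_state = events[events["date"] >= experiment_start + pd.Timedelta(days=7)]
print(steady_state)
```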

4. Noise

Noise can creep in because:

  • The users may not be divided uniformly; a particular set of users might skew the entire result.
  • Outliers (they can ruin anything).

5. There’s so much work & re-work

An A/B test involves a lot of work whether you win or lose.

If you win, you will have to clean up the code & roll it out to 100% of your users.

And if your experiment fails (dooming your hypothesis), you need to clean up the code again & remove all the changes you had made.

The real work in an A/B test begins after the experiment ends.

Users are strange. It's hard to find out what caused the change. If the metrics didn't move, or moved adversely, what caused that?

To sum up, the real magic of an A/B test isn't in building it; it's in digging deep & figuring out what caused the change.

Rapid experimentation is a core tenet of product culture at Internshala. Do check out our careers page if you are passionate about solving hard problems in HR-tech & Ed-tech!
