Product Experimentation Analysis

Yadu Sarathchandran
10 min read · Aug 8, 2022


Coming from a scientific background, experimentation in a business setting was initially confusing to me. As a ‘recovering’ physicist, the mental picture that appears when I hear “experiment” is always that of a scientific laboratory, where scientists wearing gloves add chemicals to containers and yell ‘Eureka!’ (this only happens rarely!). But what experiments are carried out at a technology company? I was curious. And what are these A/B tests that I hear so much about from data blogs and from my friends working in technology? How are they running those tests? It all started to make sense when I found out that an A/B test is simply a Randomized Controlled Trial (RCT), the gold standard for evaluating the effectiveness of interventions, ta-da! I knew about RCTs from my statistics courses and had used them to analyze data from my own research. But still, how does a technology company use experimentation?

Experimentation (in a business setting) is a process by which we make iterative changes in a product and infer the effects of interventions.

That’s it! However, definitions are not enough, so let me give you a walk-through of the analysis of a product experiment run at an advertising company. Let us look at how the experiment is designed and analyze its effects.

Background

Our advertising company provides a platform where businesses can create advertising campaigns to increase awareness of their brands or to drive adoption of their products or services. Currently, the company offers an advertising product where advertisers pay every time a user clicks on their ad. Each campaign has a budget (the amount of money the advertiser is willing to spend during a period of time). An advertiser never has to pay more than their initially allocated budget, so if our company were to spend more than the campaign’s budget, we would not be able to bill the advertiser for the additional spend. This is called overspending.

In reality, it is difficult to avoid overspending because there is latency between when we serve ads to users and when they actually click on those ads. Since the company only charges advertisers for ‘clicks’ on their ads, the charges can enter the system after some (random) delay. Lately, the company has noticed an increase in overspend on the platform. To reduce it, the company decided to create a new product where advertisers pay every time their ad appears in a user’s viewport rather than every time it is clicked. Presumably, these engagements would arrive with lower latency.

How do we test whether this creates a positive product change? You guessed it: we run an A/B test!

Hypothesis

It is essential to form our hypothesis before we design the experiment and analyze the data to find the effect of interventions.

By introducing the new ad product with lower latency, the company can reduce overspending, and thus increase revenue by efficient reallocation of advertising resources.

We randomly split the advertisers on the platform: half remain on the old product and the other half receive the new product (a sketch of this split is shown below). A week later, we collect the data and want to determine whether or not the experiment was a success. Let us look at the data schema and start the analysis.
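A minimal sketch of what such a 50/50 randomization could look like, assuming a hypothetical advertisers table with one row per advertiser (the column names are illustrative, not from the original analysis):

```python
import numpy as np
import pandas as pd

# Hypothetical advertiser table: one row per advertiser
advertisers = pd.DataFrame({"advertiser_id": range(14_000)})

# Randomly assign roughly half of the advertisers to the treatment group
rng = np.random.default_rng(seed=42)  # fixed seed for reproducibility
advertisers["treatment"] = rng.random(len(advertisers)) < 0.5
```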

Data Schema

Assumptions

  1. The goal of the new ad product is to reduce overspending, thereby increasing the revenue of the company and bringing it up to industry standards.
  2. Each company runs only a single ad campaign. This makes sure that we have no interactions between the groups. In other words, the treatments are independent.

Exploratory Analysis

The dataset consists of 4 features:

  1. Group — control/treatment
  2. Company size — small/medium/large
  3. Campaign spend — positive dollar value
  4. Campaign budget — positive dollar value

We create three new features (a short pandas sketch follows the list):

  1. Overspend = Spend − Budget (positive or negative dollar value)
  2. % Overspend = (Overspend / Budget) × 100 (positive or negative value)
  3. Revenue (from the spending and budget values)
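A minimal sketch of these derived features, assuming a DataFrame df with spend, budget, company_size, and a boolean treatment column. Treating revenue as spend capped at the budget is an assumption on my part, since the overspend cannot be billed:

```python
# df: one row per campaign, with columns treatment, company_size, spend, budget
df["overspend"] = df["spend"] - df["budget"]
df["pct_overspend"] = df["overspend"] / df["budget"] * 100

# Assumption: revenue is the billable part of spend, capped at the budget,
# since the advertiser never pays for the overspend.
df["revenue"] = df[["spend", "budget"]].min(axis=1)
```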

The feature “Percentage Overspend” is our goal metric: scaling by budget gives a measure that can be compared across segments (company sizes). Small companies make up the bulk of the advertisers on the platform, followed by large companies, with medium companies a distant third (Fig. 1).

Fig. 1 Advertisers are segmented based on their company sizes in control and treatment groups (data expressed in percentages).

This, however, doesn’t mean that revenue follows the same trend, because small companies have lower budgets. A simple aggregation, the median percentage overspend per segment, shows that overspending decreases considerably for both small and large companies but increases slightly for medium companies (Fig. 2).

Fig. 2 The median percentage overspend for all company sizes in control (False) and treatment (True) groups.
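The aggregation behind Fig. 2 might look like the following sketch, using the df and pct_overspend columns assumed above:

```python
# Median % overspend per group and company size (cf. Fig. 2)
summary = (
    df.groupby(["treatment", "company_size"])["pct_overspend"]
      .median()
      .unstack("company_size")
)
print(summary)
```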

Experimental Analysis

We may use a two-sample, one-sided t-test to see whether the percentage overspend decreases in the treatment group, which is how we define the success of our experiment.

This requires a few assumptions of our data:

  1. Independence — Taken care of when the experiment is designed
  2. Random Sampling — Taken care of when the experiment is designed
  3. Normality — Are the samples normal?

As Nassim Taleb reminds us in The Black Swan, assuming a Gaussian distribution for any population can end in disaster when rare events arrive unannounced, so checking this assumption is an important step in our experiments. We do not expect repercussions of that scale in our analysis; nevertheless, it is good practice to check the normality of our samples. A histogram of the two groups (Fig. 3) signals that the data are not normal. We can confirm this using the Shapiro-Wilk test on both groups; we choose not to segment the data, as the segments exhibit similar skewness.

Fig. 3 The distribution of data in control and treatment groups across the dimensions.

The null hypothesis (H0): The population is normally distributed

We choose a significance threshold of 0.05 for rejecting the null hypothesis, corresponding to a 95% confidence level in our decision-making.

Fig. 4 Running the Shapiro-Wilk tests for control and treatment groups using Scipy library.
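The figure shows the test being run with SciPy; a sketch of that check, using the columns assumed earlier, might look like this:

```python
from scipy import stats

control = df.loc[~df["treatment"], "pct_overspend"]
treated = df.loc[df["treatment"], "pct_overspend"]

# Shapiro-Wilk: H0 = the sample was drawn from a normal distribution
# (SciPy notes the p-value may be approximate for n > 5000, but the
# conclusion here is unambiguous)
for name, sample in [("control", control), ("treatment", treated)]:
    stat, p_value = stats.shapiro(sample)
    print(f"{name}: W = {stat:.3f}, p = {p_value:.2e}")
```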

The p-value is practically zero for both samples, so we reject the null hypothesis: the samples are not normally distributed. Now, let’s look at the next assumption.

4. Homogeneity of Variances — Do the samples have equal variances?

Since the samples are not normally distributed, we cannot use tests that assume normal populations, such as Bartlett’s test, to compare the variances of the two samples. Thus we choose Levene’s test:

Levene’s test checks whether two samples come from populations with equal variances, and it remains valid for non-normal distributions.

H0 (Null Hypothesis): The population variances are equal

H1 (Alternative Hypothesis): The population variances are not equal

We again use a p-value threshold of 0.05 to reject the null hypothesis.

Levene’s test returns a negligible p-value, so we reject the null hypothesis that the population variances are equal (Fig. 5).

Fig. 5 Running Levene’s test to check the variances of two classes.
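A corresponding sketch of the Levene check (the median-centred variant, which is robust to skew), again assuming the df used above:

```python
from scipy import stats

control = df.loc[~df["treatment"], "pct_overspend"]
treated = df.loc[df["treatment"], "pct_overspend"]

# Levene's test (median-centred): H0 = the population variances are equal
stat, p_value = stats.levene(control, treated, center="median")
print(f"Levene: W = {stat:.3f}, p = {p_value:.2e}")
```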

Moving forward, there are three ways we can approach the experimental analysis:

1. Proceed with a non-parametric test that does not assume the data fit a specific distribution. A common choice is the Mann-Whitney U test, often preferred for smaller samples.

2. Given the sheer size of the samples (7,000 observations in each group), the Central Limit Theorem tells us that the sampling distribution of the mean is approximately normal, so we can use Welch’s two-sample t-test (which accounts for the unequal variances).

3. Transform the samples (square-root or log, given the right skew) so that they better resemble a normal distribution, then use the t-test.

I chose to proceed with the two-sample, one-tailed Welch’s t-test for our analysis (due to the large sample size and unequal variances). The Mann-Whitney U Test may also be performed.

Does the new product reduce overspending?

Null Hypothesis (H0): The percentage overspend in the treatment group is equal to or higher than in the control group.

Alternative Hypothesis (H1): The percentage overspend is lower in the treatment group.

Welch’s t-test is performed on the % overspend for each company segment and on aggregate. The effectiveness of the treatment can be inferred from both the p-value and the test statistic, as seen in Fig. 6. We can safely say that, on aggregate, the new product is effective in reducing overspend. The same holds for large and small companies separately. However, it failed to reduce overspending for medium companies.

Fig. 6 The result of the hypothesis test on the % overspend variable using Welch’s t-test.
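A sketch of how this test can be run with SciPy (the alternative="less" keyword requires SciPy 1.6 or later), again assuming the df used above:

```python
from scipy import stats

# Welch's t-test, one-tailed: H1 = % overspend is lower in treatment than in control
def welch_one_tailed(treated_vals, control_vals):
    return stats.ttest_ind(treated_vals, control_vals,
                           equal_var=False, alternative="less")

# Aggregate result
print("all:", welch_one_tailed(df.loc[df["treatment"], "pct_overspend"],
                               df.loc[~df["treatment"], "pct_overspend"]))

# Per company-size segment (cf. Fig. 6)
for size, seg in df.groupby("company_size"):
    print(size, welch_one_tailed(seg.loc[seg["treatment"], "pct_overspend"],
                                 seg.loc[~seg["treatment"], "pct_overspend"]))
```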

Does the new product affect the campaign budgets?

Campaign budget is a good guardrail metric for catching potentially misleading or erroneous results. We check the campaign budget of each group to see how the new product affects budgets across the different company sizes and on aggregate. A preliminary aggregation using the median (Fig. 7) suggests a decrease in budgets, but this needs to be checked rigorously.

Fig. 7 The changes in the budget for both control (False) and Treatment (True) groups.

Running the normality and variance tests again, we find that the same conditions (non-normality, unequal variances, large samples) hold for this variable as in the previous analysis. We proceed to apply Welch’s t-test to the budget.

Null hypothesis: The budget in the treatment group is greater than or equal to that of the control group.

Alternative hypothesis: The budget is lower in the treatment group.

For the experiment to be a success, we hope the null hypothesis is not rejected here.

We see that certain advertisers in the treatment group, namely the small companies, enter lower budgets, apparently wary of the new product (Fig. 8). The reduction in small companies’ budgets is unlikely to be due to random fluctuation, as evidenced by the small p-value of 0.007. However, the new product has a neutral effect on medium and large companies’ budgets, and on aggregate the treatment group’s budget is not lower than the control group’s.

Fig. 8 The Welch’s t-test performed for campaign budget across all groups, p-value and test statistic shown.
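The same one-tailed Welch’s test, applied to the budget column, gives a sketch of the per-segment guardrail check (cf. Fig. 8):

```python
from scipy import stats

# Guardrail check (cf. Fig. 8): is the campaign budget lower in the treatment group?
for size, seg in df.groupby("company_size"):
    res = stats.ttest_ind(seg.loc[seg["treatment"], "budget"],
                          seg.loc[~seg["treatment"], "budget"],
                          equal_var=False, alternative="less")
    print(f"{size}: t = {res.statistic:.2f}, p = {res.pvalue:.3f}")
```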

Uplift in Budget or Revenue

To make sure that this ‘new-product phobia’ among small companies does not hurt the total budget and revenue, we compare the total budget and revenue of the two groups.

Total budget of the control group = $ 35.8 million

Total budget of the treatment group = $ 53.4 million

Fig. 9 The total budget in the control and treatment groups across the segments.
Fig. 10 The total revenue in the control and treatment groups across the segments.

Total revenue from the control group = $ 28.1 million

Total revenue from the treatment group = $ 42.4 million

A 49% increase in budget for the new product.

A 51% increase in revenue from the new product.
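These uplift figures follow directly from the totals above:

```python
# Uplift implied by the totals above (figures in $ millions)
budget_control, budget_treatment = 35.8, 53.4
revenue_control, revenue_treatment = 28.1, 42.4

budget_uplift = (budget_treatment / budget_control - 1) * 100    # ~49%
revenue_uplift = (revenue_treatment / revenue_control - 1) * 100  # ~51%
print(f"Budget uplift: {budget_uplift:.0f}%  Revenue uplift: {revenue_uplift:.0f}%")
```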

Conclusion

We find that the new product, with its lower latency, is successful at reducing overspending for small and large companies, and on aggregate. However, it has a neutral effect on medium-sized companies. We also note that small companies are wary of the new product and have reduced their advertising budgets. The new product has not caused a reduction in budget for large and medium companies, nor on aggregate; in fact, large companies have aggressively increased their budgets. Since budget and revenue are closely correlated, a separate analysis of revenue is not necessary.

We have completed our experiment and quantified the effect of the intervention. But this isn’t over just yet; the value of experimentation comes from the recommendations we can provide based on our analysis. Let’s look at a few recommendations for the business stakeholders.

Recommendations

1. We recommend selling the new product to large companies because it was effective in reducing overspending, and their campaign budgets remained unaffected.

2. We recommend that the company keeps using the old product for medium-sized companies.

3. We recommend not selling the new product to small companies at this point in time. However, this decision needs longer experimentation in the small company segment. We may also reach out to them to alleviate their fear of the new product.

4. We may also run a paired t-test on the small companies, comparing their budgets before and after exposure to the new product, to see whether their behavior changes over time.

5. Re-testing is highly recommended. Even with a statistically significant result, there is a chance of a false positive; re-testing reduces that risk.

6. Additionally, it is important to understand that ‘overspending’ is not a loss of revenue but an inefficient allocation of resources, and reducing it is an avenue for better growth.

I hope you enjoyed reading and learned something new about RCTs in a product experimentation setting. Let me know your thoughts on how we can improve the analysis further.
