A/B Testing Experiment — A Udacity Course Project

Yuchen Zhou
Sep 19, 2020


Experiment Overview: Free Trial Screener

At the time of this experiment, Udacity courses had two options on the course overview page: “start free trial” and “access course materials.” If the student clicked “start free trial,” they would be asked to enter their credit card information, and they would then be enrolled in a free trial of the paid version of the course. After 14 days, they would automatically be charged unless they cancelled first. If the student clicked “access course materials,” they could view the videos and take the quizzes for free, but they would not receive coaching support or a verified certificate, and they could not submit their final project for feedback.

In the experiment, Udacity tested a change: if the student clicked “start free trial,” they were asked how much time they could devote to the course. If the student indicated five or more hours per week, they would be taken through the checkout process as usual. If they indicated fewer than five hours per week, a message would appear noting that Udacity courses usually require a more significant time commitment for successful completion, and suggesting that the student might prefer to access the course materials for free. At that point, the student could either continue enrolling in the free trial or access the course materials for free instead. The screenshot below shows what this change looked like.

Experiment Setup

Udacity’s primary aim is to improve the overall student experience and to improve coaches’ capacity to support students who are likely to complete the course.

Null Hypothesis: The screener does not significantly reduce early Udacity course cancellations.

Alternative Hypothesis: The screener reduces the number of frustrated students who leave the free trial because they lack the time, without significantly reducing the number of students who continue past the free trial and eventually complete the course.

Unit of Diversion (from Udacity): The unit of diversion is a cookie, although if the student enrolls in the free trial, they are tracked by user-id from that point forward. The same user-id cannot enroll in the free trial twice. For users that do not register, their user-id is not tracked in the experiment, even if they were signed in when they visited the course overview page.

Experimental Design

Metric Choice

  • Invariant metrics

Invariant metrics are the ones used for sanity checks and will remain invariant throughout the experiment.

Number of cookies: the number of unique cookies that view the course overview page.

Number of clicks: the number of unique cookies that click the “Start free trial” button (which happens before the free-trial screener is triggered).

Click-through-probability: the number of unique cookies that click the “Start free trial” button divided by the number of unique cookies that view the course overview page.

  • Evaluation Metrics

Evaluation metrics are the ones we care about: the metrics whose observed changes will drive the decision to launch the experiment.

Gross conversion: the number of user-ids to complete checkout and enroll in the free trial divided by the number of unique cookies to click the “Start free trial” button. (dmin = 0.01)

Retention: the number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of user-ids to complete checkout. (dmin = 0.01)

Net conversion: the number of user-ids to remain enrolled past the 14-day boundary (and thus make at least one payment) divided by the number of unique cookies to click the “Start free trial” button. (dmin = 0.0075)

The graph below shows the process.

Calculate the standard deviation

Udacity provided the baseline values for each metric (reconstructed here from the figures used throughout this post):

  • Unique cookies to view the course overview page per day: 40000
  • Unique cookies to click “Start free trial” per day: 3200
  • Enrollments per day: 660
  • Click-through-probability on “Start free trial”: 0.08
  • Probability of enrolling, given a click (gross conversion): 0.20625
  • Probability of payment, given enrollment (retention): 0.53
  • Probability of payment, given a click (net conversion): 0.109313

For each metric selected as an evaluation metric, we calculate the standard deviation analytically, given a sample size of 5000 cookies visiting the course overview page. For a probability metric, the standard deviation is sqrt(p(1 - p)/N), where N is the number of units of analysis in the sample.

Gross Conversion (unit: clicks): N = 5000 * 0.08 = 400, p = 0.20625, standard deviation = 0.0202

Retention (unit: enrollments): N = 5000 * (660/40000) = 82.5, p = 0.53, standard deviation = 0.0549

Net Conversion (unit: clicks): N = 5000 * 0.08 = 400, p = 0.109313, standard deviation = 0.0156
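The three standard deviations above can be reproduced with a few lines of Python. This is a minimal sketch using the baseline values quoted in this post; the variable names are my own.

```python
import math

# Baseline values quoted in this post
CTP = 0.08                 # click-through probability on the overview page
GROSS_CONV = 0.20625       # enrollments / clicks
RETENTION = 0.53           # payments / enrollments
NET_CONV = 0.109313        # payments / clicks
ENROLL_RATE = 660 / 40000  # enrollments per pageview

def binomial_sd(p, n):
    """Analytic standard deviation of a sample proportion."""
    return math.sqrt(p * (1 - p) / n)

sample = 5000  # cookies viewing the course overview page
print(round(binomial_sd(GROSS_CONV, sample * CTP), 4))         # 0.0202
print(round(binomial_sd(RETENTION, sample * ENROLL_RATE), 4))  # 0.0549
print(round(binomial_sd(NET_CONV, sample * CTP), 4))           # 0.0156
```

Note that the unit of analysis differs per metric: Gross and Net Conversion are measured per click, while Retention is measured per enrollment, which is why its effective N is only 82.5.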

Determination of the sample size

Next, calculate the required sample size for each evaluation metric, using alpha = 0.05 and beta = 0.2.

Using an online sample-size calculator, here are the required sample sizes, in each metric’s unit of analysis:

Clicks for Gross Conversion = 25835

Enrollments for Retention = 39115

Clicks for Net Conversion = 27413

Since the experiment uses two groups, control and experiment, the pageviews required for each metric are:

Gross Conversion = (25835*2)/0.08 = 645875

Retention = (39115*2)/(660/40000) = 4741212

Net Conversion = (27413*2)/0.08 = 685325

In order to test for all three metrics, we would require the maximum number of pageviews, 4741212.
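The conversion from per-metric sample sizes to pageviews can be sketched as follows. The helper name is mine; the rates are the baseline values used above.

```python
def pageviews_required(units, rate_per_pageview):
    # Two groups (control and experiment), each needing `units` observations,
    # divided by the rate at which a pageview becomes one unit of analysis.
    return round(2 * units / rate_per_pageview)

CTP = 0.08                 # clicks per pageview (baseline)
ENROLL_RATE = 660 / 40000  # enrollments per pageview (baseline)

print(pageviews_required(25835, CTP))          # 645875  (Gross Conversion)
print(pageviews_required(39115, ENROLL_RATE))  # 4741212 (Retention)
print(pageviews_required(27413, CTP))          # 685325  (Net Conversion)
```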

Choosing experiment duration and exposure

Two questions determine this step: what percentage of Udacity’s traffic should be diverted to this experiment (assuming no other experiments run simultaneously), and is the change risky enough that we would not want to run it on all traffic?

Given the chosen percentage, how long would the experiment take to run, using the analytic estimates of variance? If the answer is longer than a few weeks, that is unreasonably long, and we should reconsider an earlier decision.

Based on the number of pageviews required for Retention calculated above, even if we divert all 40000 daily pageviews to the experiment, it would take about 119 days to run, which is far too long. In addition, payments are made 14 days after enrollment, so the experiment must run for at least 14 days regardless.

Since 119 days is too long, we drop Retention, reconsider the 4741212-pageview requirement, and use 685325 pageviews instead. Testing only Gross Conversion and Net Conversion with 100% of traffic, the experiment would take about 18 days. That is short enough to leave a safety margin: since diverting all traffic to a change that touches checkout is risky, we can instead divert 80% of traffic each day, and the experiment would last about 22 days.
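The duration arithmetic above is just a division rounded up to whole days; a sketch, assuming the 40000-pageviews-per-day baseline:

```python
import math

DAILY_PAGEVIEWS = 40000  # baseline unique cookies per day

def duration_days(required_pageviews, traffic_fraction=1.0):
    # Round up: a partial day still costs a full day of data collection.
    return math.ceil(required_pageviews / (DAILY_PAGEVIEWS * traffic_fraction))

print(duration_days(4741212))      # 119 days: too long, so Retention is dropped
print(duration_days(685325))       # 18 days at 100% of traffic
print(duration_days(685325, 0.8))  # 22 days at 80% of traffic
```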

Experiment Analysis

Sanity checks

To check whether the invariant metrics are equivalent between the two groups, we conduct sanity checks. We expect the cookies assigned to the control group and the experiment group to each account for 50% of the total. The data for analysis is here.

To perform sanity checks for Pageviews and Clicks, we first calculate the fraction of each total that landed in the control group.

The standard error for a binomial distribution with expected probability 0.5 is:

SE = sqrt(0.5 * 0.5 / (N_control + N_experiment))

From this we get the upper and lower bounds of the 95% confidence interval around 0.5, and the observed fraction for both Pageviews and Clicks falls within this interval. Therefore both metrics pass the sanity check.
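The 50/50 split check can be sketched as a small function. The counts in the usage line are hypothetical, purely for illustration; the real daily counts are in the linked data.

```python
import math

def split_sanity_check(n_control, n_experiment, z=1.96):
    """95% CI around the expected 0.5 split, and whether the observed
    control-group fraction falls inside it."""
    total = n_control + n_experiment
    se = math.sqrt(0.5 * 0.5 / total)
    lower, upper = 0.5 - z * se, 0.5 + z * se
    observed = n_control / total
    return round(lower, 4), round(upper, 4), lower <= observed <= upper

# Hypothetical counts for illustration:
print(split_sanity_check(5000, 5050))
```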

To perform the sanity check for the Click-Through Probability, we expect the difference between the two groups to be zero.

The pooled standard error for the difference of two proportions is:

SE_pool = sqrt(p_pool * (1 - p_pool) * (1/N_control + 1/N_experiment)), where p_pool = (X_control + X_experiment) / (N_control + N_experiment)

After the calculation, the observed difference falls within the confidence interval around zero, so the Click-Through Probability also passes the sanity check.
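A sketch of that two-proportion check, using the pooled standard error defined above (function and argument names are mine):

```python
import math

def diff_sanity_check(x_cont, n_cont, x_exp, n_exp, z=1.96):
    """Check that the experiment-minus-control difference in proportions
    lies within a 95% CI around zero, using the pooled standard error."""
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    d = x_exp / n_exp - x_cont / n_cont
    return d, (-z * se, z * se), abs(d) <= z * se

# Hypothetical click counts for illustration:
print(diff_sanity_check(400, 5000, 410, 5000))
```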

Effect size tests

For the evaluation metrics, we calculate the 95% confidence interval for the difference between the experiment and control groups, then check whether each metric is statistically and practically significant.

A metric is statistically significant if the confidence interval does not include 0 (that is, you can be confident there was a change), and it is practically significant if the confidence interval does not include the practical significance boundary (that is, you can be confident there is a change that matters to the business.)

Previously, we had chosen the Gross Conversion and Net Conversion as our final evaluation metrics to measure.

d = probability for experiment - probability for control

For Net Conversion, the difference between the control and experiment groups is not significant: the confidence interval includes both 0 and the negative practical significance boundary (-dmin). For Gross Conversion, the confidence interval includes neither 0 nor the practical significance boundary, so the change (a decrease) is both statistically and practically significant.
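The full effect-size test can be sketched as one function combining the pooled standard error with both significance checks. Names are mine; the dataset counts themselves are in the linked data.

```python
import math

def effect_size_test(x_cont, n_cont, x_exp, n_exp, d_min, z=1.96):
    """95% CI for d = p_experiment - p_control, with statistical and
    practical significance flags."""
    p_pool = (x_cont + x_exp) / (n_cont + n_exp)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_cont + 1 / n_exp))
    d = x_exp / n_exp - x_cont / n_cont
    lower, upper = d - z * se, d + z * se
    stat_sig = not (lower <= 0 <= upper)
    # Practically significant only if the CI also excludes both +/- d_min:
    prac_sig = (stat_sig and not (lower <= d_min <= upper)
                and not (lower <= -d_min <= upper))
    return d, (lower, upper), stat_sig, prac_sig
```

For example, a metric dropping from 10% to 8% over 5000 clicks per group with d_min = 0.01 comes out statistically significant (the CI excludes 0) but not practically significant (the CI still contains -0.01).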

Sign Test

The sign test checks whether the day-by-day signs of the difference between the experiment and control groups agree with the confidence intervals computed above. I used an online sign-test calculator and got these results.

For Gross Conversion, the experiment group’s rate is higher than the control group’s on 4 of the 23 days, p-value = 0.0026: significant.

For Net Conversion, the experiment group’s rate is higher than the control group’s on 10 of the 23 days, p-value = 0.6776: not significant.
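The online calculator’s p-values can be reproduced with an exact two-sided binomial test under the null that each day is a fair coin flip. A minimal sketch in the standard library:

```python
from math import comb

def sign_test_p(successes, trials):
    """Exact two-sided sign-test p-value: Binomial(trials, 0.5)."""
    k = min(successes, trials - successes)
    tail = sum(comb(trials, i) for i in range(k + 1))
    return min(1.0, 2 * tail / 2 ** trials)

# 23 daily experiment-vs-control comparisons:
print(round(sign_test_p(4, 23), 4))   # 0.0026 -> Gross Conversion, significant
print(round(sign_test_p(10, 23), 4))  # 0.6776 -> Net Conversion, not significant
```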

Recommendations

This experiment was designed to learn whether the screener filters out students who cannot commit enough study time, without reducing the number of students who go on to pay after completing their free trial. Our results show that Gross Conversion drops significantly, while Net Conversion shows no significant change; moreover, the confidence interval for Net Conversion includes the negative practical significance boundary, so we cannot rule out a meaningful drop in payments. The screener does reduce enrollments, but there is no evidence that it preserves, let alone increases, the number of students who pay. I would not recommend launching this screener.
