ChatGPT & Me — Part 2: Learning a new concept from scratch using ChatGPT in 1 hour & applying it

Daniel Riad
Published in ChatGPT & Me
14 min read · Apr 3, 2023

Learning the basics of A/B Testing from scratch

Motivation

As a Data Scientist by profession, I constantly have to keep my knowledge sharp on a variety of new topics, and I am always on the lookout for ways to make this process more efficient.

One topic I've heard about often but have no pre-existing knowledge of is A/B testing. That's simply because I haven't needed it in my work so far; it's more common among "product data scientists" and data scientists working in Marketing & Advertising in general. Nevertheless, I've never gotten around to digging deeper into it. Today, I'll attempt and log a live experiment here: learning as much as I can about A/B testing from absolute scratch, relying solely on ChatGPT, in less than an hour. Afterwards, I'll put my knowledge to the test and apply it practically.

Regardless of the outcome, I think this would give us a better insight on how effective ChatGPT could be in education, up-skilling, and learning new topics & skills from scratch.

TL;DR: Starting from scratch, I ask ChatGPT as many questions as possible about A/B testing in 1 hour, then analyze the results of an A/B test conducted on a mobile game, available on Kaggle. I managed to grasp the basic concepts of A/B testing and apply them successfully to the dataset.

Here's the documentation of my 1-hour learning session with ChatGPT.

Q1: Please explain in the most basic terms what A/B Testing is

A/B testing, in the most basic terms, is a way to compare two different versions of something to find out which one works better. You show version A to one group of people and version B to another group, then measure how well each version achieves a specific goal, like getting more people to click a button or buy a product. Based on the results, you can decide which version is more effective and use that one moving forward.

Q2: What are the typical use cases for A/B testing?

A/B testing is typically used to optimize and improve various aspects of digital products, marketing campaigns, and user experiences. Here are some common use cases for A/B testing:

Website design: Testing different layouts, colors, fonts, or navigation elements to improve user experience and increase engagement or conversions.

Landing pages: Comparing different headlines, images, copy, or call-to-action buttons to optimize conversion rates for sign-ups, purchases, or other desired actions.

Email marketing: Testing various subject lines, email content, images, or send times to enhance open rates, click-through rates, and overall campaign effectiveness.

Mobile app interfaces: Evaluating different designs, navigation structures, or features to improve user satisfaction, engagement, and retention.

Advertising: Comparing ad creatives, headlines, images, or targeting options to optimize ad performance, click-through rates, and return on ad spend.

Pricing strategies: Testing different pricing plans, promotional offers, or discount structures to maximize revenue and customer acquisition.

Product features: Assessing the impact of new features, functionality, or design changes on user engagement, satisfaction, and retention.

These use cases highlight how A/B testing can be employed across various domains to make data-driven decisions, optimize resources, and enhance overall performance.

Q3: What skills do I need to have to be able to design A/B experiments?

Understanding of the scientific method: A strong grasp of the scientific method and experimental design principles is essential to create controlled experiments, form hypotheses, and draw meaningful conclusions.

Statistical knowledge: Familiarity with basic statistical concepts such as significance levels, sample sizes, confidence intervals, and hypothesis testing helps you analyze results and make data-driven decisions.

Data analysis: Proficiency in analyzing data using tools like Excel, Google Sheets, or programming languages like Python or R is crucial for processing and interpreting experiment results.

Web analytics: Knowledge of web analytics tools like Google Analytics or Adobe Analytics is essential for tracking user behavior and measuring the impact of your A/B tests.

Technical skills: Basic HTML, CSS, and JavaScript skills can be useful for implementing and troubleshooting A/B tests, particularly if you’re working with web pages or applications.

Familiarity with testing tools: Experience with A/B testing tools such as Optimizely, VWO, or Google Optimize helps you design, execute, and analyze experiments more efficiently.

Problem-solving: The ability to identify issues, generate hypotheses, and devise creative solutions is vital for designing effective A/B tests.

Attention to detail: Meticulousness in designing, implementing, and analyzing experiments ensures accurate results and reduces the likelihood of errors.

Communication skills: Clear and concise communication is essential when sharing experiment goals, hypotheses, results, and recommendations with stakeholders.

Curiosity and adaptability: A curious mindset and willingness to learn and adapt to new tools, techniques, and best practices contribute to successful A/B testing.

Having these skills will enable you to design and conduct A/B experiments effectively, leading to data-driven insights and improved decision-making.

Q4: Can you give me an example problem where A/B testing would be useful, and how you as a "Data Scientist" would solve it?

Problem: An e-commerce company wants to increase the conversion rate on its product pages. They believe that changing the “Add to Cart” button’s color might have an impact on the number of users who click the button and make a purchase.

As a data scientist, here’s how I would approach solving this problem using A/B testing:

Define the objective: The primary goal is to increase the conversion rate, which is the percentage of users who click the “Add to Cart” button and make a purchase.

Form a hypothesis: For example, the hypothesis could be that changing the “Add to Cart” button color from green to red will result in a higher conversion rate.

Select key performance indicators (KPIs): The main KPI in this case is the conversion rate, but you might also track other metrics such as click-through rate or average order value to gain additional insights.

Choose the control and variant groups: The control group will see the existing green “Add to Cart” button, while the variant group will see the red “Add to Cart” button.

Calculate the required sample size: Determine the appropriate sample size for each group based on your desired statistical significance, power, and minimum detectable effect.

Randomly assign users: Randomly assign an equal number of users to the control and variant groups to ensure unbiased results.

Conduct the experiment: Run the A/B test for a predetermined period or until you’ve reached the required sample size. Make sure to avoid any changes to the website during the testing period to prevent confounding factors.

Analyze the results: Compare the conversion rates between the control and variant groups using statistical tests, such as a t-test or chi-square test, to determine if there’s a significant difference.

Draw conclusions: If the red “Add to Cart” button results in a significantly higher conversion rate, you might conclude that the hypothesis is supported and consider implementing the change on the website. If there’s no significant difference or the green button performs better, you might reconsider the hypothesis or test other factors that could impact the conversion rate.

Share findings and recommendations: Communicate the results of the A/B test to stakeholders and provide actionable insights and recommendations based on the findings.

Remember that A/B testing is an iterative process, and you might need to conduct multiple tests to optimize the conversion rate effectively. This example demonstrates how a data scientist would approach an A/B testing problem and use the results to inform data-driven decisions.
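
To make step 6 (random assignment) a bit more concrete for myself, here's a minimal sketch of how I'd split a hypothetical list of user IDs 50/50 into the two groups; the IDs and seed below are made up purely for illustration:

import numpy as np

rng = np.random.default_rng(seed=42)   # fixed seed so the split is reproducible
user_ids = np.arange(1, 20001)         # hypothetical 20,000 users

# Shuffle the IDs and split them 50/50 into control (green button) and variant (red button)
shuffled = rng.permutation(user_ids)
control_group, variant_group = np.split(shuffled, 2)

print(len(control_group), len(variant_group))  # 10000 10000

In practice the assignment would be done by the testing tool itself, but the idea is the same: every user has an equal chance of landing in either group.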

Q5: Could you elaborate more on step number 5? How can I decide on the sample size?

Deciding on the sample size for an A/B test is an essential step to ensure the validity and reliability of your results. To determine the appropriate sample size, you’ll need to consider three primary factors: statistical significance, statistical power, and minimum detectable effect.

Statistical significance: The significance level (alpha) is the probability of concluding there is a difference between the control and variant groups when none actually exists (a Type I error). Commonly used significance levels are 0.05 (5%) or 0.01 (1%), which correspond to a 95% or 99% confidence level, respectively.

Statistical power: This is the probability of correctly rejecting the null hypothesis when it’s false, or the likelihood of detecting a true effect. A typical value for statistical power is 0.8 (80%), which means there’s an 80% chance of detecting a true difference if it exists.

Minimum detectable effect (MDE): This represents the smallest effect size (difference between the control and variant groups) you want to be able to detect in your test. The MDE should be meaningful and impactful for your business goals. Smaller MDEs require larger sample sizes to detect, while larger MDEs require smaller sample sizes.

To calculate the required sample size, you can use a sample size calculator designed for A/B testing, such as those provided by Optimizely, Evan Miller’s calculator, or use statistical software like R or Python libraries (e.g., statsmodels).

Here’s a general outline of the process:

Choose the desired statistical significance level (alpha), typically 0.05 or 0.01.

Select the desired statistical power, commonly 0.8.

Estimate the baseline conversion rate (current conversion rate) for the control group.

Define the minimum detectable effect as a percentage of the baseline conversion rate.

Input these values into a sample size calculator or statistical software to calculate the required sample size per group (control and variant).

Remember that the calculated sample size is for each group (control and variant), so the total number of users in the experiment will be twice the calculated sample size.

Keep in mind that larger sample sizes increase the likelihood of detecting smaller effects, but they also require more time and resources to collect. Therefore, it’s essential to strike a balance between the desired sensitivity and the practical constraints of your A/B testing scenario.
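
To see what that outline looks like in practice, here's a rough sketch of the calculation using statsmodels; the baseline rate and MDE below are just assumed numbers for illustration:

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10   # assumed current conversion rate (10%)
mde = 0.02             # assumed minimum detectable effect (+2 percentage points)
alpha = 0.05           # significance level
power = 0.80           # statistical power

# Convert the two proportions into an effect size (Cohen's h)
effect_size = proportion_effectsize(baseline_rate, baseline_rate + mde)

# Solve for the required sample size per group
n_per_group = NormalIndPower().solve_power(effect_size=effect_size, alpha=alpha, power=power, ratio=1.0, alternative="two-sided")
print(f"Required sample size per group: {int(round(n_per_group))}")

Plugging in a lower baseline rate or a smaller MDE quickly inflates the required sample size, which matches the trade-off ChatGPT describes above.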

Q6: How can I analyze the results I get from an A/B testing experiment on a specific KPI, to know whether the results are statistically significant?

To analyze the results of an A/B testing experiment for a specific Key Performance Indicator (KPI) and determine whether the results are statistically significant, you can follow these steps:

Define the KPI: Clearly define the KPI you want to analyze (e.g., conversion rate, click-through rate, average session duration).

Choose the appropriate statistical test: Select a statistical test based on the nature of your KPI (categorical or numerical) and the distribution of your data (normal or non-normal).

For categorical KPIs (e.g., conversion rate): Use the two-proportion z-test or Chi-square test.

For numerical KPIs (e.g., average session duration) with normal distribution and equal variances: Use the independent two-sample t-test.

For numerical KPIs with non-normal distribution or unequal variances: Use the Mann-Whitney U test.

Set a significance level: Choose a significance level (alpha) for your test, which is typically 0.05 or 0.01. This value represents the probability of rejecting the null hypothesis when it is true (Type I error).

Perform the chosen statistical test: Use a statistical software or programming language (e.g., Python, R) to perform the chosen test on your data. This will provide you with a test statistic and a p-value.

Interpret the results: Compare the p-value obtained from the test with your chosen significance level (alpha). If the p-value is less than alpha, the difference between the two groups (A and B) is considered statistically significant, meaning that the observed difference in the KPI is not likely due to chance. If the p-value is greater than or equal to alpha, the difference is not statistically significant, and you cannot confidently conclude that one group is better than the other based on the available data.

Consider practical significance: In addition to statistical significance, it’s essential to evaluate the practical significance of the results. Even if a difference is statistically significant, it may not be practically meaningful if the effect size is very small. Calculate the effect size (e.g., Cohen’s d for the t-test, odds ratio for the Chi-square test) and determine if the observed difference is large enough to justify implementing the change in your product or process.

Remember to also account for factors like sample size and test duration when interpreting the results of your A/B test, as they can influence the validity of your findings.
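
To make the "perform the chosen statistical test" and "interpret the results" steps concrete for myself, here's a minimal sketch of a two-proportion z-test on made-up conversion counts, using statsmodels:

from statsmodels.stats.proportion import proportions_ztest

conversions = [530, 590]    # converted users in control (A) and variant (B), invented numbers
visitors = [10000, 10000]   # users exposed to each version

z_stat, p_value = proportions_ztest(count=conversions, nobs=visitors)

alpha = 0.05
if p_value < alpha:
    print(f"Statistically significant difference (p = {p_value:.4f})")
else:
    print(f"No statistically significant difference (p = {p_value:.4f})")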

Six questions and one hour of asking and digesting answers later, I think I now have a basic understanding of A/B testing, of the measures necessary to conduct an A/B test, and of how to analyze the results.

In order to test my knowledge, I decided to try to analyze this dataset that I found on Kaggle which contains the results of an A/B test on users of a mobile game.

Attributes in the dataset:

  • userid (Numeric): unique identifier of the user playing the game
  • version (Categorical): whether the user was in the control group (gate_30, where the gate is shown at level 30) or the test group (gate_40, where the gate is shown at level 40).
  • sum_gamerounds (Numeric): the number of game rounds played by the user during the first 14 days after installing the game.
  • retention_1 (Boolean): Did the user re-play the game 1 day after installing it?
  • retention_7 (Boolean): Did the user re-play the game 7 days after installing it?

The three questions that we could potentially answer from this dataset are the following (from my understanding so far):

  • Does showing the gate at level 40 (version=gate_40) have a statistically significant impact on the number of game rounds played by the user in the first 14 days after installing the game ("sum_gamerounds")?
  • Does showing the gate at level 40 (version=gate_40) have a statistically significant impact on whether the user plays the game again 1 day after installing it ("retention_1")?
  • Does showing the gate at level 40 (version=gate_40) have a statistically significant impact on whether the user plays the game again 7 days after installing it ("retention_7")?

I'll be attempting to answer those questions to the best of my understanding, using Python.

Exploratory Data Analysis (EDA)

First of all, I'll try to understand whether there's a "significant" (not necessarily statistically significant) difference for each of the KPIs by plotting them in basic graphs, to get a better feel for the data:

Figure: Comparison of the percentage of users who returned 1 day after installing the game (left) and 7 days after installing it (right), for the control group "gate_30" and the test group "gate_40".
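
Here's a minimal sketch of how a comparison like this, and the summary stats below, could be produced, assuming the Kaggle CSV is loaded into a DataFrame df (the file name is my assumption):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("cookie_cats.csv")  # assumed file name for the Kaggle dataset

# Percentage of users retained after 1 and 7 days, per version
retention = df.groupby("version")[["retention_1", "retention_7"]].mean() * 100
retention.plot(kind="bar", subplots=True, layout=(1, 2), figsize=(10, 4), legend=False)
plt.tight_layout()
plt.show()

# Summary stats of game rounds per group (shown below)
print(df[df["version"] == "gate_30"]["sum_gamerounds"].describe())
print(df[df["version"] == "gate_40"]["sum_gamerounds"].describe())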

Here are some basic stats for the column "sum_gamerounds" in the control group (gate_30), using pandas' "describe" function:

count    44700.000000
mean        52.456264
std        256.716423
min          0.000000
25%          5.000000
50%         17.000000
75%         50.000000
max      49854.000000
Name: sum_gamerounds, dtype: float64

The same stats for "sum_gamerounds" in the test group (gate_40):

count    45489.000000
mean        51.298776
std        103.294416
min          0.000000
25%          5.000000
50%         16.000000
75%         52.000000
max       2640.000000
Name: sum_gamerounds, dtype: float64

First guess from EDA (without using any statistics): there does not seem to be much benefit to gain from moving the gate from level 30 to level 40 for any of the three KPIs, as the numbers look very similar. My first intuition as a data scientist (with no particular experience in A/B testing) would be that we don't even need further statistical tests, but should rather look into either testing a new hypothesis for retaining users or changing the KPIs.

For the sake of this exercise, however, I'll go ahead and conduct the statistical tests, as recommended by our good friend ChatGPT, and see what the results look like.

My feeling, after analyzing the results…

Statistical Significance Tests

We'll start with our two categorical KPIs, namely "retention_1" and "retention_7".

According to ChatGPT’s answer “For categorical KPIs (e.g., conversion rate): Use the two-proportion z-test or Chi-square test.”

Since I'm more familiar with the Chi-square test and how to conduct it in Python using scipy, I'll go ahead and run the test for both KPIs:

from scipy.stats import chi2_contingency

kpi = "retention_1"
# Create the contingency table: user counts per group (version) against the True/False values of the KPI
contingency_table = df.groupby(['version', kpi])['userid'].nunique().unstack()
chi2, p_value, _, _ = chi2_contingency(contingency_table)
alpha = 0.05  # Set the significance level

if p_value < alpha:
    print("The difference in proportions of 'True' values between the categories is statistically significant (p-value =", p_value, ")")
else:
    print("The difference in proportions of 'True' values between the categories is not statistically significant (p-value =", p_value, ")")

I ran it once for the KPI "retention_1" and once for "retention_7", with a p-value threshold (alpha) of 0.05.

The results came through as follows:

  • The difference for "retention_1" is not statistically significant. This means we cannot reject the null hypothesis, i.e., we cannot say whether moving the gate to level 40 has any impact on day-1 retention.
  • The difference for "retention_7" is actually statistically significant. However, the percentage of users who came back after 7 days is higher in the control group than in the test group. This means we can confidently say that, for 7-day retention, it's better to keep the gate where it is at level 30.

For the numerical KPI, "sum_gamerounds", here were ChatGPT's guidelines:

For numerical KPIs (e.g., average session duration) with normal distribution and equal variances: Use the independent two-sample t-test.

For numerical KPIs with non-normal distribution or unequal variances: Use the Mann-Whitney U test.

I tested for normality both visually and using the Shapiro-Wilk test, and the values are definitely not normally distributed, so we'll go with the Mann-Whitney U test. This time I even asked ChatGPT to write the code for the test, and I adjusted it a bit to fit the data; the adapted code follows the normality-check sketch below.
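
For reference, a minimal sketch of that normality check could look like this (I subsample because scipy documents that the Shapiro-Wilk p-value may be inaccurate for more than roughly 5,000 observations):

from scipy import stats

# Random subsample, since Shapiro-Wilk p-values are unreliable for very large samples
sample = df["sum_gamerounds"].sample(5000, random_state=42)
shapiro_stat, shapiro_p = stats.shapiro(sample)
print(f"Shapiro-Wilk statistic = {shapiro_stat:.4f}, p-value = {shapiro_p:.4g}")
# A p-value below 0.05 suggests the data are not normally distributed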

from scipy import stats

u_stat, p_value_u = stats.mannwhitneyu(
    df[df['version'] == 'gate_40']['sum_gamerounds'],
    df[df['version'] == 'gate_30']['sum_gamerounds'],
    alternative='two-sided'
)

# Check for statistical significance (alpha = 0.05, as set earlier)
alpha = 0.05
if p_value_u < alpha:
    print("Statistically significant")
else:
    print("Not statistically significant")

The results are in: "Not statistically significant".

Final thoughts & Verdict

This was a very efficient learning experience for me. I now feel much more confident talking about A/B testing and understanding the basics: what exactly an A/B test is, when it's suitable, how to design one, and how to interpret its results. I got all of this understanding in one hour, by asking exactly the questions I wanted answered, and I applied the knowledge to a dataset in another hour, and I'm not exactly the "fastest learner."

I personally see huge potential in how ChatGPT could revolutionize learning & education at all levels. It gives the user a customizable learning experience, can help generate code and instructions for a specific task, and even lets the user ask follow-up questions. Imagine how helpful this would be for someone who is more introverted and refrains from asking questions in a formal "classroom" setting. In my opinion, this could even help "democratize" knowledge and education to a great extent, and provide far more resources to the socially disadvantaged.

On the other hand, in its current version, ChatGPT's knowledge cannot always be taken for granted. It tends to make some pretty fundamental mistakes every once in a while, and the dangerous part is that it's right roughly 99% of the time, making it really hard to tell when it is completely off. I'm quite sure, however, that there will be many customized solutions built on GPT-4 and similar models that address these challenges.

Disclaimer: I attempted to learn the basics of A/B testing from scratch; however, my experience as a data scientist and a good feel for data and statistics definitely gave me a huge edge in picking this up quickly.

If you liked this article, make sure to follow me & this publication, and have a look at my previous article, where I attempt to solve programming problems from HackerRank using ChatGPT, and stay tuned for my upcoming ChatGPT experiments!
