Statistical Significance
WHY
A numeric result is statistically significant when it is very unlikely to have occurred by natural variation (aka chance) alone.
WHEN
In general, a significance check is required for every numeric result, e.g.:
- Comparison of 2 cohorts
- A measure change (is it natural variation or a real shift in the metric?)
- Summarization of data: average, median
HOW
UseCase1
CohortA average is higher than CohortB average, but is it statistically significant?
Tool: https://www.evanmiller.org/ab-testing/t-test.html Note that it also supports entering summary values (mean, std, usersCount) instead of a list of raw values.
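For a scriptable alternative to the linked tool, a minimal sketch using SciPy (the cohort values below are made up for illustration):

```python
from scipy import stats

cohort_a = [12.1, 14.3, 11.8, 15.0, 13.2, 12.7]
cohort_b = [10.9, 11.5, 12.0, 10.2, 11.8, 11.1]

# Welch's t-test (does not assume equal variances) on raw values.
t_stat, p_value = stats.ttest_ind(cohort_a, cohort_b, equal_var=False)
print(f"t={t_stat:.3f}, p={p_value:.4f}")  # p < 0.05 => difference is significant

# Like the linked tool, SciPy also accepts summary values instead of raw data.
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=13.2, std1=1.2, nobs1=6,
    mean2=11.3, std2=0.7, nobs2=6,
    equal_var=False,
)
```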
UseCase2
Does the rate of success differ across two groups? (aka Proportion testing, very commonly used in A/B testing)
Tool: https://www.evanmiller.org/ab-testing/chi-squared.html
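A minimal sketch of the same proportion test using SciPy's chi-squared test on a 2x2 contingency table (the counts below are made up):

```python
from scipy.stats import chi2_contingency

# 2x2 contingency table: rows = groups, columns = [successes, failures].
table = [[120, 880],   # group A: 120 conversions out of 1000
         [150, 850]]   # group B: 150 conversions out of 1000

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.3f}, p={p_value:.4f}")  # p < 0.05 => rates differ significantly
```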
UseCase3
A metric trend has changed; is it just natural (expected) variation, or is it in fact such an unlikely value (given its history) that it's a real shift?
Tool: time series analysis with CausalImpact (an R package): give it an unbiased pre-change trend as input and it validates whether the change in trend is significant.
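CausalImpact itself is an R package; as a rough illustration of the underlying idea (not its actual model), the sketch below fits a linear trend on the pre-change history and checks whether the post-change value falls outside the ~95% prediction band (all numbers are made up):

```python
import numpy as np

history = np.array([100, 102, 101, 104, 103, 105, 107, 106, 108, 110])  # pre-change
observed_after = 125.0                                                   # post-change point

t = np.arange(len(history))
slope, intercept = np.polyfit(t, history, 1)        # linear trend on history
residual_std = np.std(history - (slope * t + intercept))

predicted = slope * len(history) + intercept        # extrapolate one step ahead
lo, hi = predicted - 1.96 * residual_std, predicted + 1.96 * residual_std
print(f"expected ~{predicted:.1f} in [{lo:.1f}, {hi:.1f}], observed {observed_after}")
# Observed value outside the band => unlikely to be natural variation; likely a real shift.
```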
UseCase4
CohortA median is higher than CohortB median, but is it statistically significant?
Tool: Mood's median test http://www.real-statistics.com/non-parametric-tests/moods-median-test-two-samples/ or the Wilcoxon rank-sum test http://www.real-statistics.com/non-parametric-tests/wilcoxon-rank-sum-test/ (equivalent to the Mann-Whitney test http://www.real-statistics.com/non-parametric-tests/mann-whitney-test/; for >2 groups use Kruskal-Wallis)
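A minimal sketch of both tests using SciPy (values made up):

```python
from scipy.stats import median_test, mannwhitneyu

cohort_a = [5, 7, 6, 9, 8, 7, 10]
cohort_b = [4, 5, 3, 6, 5, 4, 5]

# Mood's median test: do the two samples share the same median?
stat, p_mood, grand_median, table = median_test(cohort_a, cohort_b)

# Mann-Whitney U (same test as Wilcoxon rank-sum).
u_stat, p_mw = mannwhitneyu(cohort_a, cohort_b, alternative="two-sided")

print(f"Mood's p={p_mood:.4f}, Mann-Whitney p={p_mw:.4f}")  # p < 0.05 => significant
```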
UseCase5
How many users are needed to be confident that a change is significant? Very commonly needed for A/B tests, but also useful for getting a feel for a cohort's minimum size.
Tool: https://www.evanmiller.org/ab-testing/sample-size.html
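For a scriptable alternative, a minimal sketch using statsmodels power analysis (the baseline rate and minimum detectable effect below are made-up inputs):

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Baseline conversion 10%; we want to detect a lift to 12%.
effect = proportion_effectsize(0.10, 0.12)
n_per_group = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
)
print(f"~{n_per_group:.0f} users needed per group")
```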
UseCase6
Average is x, but the variation within the original list of values is big, so how confident are we in it?
Tool: report a 95% confidence interval for the mean: [x - 1.96*std/sqrt(n), x + 1.96*std/sqrt(n)], where std is the sample standard deviation and n the number of values. (The wider band [x - 1.96*std, x + 1.96*std] describes where individual values are expected to fall, not our confidence in the mean.)
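A minimal sketch computing this interval (values made up; the t distribution replaces 1.96 since the sample is small):

```python
import numpy as np
from scipy import stats

values = np.array([12.1, 14.3, 11.8, 15.0, 13.2, 12.7, 13.9, 12.4])

mean = values.mean()
sem = values.std(ddof=1) / np.sqrt(len(values))     # standard error of the mean
t_crit = stats.t.ppf(0.975, df=len(values) - 1)     # ~1.96 for large n
print(f"mean={mean:.2f}, 95% CI=[{mean - t_crit*sem:.2f}, {mean + t_crit*sem:.2f}]")
```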
WHAT CAUSED IT?
Isolate potential causes as hypotheses and test them.
A/B testing method: isolate/randomize every factor other than the hypothesis being tested (seasonality, time of day, type of user, etc.), e.g. a UX change, so that when a significant change is observed we know it was caused by the UX change.
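As a hypothetical sketch of the randomization step, deterministic bucketing by hashing the user id gives a stable 50/50 split, so all other factors distribute evenly across both groups (the function and experiment name are illustrative, not from any specific library):

```python
import hashlib

def assign_group(user_id: str, experiment: str = "ux_change_v1") -> str:
    """Stable 50/50 split: the same user always lands in the same group."""
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 2 == 0 else "B"

print(assign_group("user_42"))  # e.g. 'A'; repeated calls return the same group
```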