Sampling and representativeness

Apr 10, 2022

When evaluating the results of a campaign or planning an A/B test, one of the important questions to ask is: how representative is this sample? The representativeness of a sample is its ability to accurately reflect the properties of the general population.

A simple example would be analyzing organic traffic and trying to apply the findings to paid traffic. Organic users, for various reasons, differ from paid ones, so they cannot describe the whole player base.

To answer the question of how representative your sample is, you need to decompose the available information into user characteristics and make sure that all user groups are taken into account when analyzing or planning.

Let’s look first at various sampling methods:

Random

Each object has an equal chance of being selected. This method is easy to implement, but its disadvantage is that it ignores user characteristics, which can vary greatly. For example, you can end up with significantly more non-paying users than paying users in an experiment where paying behavior is important.
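A minimal Python sketch of simple random sampling. The population, the 5% paying share and the field names (`user_id`, `is_paying`) are made up for illustration. It shows how the paying share in a random sample can drift away from the share in the population.

```python
import random

def simple_random_sample(users, n, seed=None):
    """Draw n users without replacement; every user has the same chance."""
    rng = random.Random(seed)
    return rng.sample(users, n)

# Hypothetical population: 10,000 users, 5% of them paying.
population = [{"user_id": i, "is_paying": i % 20 == 0} for i in range(10_000)]

sample = simple_random_sample(population, 200, seed=42)
paying_share = sum(u["is_paying"] for u in sample) / len(sample)
print(f"paying share in sample: {paying_share:.1%} (population: 5.0%)")
```

With a small sample, the printed share will often differ noticeably from 5%, which is exactly the weakness described above.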

Systematic

Users are selected from an ordered list at a fixed interval, for example every tenth user. The problem is that if the series of users contains a certain pattern, you may end up selecting mostly irrelevant users. For example, you decided to take 5,000 users, but the sample included a month when you had a large share of low-quality traffic. Besides, this method still does not take user characteristics into account.
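A minimal Python sketch of systematic sampling, assuming the users sit in an ordered list (by install date, for example) and every k-th user is taken from a random starting offset. The list of integer ids is a placeholder for real user records.

```python
import random

def systematic_sample(users, n, seed=None):
    """Pick every k-th user from an ordered list, starting at a random offset."""
    k = len(users) // n  # sampling interval
    if k < 1:
        raise ValueError("sample size exceeds population size")
    start = random.Random(seed).randrange(k)
    return users[start::k][:n]

# Hypothetical ordered population of 50,000 user ids.
ordered_users = list(range(50_000))
sample = systematic_sample(ordered_users, 5_000, seed=1)
print(len(sample), sample[:3])
```

Because the interval is fixed, any periodic pattern in the ordering that lines up with the interval carries straight into the sample.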

Cluster

Selection based on user groupings. For example, you decide to consider all the users of realm 1 out of your set of realms.
This method can be quite accurate, but you need to take into account the characteristics of the group and be sure that it is similar to the others if you want to draw conclusions from it about the entire population. For example, the Asian realm could differ from the European one.
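A minimal Python sketch of cluster sampling, assuming each user record carries a hypothetical `realm` field. Here the realm is chosen at random and every user in it is kept; picking a specific realm by hand, as in the example above, works the same way once the cluster is chosen.

```python
import random
from collections import defaultdict

def cluster_sample(users, cluster_key, n_clusters, seed=None):
    """Randomly pick whole clusters and keep every user inside them."""
    clusters = defaultdict(list)
    for user in users:
        clusters[user[cluster_key]].append(user)
    chosen = random.Random(seed).sample(sorted(clusters), n_clusters)
    return chosen, [u for key in chosen for u in clusters[key]]

# Hypothetical population: 1,000 users spread evenly over 4 realms.
users = [{"user_id": i, "realm": f"realm_{i % 4 + 1}"} for i in range(1_000)]
chosen_realms, sample = cluster_sample(users, "realm", n_clusters=1, seed=7)
print(chosen_realms, len(sample))
```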

Stratified

This method is based on user characteristics and ensures that all types of users are included in your sample according to their weights in the population. For instance, suppose we decide to select users by a combination of country, paying status, and character level. We need to estimate the shares of these user groups in the total population and choose a sample with the same shares. This method may be the most complex, but the probability of drawing false conclusions is lower than with the others.
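A minimal Python sketch of proportional stratified sampling. The population is simulated, and the field names (`country`, `is_paying`, `level_band`) and their values are assumptions made for illustration. Each combination of the three characteristics forms a stratum, and each stratum contributes users in proportion to its share of the population.

```python
import random
from collections import defaultdict

def stratified_sample(users, strata_keys, n, seed=None):
    """Sample each stratum in proportion to its share of the population."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for user in users:
        strata[tuple(user[k] for k in strata_keys)].append(user)

    sample = []
    for members in strata.values():
        # Proportional allocation; rounding can shift the total by a few users.
        quota = round(n * len(members) / len(users))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample

# Hypothetical population of 20,000 users with three characteristics.
gen = random.Random(0)
users = [
    {
        "user_id": i,
        "country": gen.choice(["DE", "JP", "US"]),
        "is_paying": gen.random() < 0.05,
        "level_band": gen.choice(["1-10", "11-30", "31+"]),
    }
    for i in range(20_000)
]
sample = stratified_sample(users, ["country", "is_paying", "level_band"], 2_000, seed=1)

pop_paying = sum(u["is_paying"] for u in users) / len(users)
smp_paying = sum(u["is_paying"] for u in sample) / len(sample)
print(f"paying share: population {pop_paying:.2%}, sample {smp_paying:.2%}")
```

The two printed shares should be close to each other, since paying status is one of the stratification keys.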

Representativeness

Imagine that we have a sample. How can we check statistically that it is representative? A good solution could be an A/A test, which divides the sample into two groups and compares them with each other.

We need to get an answer to two questions:

What is the probability of seeing a difference when there really is none? This is the false positive rate, also called the significance level. Usually we accept 5% of false results.
How likely is it to see a difference when there really is one? This is called sensitivity, or statistical power. The standard value is 80%.

For example, let’s say we want to compare player retention on the first day.

The algorithm for getting the answer to the first question will be:

Divide the group into A1 and A2. Let’s say each has 2,500 users.
Iterate comparisons. Take a sub-group of X people (100, for example) from A1 and compare their retention to a sub-group of 100 people from A2. In this way we simulate the number of daily users. Repeat this for N comparisons. The bootstrap method is one way of doing that.
In each comparison, use several statistical methods at once, for example a t-test and Mann-Whitney in addition to bootstrap. (I will talk about their features and use cases in future articles.)
Make sure that the percentage of comparisons that show a difference does not exceed 5%. If the value is higher, then your sample is not homogeneous and should not be used for an A/B test.
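The steps above can be sketched in Python as follows. This is a simplified illustration, not a full implementation: the retention data is simulated, and a two-proportion z-test stands in for the t-test, Mann-Whitney and bootstrap mentioned above so that the sketch needs only the standard library. All function names are hypothetical.

```python
import math
import random

def two_proportion_p_value(x1, n1, x2, n2):
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # both sub-groups are all 0s or all 1s: no evidence of a difference
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))

def aa_false_positive_rate(a1, a2, subgroup=100, n_comparisons=2_000, alpha=0.05, seed=None):
    """Share of A1-vs-A2 sub-group comparisons flagged as significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_comparisons):
        s1 = rng.sample(a1, subgroup)
        s2 = rng.sample(a2, subgroup)
        if two_proportion_p_value(sum(s1), subgroup, sum(s2), subgroup) < alpha:
            hits += 1
    return hits / n_comparisons

# Hypothetical data: 5,000 users, 1 = returned on day 1, true retention 35%.
gen = random.Random(42)
users = [1 if gen.random() < 0.35 else 0 for _ in range(5_000)]
gen.shuffle(users)
a1, a2 = users[:2_500], users[2_500:]

fpr = aa_false_positive_rate(a1, a2, seed=1)
print(f"share of significant A/A comparisons: {fpr:.1%}")
```

Since the result is itself a simulation estimate, a value slightly above 5% may just be noise; increasing `n_comparisons` narrows that uncertainty.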

To determine the sensitivity, we repeat the same steps, but in group A1 we increase the value of the metric by a certain step, let’s say by 0.25%.

For example, if in the first iteration a sub-group of A1 shows a retention of 35%, we add 0.25% to it and calculate the p-value against the unchanged value of A2.

After a series of iterations, we should get a matrix of values: the size of the added effect and the statistical significance at which the change becomes detectable.
For example, we may find that significance appears only at a 7% change and X users (sub-group size × number of iterations), which is difficult to achieve for a first-day metric.
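A rough Python sketch of the sensitivity check, under several assumptions: the data is simulated, the uplift is added to the observed retention rate of the A1 sub-group as an absolute value (0.05 means five percentage points), and a two-proportion z-test is used as a dependency-free stand-in for the tests named earlier. The steps are coarser than 0.25% to keep the output short, and the function names are hypothetical.

```python
import math
import random

def p_value_from_rates(p1, p2, n1, n2):
    """Two-sided p-value of a pooled two-proportion z-test on retention rates."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    return math.erfc(abs(p1 - p2) / se / math.sqrt(2))

def sensitivity_curve(a1, a2, uplifts, subgroup=100, n_comparisons=1_000, alpha=0.05, seed=None):
    """For each artificial uplift, the share of comparisons that detect it."""
    rng = random.Random(seed)
    detected = {u: 0 for u in uplifts}
    for _ in range(n_comparisons):
        r1 = sum(rng.sample(a1, subgroup)) / subgroup
        r2 = sum(rng.sample(a2, subgroup)) / subgroup
        for u in uplifts:
            shifted = min(r1 + u, 1.0)  # add the uplift to A1 only, capped at 100%
            if p_value_from_rates(shifted, r2, subgroup, subgroup) < alpha:
                detected[u] += 1
    return {u: hits / n_comparisons for u, hits in detected.items()}

# Hypothetical data: 5,000 users, 1 = returned on day 1, true retention 35%.
gen = random.Random(42)
users = [1 if gen.random() < 0.35 else 0 for _ in range(5_000)]
a1, a2 = users[:2_500], users[2_500:]

uplifts = [0.0, 0.05, 0.10, 0.15, 0.20]
curve = sensitivity_curve(a1, a2, uplifts, seed=1)
for u in uplifts:
    print(f"uplift {u:.2f}: detected in {curve[u]:.1%} of comparisons")

# Smallest tested uplift that reaches the standard 80% sensitivity, if any.
min_detectable = next((u for u in uplifts if curve[u] >= 0.80), None)
print("smallest uplift with >= 80% sensitivity:", min_detectable)
```

Running the same function for several `subgroup` sizes gives the matrix of effect size against sample size described above.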

If our metric is not very sensitive, it would be better to look at proxy metrics, for example the tutorial progress funnel, and repeat the same steps for them.

In addition, there are variance reduction methods that can also help, but we will talk about them in future articles.

Upcoming:

How to identify the correlation between events. An example based on users’ actions on the first day and their impact on retention
What types of data we usually work with in mobile games
Practical examples of the choice of statistical criteria
Bootstrap method. How to identify statistical significance on limited data. What are its advantages
How to reduce the data accumulation time in A/B tests

Written by Ruslan Valeev