Outliers and anomalies in game analytics

Ruslan Valeev
4 min read · Apr 7, 2022


When analyzing game data, you will sometimes encounter values that go beyond the normal behavior of the system or the player. Such data points are either outliers or anomalies.

Let’s define what is considered an outlier and what is an anomaly.

Outliers are legitimate observations that lie far from the other values in the data distribution. We can think of them as extreme data points.

Examples of outliers:

  • A single whale payment in a cohort of paying users
  • Completing a very difficult level on the first try
  • Extremely high CCU

Outliers affect averages and may have a negative impact on your product decisions. For instance, you launch a UA campaign and see a good ROAS, but in fact it was achieved thanks to one random whale player. Therefore, when analyzing aggregate data, it is worth checking the components it consists of.
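To make the ROAS example concrete, here is a minimal sketch in Python. The revenue figures, the ad spend and the helper name `roas` are all hypothetical, made up purely for illustration. It compares the campaign ROAS with and without the single largest payer and shows how much of the revenue that one player accounts for.

```python
# Hypothetical revenue per paying user (USD) for one UA cohort; not real data.
revenue = [1.0, 2.0, 1.0, 3.0, 2.0, 1.0, 500.0]
ad_spend = 300.0  # hypothetical campaign cost in USD


def roas(revenues, spend):
    """Return on ad spend: total revenue divided by campaign cost."""
    return sum(revenues) / spend


# ROAS for the whole cohort.
roas_all = roas(revenue, ad_spend)

# ROAS after removing the single largest payer.
top_payment = max(revenue)
without_top = list(revenue)
without_top.remove(top_payment)
roas_without_top = roas(without_top, ad_spend)

# Share of total revenue that comes from that one payer.
top_share = top_payment / sum(revenue)

print(f"ROAS, all payers:       {roas_all:.2f}")
print(f"ROAS, top payer removed: {roas_without_top:.2f}")
print(f"Top payer share:         {top_share:.0%}")
```

If the two ROAS values differ this much, the campaign result rests on one player rather than on the cohort as a whole.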

Outliers can be excluded from the sample, but this must be a conscious decision, and you must understand why you are excluding a given outlier. Outliers also add to your knowledge of the product: they show that some cases, although rare, can still occur and need to be taken into account.

Now let’s proceed to anomalies:

Anomalies are events or observations that deviate significantly from the majority of the data and do not conform to an expected pattern.

Examples of anomalies:

  • Incredibly fast level completion
  • A sudden stop in the flow of payments (if they were stable before)
  • Sudden increase in values in the funnel where it is logically impossible

Anomalies are often a sign that something is wrong with your system and needs to be fixed. The cause can be cheating, problems with your infrastructure, or problems with the services you work with.

The easiest way to identify deviations is to plot the values. Let’s take a sample of payments as an example. We immediately see the value that stands out from the general distribution.

To illustrate the distribution of values, let’s look at the quartiles of the data set.

Distribution quartiles are quantiles that are multiples of 25%, that is, they correspond to 25%, 50%, and 75% of the dataset values. They are also sometimes called “first”, “second” and “third”, respectively, or “lower”, “middle” and “upper”. We will denote them as “Q1”, “Q2” and “Q3”, respectively.

The second quartile is a useful statistic in its own right, as it shows that 50% of the observations in the sample are below this number, and the rest are correspondingly higher, that is, it actually divides the sample in half. You might have heard of it by the name “median”.

To calculate them in Google Sheets, use the =QUARTILE() function. As we can see, 75% of the values (Q3, the sum of the 25% + 25% + 25% intervals) are below 2.25.
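Outside a spreadsheet, the same quartiles can be sketched in Python with the standard library. The `payments` list below is a hypothetical sample, not the data set from this article. The `method="inclusive"` option is chosen on the assumption that it matches the interpolation the spreadsheet function uses, so treat the comparison as approximate.

```python
from statistics import median, quantiles

# Hypothetical payment amounts in USD; not the article's data set.
payments = [1, 1, 1, 2, 2, 2, 3, 6, 35]

# n=4 splits the data into four equal parts and returns the three cut points.
# "inclusive" treats the sample minimum and maximum as the 0% and 100% points.
q1, q2, q3 = quantiles(payments, n=4, method="inclusive")

print(f"Q1 (25%): {q1}")
print(f"Q2 (50%): {q2}")
print(f"Q3 (75%): {q3}")

# The second quartile is the median of the sample.
print(f"Median:   {median(payments)}")
```

Note how the single large payment barely moves the quartiles, while it would pull the arithmetic mean far above the typical payment.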

The $35 payment is most likely an outlier (based on this data and with no other historical evidence). Note that $6 is also a rare value for this sample: it may come from a future whale who is still eyeing the game, or from a large dolphin.

To work with outliers, let’s set a threshold above which we will consider a case non-standard. For example, we can say that everything exceeding the 90th percentile is not a common situation (in this case, a whale). We can use the =PERCENTILE() function, passing it the data set and the required percentile value.

This means that 90% of all our values are below $5. Now you can apply this knowledge to highlight all values above 5 in the dataset, for example, through conditional formatting.
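The same thresholding step can be sketched in Python. The sample below is hypothetical, so its threshold differs from the $5 obtained from the article's data, and the variable names are illustrative only.

```python
from statistics import quantiles

# Hypothetical payment amounts in USD; not the article's data set.
payments = [1, 1, 1, 2, 2, 2, 3, 3, 6, 35]

# n=10 returns nine decile cut points; the last one is the 90th percentile.
threshold = quantiles(payments, n=10, method="inclusive")[-1]

# Analogue of conditional formatting: keep only values above the threshold.
flagged = [p for p in payments if p > threshold]

print(f"90th percentile threshold: {threshold}")
print(f"Values above threshold:    {flagged}")
```

The threshold percentile is a business choice rather than a statistical rule: a lower percentile flags more payments for review, a higher one flags only the most extreme.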

Percentiles are useful in many cases, for example, for automatic alerts on the amount of payments per hour or the number of players online, or for detecting cheaters, as a simple form of anomaly detection. Identifying deviations is a separate task in statistics, and in business-oriented cases Google Sheets is usually replaced by statistical packages and by methods such as k-nearest neighbors and machine learning.
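As a rough illustration of a percentile-based alert, the sketch below checks a new hourly payment count against bounds derived from recent history. The history values, the 5th/95th percentile bounds and the function names `percentile_bounds` and `check_value` are assumptions made for this example, not a recommended production setup.

```python
from statistics import quantiles


def percentile_bounds(history, lower_pct=5, upper_pct=95):
    """Return the (lower, upper) percentile values of the history."""
    # n=100 returns 99 cut points; index k-1 holds the k-th percentile.
    cuts = quantiles(history, n=100, method="inclusive")
    return cuts[lower_pct - 1], cuts[upper_pct - 1]


def check_value(value, history):
    """Classify a new observation as 'low', 'high' or 'ok'."""
    low, high = percentile_bounds(history)
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "ok"


# Hypothetical number of payments per hour over the last 12 hours.
hourly_payments = [42, 38, 45, 40, 39, 44, 41, 43, 37, 46, 40, 42]

for new_value in (41, 0, 120):
    status = check_value(new_value, hourly_payments)
    print(f"{new_value:>3} payments this hour: {status}")
```

A "low" result for an hour with zero payments corresponds to the stopped payment flow mentioned among the anomaly examples. A real alert would also need to account for daily and weekly seasonality, which this sketch ignores.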

Published:

  1. How to make retention model and calculate LTV for mobile game
  2. How to calculate ROI and predictive LTV with the first real data
  3. How to check for data anomalies and outliers
  4. Classification of users. KNN method
  5. How to check the representativeness of the data sample

Upcoming:

  1. How to identify the correlation between events. An example based on user’s actions on the first day and their impact on retention
  2. What types of data do we usually work with in mobile games
  3. Practical examples of the choice of statistical criteria
  4. Bootstrap method. How to identify statistical significance on limited data. What are its advantages
  5. How to reduce the data accumulation time in a/b tests
