Classification of users. KNN method

Ruslan Valeev
4 min readApr 8, 2022

--

In the last article, we discussed how to use percentiles to set thresholds. For example, they are useful for simple user classifications by the amount of their payments. Let’s split this random sample by the amount users paid where the stars will be intervals of the 20% percentile.

This is useful to find out how many users of which class do we have and how their classes are changing over time. But what if we want to solve a more complex problem? For example, try to predict the class of the user based on the purchases of his first days.

In this case, user classification methods will help us. There are a fairly large number of different methods, in this article we will consider the simplest machine learning algorithm - k-nearest neighbors. I will use Google Spreadsheets only to demonstrate the logic of that method, in real case scenario this is a task usually of a some server side offer system.

To use that we will need 2 components: historical data and the number of groups for user classification. Let’t take the example above and decompose it into the boolean (yes or no) values by the offers.

I took a sample of only 25 objects for convenience, the more of them the model will be more accurate. Let’s assume that we are looking at paying users who have already lived for a while and as the offer system, so the other monetization features have remained unchanged since that time.

values and offers are given as an example, but I tried to make them logical

KNN method tries to classify an object by a set of similar features by calculating the smallest Euclidean distance between points. The result is calculated using the formula √((x2-x1)²+(y2-y1)²+(yN-yN)²) for each point set and the decision is made on the basis of the most frequently occurring result among the k observations. To find k, you can use formula k=n^(1/2) where n is the number of instances.

Let’s add some random user and calculate the distance between him and 1st user in our data set

We will do this for all rows, in order to more conveniently interpret the results, we will display the rank of each cell of the result relative to all total

Our k is = 25^0.5 =5.

In the case of statistical tools, the number k would be selected iteratively to calculate the most accurate result.

Now let’s look at the values that are stored in the first five ranks. Most common class value is “2 star”, so the user that bought a medium pack of crystals and battle pass most likely will be a 2 star paying user.

We may use this information to offer more personalized offers. It is important to say that the system must be self-learning and must be updated with changes made to your game.

Published:

  1. How to make retention model and calculate LTV for mobile game
  2. How to calculate ROI and predictive LTV with the first real data
  3. How to check for data anomalies and outliers
  4. Classification of users. KNN method
  5. How to check the representativeness of the data sample

Upcoming:

  1. How to identify the correlation between events. An example based on user’s actions on the first day and their impact on retention
  2. What types of data do we usually work with in mobile games
  3. Practical examples of the choice of statistical criteria
  4. Bootstrap method. How to identify statistical significance on a limited date. What are its advantages
  5. How to reduce the date accumulation time in a/b tests

--

--