Calculating Retention For Gaming Cohorts

Bryan
Mighty Bear Games
Published in
7 min readApr 3, 2024
Photo by Lukas Blazek on Unsplash

Calculating player retention is something almost any studio can do without needing to rely on third-party vendors. It requires some intermediate Structured Query Language (SQL), and is relatively simple once you have these four components in place. I will later also demonstrate how the queries can easily be modified to get insights about targetted cohorts.

  1. Data
    You need, of course, to be collecting data. These are the key elements that any piece of data should have: an identifier, a timestamp, and the relevant events according to what activating and retaining event you’re interested in.
  2. Cohort time interval
    A cohort is a group of players you’ve identified whose retention behaviour you’re trying to analyze. This is a decision on the period for which you want to group your cohorts. For example, if it’s daily cohorts, then we want to know, on a daily basis, how frequently these groups of players who logged in today come back on Day 1, Day 3, Day 7 (D1, D3, D7), etc.
  3. Activating event
    This is a decision about the minimum activity of a player before counting them in a cohort. It is typically the login event but can be anything from ‘at least 10 minutes spent in matches’ to ‘spent at least $10 in purchases’, as long as you have the data for it.
With this many (and more!) partners for our game, Mighty Action Heroes, it’s even more important to be able to track and analyze our most successful partnerships.

4. Retaining event
This is a decision about what type of behavior you want to track, after having decided who is included in the cohort. For example, you may have released a new, exclusive character that players can use, and the activating event is players who used that character before. Then, the retaining event can simply be login events. Effectively we’re asking, of this cohort of players who have used this exclusive character, what is their login behavior like?

Speaking of characters, have you met the exclusive Hero for Mighty Action Heroes Season 3: Bollywood Brawl?

The following segment will look at how to express this in SQL, and where to make relevant modifications to optimize querying even across multiple activating events. We’ll start with the classic example of daily cohorts, where the activating and retaining events are login events, and we want to see D1, D3, D7, and D30 activity.

In a ‘proper’ setup, we wouldn’t be querying raw_events, but if you’re reading articles on calculating retention, I’ll assume we’re still in the early stages. Note: we’ll be making use of CTEs quite a bit.

WITH raw_login_events AS (
SELECT player_id, timestamp
FROM raw_events
WHERE event_name='login_events'
),

login_events_processed AS (
SELECT cte.player_id,
cte.datetime,
cohorts.first as first,
CAST(EXTRACT(day FROM cte.datetime-first) AS int) AS day_number FROM

(SELECT player_id, datetime
FROM raw_login_events
GROUP BY player_id, datetime) cte,

(SELECT player_id, MIN(datetime) AS first
FROM raw_login_events
GROUP BY player_id) cohorts

WHERE cte.player_id=cohorts.player_id
)

SELECT first AS cohort,
sum(CASE WHEN day_number = 0 THEN 1 ELSE 0 END) AS day_0_actual,
sum(CASE WHEN day_number = 0 THEN 1 ELSE 0 END) / sum(CASE WHEN day_number = 0 THEN 1.0 END) AS day_0,
sum(CASE WHEN day_number = 1 THEN 1 ELSE 0 END) / sum(CASE WHEN day_number = 0 THEN 1.0 END) AS day_1,
sum(CASE WHEN day_number = 3 THEN 1 ELSE 0 END) / sum(CASE WHEN day_number = 0 THEN 1.0 END) AS day_3,
sum(CASE WHEN day_number = 7 THEN 1 ELSE 0 END) / sum(CASE WHEN day_number = 0 THEN 1.0 END) AS day_7,
sum(CASE WHEN day_number = 30 THEN 1 ELSE 0 END) / sum(CASE WHEN day_number = 0 THEN 1.0 END) AS day_30,
FROM login_events_processed
GROUP BY first
ORDER BY first

In this example, we want daily cohorts, so the timestamp here needs to be formatted as YYYY-MM-DD. Here it’s a DATE format, allowing me to do EXTRACT(DAY FROM …), but if all the date data is in integers e.g. 20240229, that works as well, since it is always unique, by day.

The logic happens in three key components:

  1. CTE - This is a ‘squashing’ of all login events. For a given day, for a given player, we only want to know if the player logged in, not the number of times in that day. By doing GROUP BY both player_id and datetime, each row is a unique instance in which a player_id logged in on any given day.
  2. cohorts - We’re looking for the first day we’ve ever seen this player_id, so MIN is used, and we call this the first.
  3. login_events_processed - We now have onhand the results of our CTE, and the cohort of each player_id from first. For any day that a player logged in, we then simply take the difference between the date of login and the first login day, calling it day_number. If it’s the first time, the day_number is 0 (a.k.a D0).
  4. final calculation - We now have data of columns: player_id, datetime, first, day_number. Each row represents a unique day a player logged in. Our retention table will have a row per cohort, and a day of analysis (D0, D1, D3 etc) per column. D0 is the first day the player has ever logged in, giving us the raw cohort size. D1 is then the percentage of all players who, after their first login, came back the next day. This is calculated by taking the SUM of day_number=1 / SUM of day_number=0, grouped by the cohort. We can then repeat this for any number of days we want.

That’s it! Download the output as a .csv, or use a visualization tool, and you can now measure the retention rates of each daily cohort by their login activity!

Now, what if I want to segment this by, say, Android vs iOS?

The key to optimizing these is to use ‘WITH’ statements so the computation of reading the data, JOINing, enriching, etc, is done as few times as possible.

Modifying our original query, we need to add a few things:

  1. The device_type, or whatever equivalent differentiating field
WITH raw_login_events AS (
SELECT player_id, timestamp, device_type
FROM raw_events
WHERE event_name='login_events'
),

What we don’t want, is to immediately filter out the device_types, because that would force us to query the raw_login_events twice!

WITH raw_login_events AS (
SELECT player_id, timestamp
FROM raw_events
WHERE event_name='login_events'
AND device_type = 'android' -> NOT a good idea
),

Instead, with raw_login_events now having the new field, we can go on to separate them with CTEs…

WITH raw_login_events AS (
SELECT player_id, timestamp, device_type
FROM raw_events
WHERE event_name='login_events'
),

cte_android AS (
SELECT player_id, timestamp
FROM raw_login_events
WHERE device_type='android'
),

cte_ios AS (
SELECT player_id, timestamp
FROM raw_login_events
WHERE device_type='ios'
),

…then run the intermediate logic separately, changing the FROM source:

android_login_events_processed AS (
SELECT cte.player_id,
cte.datetime,
cohorts.first as first,
CAST(EXTRACT(day FROM cte.datetime-first) AS int) AS day_number FROM

(SELECT player_id, datetime
FROM cte_android
GROUP BY player_id, datetime) cte,

(SELECT player_id, MIN(datetime) AS first
FROM cte_android
GROUP BY player_id) cohorts

WHERE cte.player_id=cohorts.player_id
),

ios_login_events_processed AS (
SELECT cte.player_id,
cte.datetime,
cohorts.first as first,
CAST(EXTRACT(day FROM cte.datetime-first) AS int) AS day_number FROM

(SELECT player_id, datetime
FROM cte_ios
GROUP BY player_id, datetime) cte,

(SELECT player_id, MIN(datetime) AS first
FROM cte_ios
GROUP BY player_id) cohorts

WHERE cte.player_id=cohorts.player_id
),

Lastly, a neat little trick to combine the queries to reduce the number of queries:

  • Add a field differentiating them (e.g. device_type), then use a UNION
  • This is so we only query raw_events once
  • The output is a single table which can be filtered as needed
android_retention AS (
SELECT device_type AS 'android', first AS cohort,
sum(CASE WHEN day_number = 0 THEN 1 ELSE 0 END) AS day_0_actual,
sum(CASE WHEN day_number = 0 THEN 1 ELSE 0 END) / sum(CASE WHEN day_number = 0 THEN 1.0 END) AS day_0,
sum(CASE WHEN day_number = 1 THEN 1 ELSE 0 END) / sum(CASE WHEN day_number = 0 THEN 1.0 END) AS day_1,
sum(CASE WHEN day_number = 3 THEN 1 ELSE 0 END) / sum(CASE WHEN day_number = 0 THEN 1.0 END) AS day_3,
sum(CASE WHEN day_number = 7 THEN 1 ELSE 0 END) / sum(CASE WHEN day_number = 0 THEN 1.0 END) AS day_7,
sum(CASE WHEN day_number = 30 THEN 1 ELSE 0 END) / sum(CASE WHEN day_number = 0 THEN 1.0 END) AS day_30,
FROM android_login_events_processed
GROUP BY first
ORDER BY first
),

ios_retention AS (
SELECT device_type AS 'ios', first AS cohort,
sum(CASE WHEN day_number = 0 THEN 1 ELSE 0 END) AS day_0_actual,
sum(CASE WHEN day_number = 0 THEN 1 ELSE 0 END) / sum(CASE WHEN day_number = 0 THEN 1.0 END) AS day_0,
sum(CASE WHEN day_number = 1 THEN 1 ELSE 0 END) / sum(CASE WHEN day_number = 0 THEN 1.0 END) AS day_1,
sum(CASE WHEN day_number = 3 THEN 1 ELSE 0 END) / sum(CASE WHEN day_number = 0 THEN 1.0 END) AS day_3,
sum(CASE WHEN day_number = 7 THEN 1 ELSE 0 END) / sum(CASE WHEN day_number = 0 THEN 1.0 END) AS day_7,
sum(CASE WHEN day_number = 30 THEN 1 ELSE 0 END) / sum(CASE WHEN day_number = 0 THEN 1.0 END) AS day_30,
FROM ios_login_events_processed
GROUP BY first
ORDER BY first
)

SELECT * FROM android_retention
UNION
SELECT * FROM ios_retention

With this in mind, it’s a matter of being creative in terms of what other filters, or different activating events you might want. For example, a team might only want to look at cohorts where players spent at least 600 seconds in-game. You would handle this filter after querying raw events, but before calculating cohorts — give it a different cte_name and let it be another filterable output in the final table!

--

--