Three tables every analyst needs
Dan is the Founder of Outlier, your fractional data team | weareoutlier.com
Introduction
If you’re one of the first analysts joining a company, you will probably arrive to find an overwhelming amount of data. You’ll almost certainly have some relational data (probably powering your application’s backend), and likely have event data too (possibly through a third party like Segment).
This can make it difficult to know where to start with each analytical request. You might be tempted to create a new dataset for each analysis, but this will add to the amount of data and quickly becomes difficult to manage.
Fortunately, you can gather all of your data into three tables that can then be used to answer the majority of user related questions.
The SQL snippets in this post apply to Google BigQuery but can easily be adapted for other data warehouses.
Table 1: User Stats
To start, you need a table that tells you all about your users. This table will contain dimensions (e.g. user account creation timestamp) and metrics (e.g. user lifetime transaction count). It might be simple to begin with, but as your company grows and gets more complex, you’ll keep adding to this table.
Key point: user_stats
has one row per user
SELECT * FROM user_stats
What kind of questions can this table be used to answer?
How many users signed up this week?
SELECT
COUNT(*) as user_count
FROM
user_stats
WHERE
created_at >= TIMESTAMP_SUB(current_timestamp, INTERVAL 7 DAY)
How many users have made at least 5 transactions?
SELECT
COUNT(*) as user_count
FROM
user_stats
WHERE
transactions_lifetime >= 5
How many users have logged in within the past 24 hours?
SELECT
COUNT(*) as user_count
FROM
user_stats
WHERE
last_seen_at >= TIMESTAMP_SUB(current_timestamp, INTERVAL 24 HOUR)
Table 2: Daily user stats
The need for this table might not be so obvious, but it turns out that having a table describing each user’s activity every single day since they signed up (even if they were not active at all on that day) is really useful. There are four dimensions you’ll definitely want in this table:
user_id
: a unique identifier for your users, that you can link back touser_stats
date
: what date does this row correspond today_n
: what day (in the lifetime of the user) does this date represent. On the day a user signs upday_n = 1
, the day after sign upday_n = 2
etcwas_active
: was the user active on this day? (Exactly how you choose to define active should depend on the specifics of your business. If you are creating a messaging app, you may define "active" as having sent at least one message, for example)
As with user_stats
, this table will probably end up with many more than these 4 fields as your business becomes more complex.
Key point: daily_user_stats
has one row per user per day (for all days since they signed up)
SELECT * FROM daily_user_stats ORDER BY user_id, day_n
What kind of questions can this table be used to answer?
What percentage of users are active one week after they signed up?
SELECT
AVG(was_active) as active_user_ratio
FROM
daily_users_stats
WHERE
day_n = 8
How has the percentage of users active the day after they sign up trended over time?
SELECT
DATE(user_stats.created_at) AS created_date,
AVERAGE(was_active) as active_user_ratio
FROM
daily_user_stats
INNER JOIN user_stats
ON daily_user_stats.user_id = user_stats.user_id
WHERE
day_n = 2
ORDER BY created_date ASC
What does our user retention curve look like?
SELECT
day_n,
AVERAGE(was_active) as active_user_ratio
FROM
daily_user_stats
INNER JOIN user_stats
ON daily_user_stats.user_id = user_stats.user_id
WHERE
user_stats.created_at >= TIMESTAMP_SUB(current_timestamp, INTERVAL 7 DAY)
AND day_n <= 7
Table 3: Sessions
The sessions
table is a place to store your event data. Events are user interactions on your app or site (e.g. button clicks, form submissions, message sends etc). You probably track many different events, and finding a consistent approach to analysing them can be tricky. Sessions is a table to organise the chaos!
A session is a collection of events that represent one end to end interaction with your product: user opens app > user opens product abc page > user clicks add to cart > user closes app. How you combine a user’s events into a session depends on your product, but a standard approach is to end a session after a period of 30 minutes of inactivity. Any subsequent events from that user will be added to a new session.
Using some relatively complex SQL, you can split your raw events into sessions. Your sessions table should at least have these dimensions:
user_id
: unique user identifiersession_id
: Starting at 1 (the user's first ever session), incrementing for each further sessionstart_time
: timestamp of first event in sessionend_time
: timestamp of last event in sessionevents
: ordered list/array of all events in the session (this is well supported in BigQuery, other query languages may need to approach this slightly differently, for example using JSON)
Key point: sessions
has one row for each session a user has had
SELECT * FROM sessions
What kind of questions can this table be used to answer?
How has average session length changed over time?
SELECT
COUNT(*) as user_count
FROM
user_stats
WHERE
created_at >= TIMESTAMP_SUB(current_timestamp, INTERVAL 7 DAY)
What share of sessions contain a transaction?
SELECT
COUNT(*) as user_count
FROM
user_stats
WHERE
transactions_lifetime >= 5
What chain of events lead to a user leaving the app and never returning?
SELECT
COUNT(*) as user_count
FROM
user_stats
WHERE
last_seen_at >= TIMESTAMP_SUB(current_timestamp, INTERVAL 24 HOUR)
Summary
These three well structured tables can help answer most questions:
user_stats
: who is in my user basedaily_user_stats
: how many active users do I have, and how well do they retainsessions
: what journey are users taking through my product
Once you’ve written the queries to combine your other datasets into these three simple tables you should set up a schedule to keep them up to date with any new changes. You’ll soon start relying on these tables for a large share of your analytics.
Dataform provides a fully managed, collaborative workspace to set up schedules and manage your data warehouse. For more information on Dataform check out our documentation, or create an account for free and start using Dataform’s fully managed Web platform.
Originally published at https://dataform.co.