📈 How to build cohorts with SQL

Théo Carrive · Cheerz Engineering · Nov 21, 2018

Here is a short story of how you can extract cohorts with just your PostgreSQL database and a few lines of Ruby.

Cohorts, cohorts, cohorts, cohorts, cohorts…

“Cohorts” is a word that I have heard a lot: from analysts during fundraising, from marketing people, from the product team, from everyone. And actually, it seems quite legit: we all want to know how much our users interact with our platform.

A few years ago, however, we didn’t have any fancy BI tool like Looker, Tableau or Periscope to help us extract those cohorts. We only had one PostgreSQL server, a terminal, and the urge to extract those cohorts.

I won’t go into too much detail about what a cohort is and what it is for; I think others have already covered that well enough.

The model

To make things easier, we will work here with a simplified model of our business logic:

  • 🙂 One table “users”
  • 🛒 One table “orders” with a “paid_at” field and a “user_id” reference to the “users” table.
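
As a rough sketch, the corresponding schema could look like this (only “paid_at” and “user_id” come from the model above; the other columns are illustrative assumptions):

-- Minimal schema sketch; columns other than paid_at and user_id are assumed
CREATE TABLE users (
  id    serial PRIMARY KEY,
  email text NOT NULL
);

CREATE TABLE orders (
  id      serial PRIMARY KEY,
  user_id integer NOT NULL REFERENCES users (id),
  paid_at timestamp  -- when the order was paid
);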

What we want in the end are cohorts for each month, telling us, month by month, the percentage of customers who have turned into repeat customers.

For instance, if we look at the table below, we see that after 4 months, among the customers who made their first order in January, 37% have ordered again at least once, and one month later, 40.4%.

Repeaters cohort: percentage of customers who have > 1 order

1. Generate the time segments

PostgreSQL has a neat feature called “generate_series” that lets us generate all the values in an interval.

SELECT generate_series AS start,
       (generate_series + interval '1 month' - interval '1 second') AS stop
FROM generate_series('2015-01-01 00:00'::timestamp, '2017-10-30', '1 month')

Thanks to this, with a few lines of code, we get a list of time ranges on which to run our study:
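
The first rows of the result look like this (abridged):

        start        |        stop
---------------------+---------------------
 2015-01-01 00:00:00 | 2015-01-31 23:59:59
 2015-02-01 00:00:00 | 2015-02-28 23:59:59
 2015-03-01 00:00:00 | 2015-03-31 23:59:59
 ...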

For our study, we need all the combinations of those time ranges with themselves (a cross join):

WITH time_serie AS (
  SELECT generate_series AS start,
         (generate_series + interval '1 month' - interval '1 second') AS stop
  FROM generate_series('2018-01-01 00:00'::timestamp, '2018-06-30', '1 month')
),
first_order_serie AS (
  SELECT start AS first_order_start, stop AS first_order_stop
  FROM time_serie
)
SELECT * FROM time_serie, first_order_serie
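
With 6 months in the series, the cross join yields 6 × 6 = 36 rows, one per (observation month, first-order month) pair (abridged; row order may vary):

        start        |        stop         |  first_order_start  |  first_order_stop
---------------------+---------------------+---------------------+---------------------
 2018-01-01 00:00:00 | 2018-01-31 23:59:59 | 2018-01-01 00:00:00 | 2018-01-31 23:59:59
 2018-01-01 00:00:00 | 2018-01-31 23:59:59 | 2018-02-01 00:00:00 | 2018-02-28 23:59:59
 ...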

Et voilà! 🎉

2. Select the new users for each time range

Now that we have all the time ranges of our study, for each row of this table we want to select the users whose first order falls within the given interval:

WITH new_customers AS (
  -- "time_table.first_order_start" and "time_table.first_order_stop" refer to
  -- the current row of the time table; this block only runs as part of the
  -- wrapped query of step 4.
  SELECT orders.user_id
  FROM orders
  INNER JOIN (
    SELECT user_id
    FROM orders
    WHERE orders.paid_at BETWEEN
      time_table.first_order_start AND time_table.first_order_stop
  ) AS period_orders
  ON period_orders.user_id = orders.user_id
  GROUP BY orders.user_id
  HAVING MIN(orders.paid_at) BETWEEN time_table.first_order_start AND time_table.first_order_stop
)
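
To try this block on its own, you can substitute literal timestamps for the time_table references (a throwaway sketch, not part of the final query). As a side note, the HAVING clause alone is enough to identify first orders; the inner join above only narrows down the candidate rows for performance:

-- Throwaway sketch: count customers whose first order was in January 2018
WITH new_customers AS (
  SELECT orders.user_id
  FROM orders
  GROUP BY orders.user_id
  HAVING MIN(orders.paid_at) BETWEEN '2018-01-01 00:00:00' AND '2018-01-31 23:59:59'
)
SELECT COUNT(*) FROM new_customers;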

3. Count the percentage of repeaters among the new customers

Now that we have all the customers who made their first order in a given period (in the “new_customers” Common Table Expression), we have to compute the percentage of them who have strictly more than one order before the end of the studied interval:

SELECT 100 * COUNT(*)::float / NULLIF((SELECT COUNT(*) FROM new_customers), 0) AS repeat_percent
-- NULLIF avoids a division by zero for months with no new customers
FROM (
  SELECT COUNT(*), orders.user_id
  FROM orders
  INNER JOIN new_customers
  ON new_customers.user_id = orders.user_id
  WHERE orders.paid_at <= time_table.stop  -- only orders up to the end of the studied month
  GROUP BY orders.user_id
  HAVING COUNT(*) > 1                      -- strictly more than one order: a repeater
) AS repeats

4. Wrap the SQL together 📦

Now that we have all the bricks, we just need to put them together:

SELECT time_table.*, (
  WITH new_customers AS (
    SELECT orders.user_id
    FROM orders
    INNER JOIN (
      SELECT user_id
      FROM orders
      WHERE orders.paid_at BETWEEN time_table.first_order_start AND time_table.first_order_stop
    ) AS period_orders ON period_orders.user_id = orders.user_id
    GROUP BY orders.user_id
    HAVING MIN(orders.paid_at) BETWEEN time_table.first_order_start AND time_table.first_order_stop
  )
  SELECT 100 * COUNT(*)::float / NULLIF((SELECT COUNT(*) FROM new_customers), 0) AS repeat_percent
  FROM (
    SELECT COUNT(*), orders.user_id
    FROM orders
    INNER JOIN new_customers
    ON new_customers.user_id = orders.user_id
    WHERE orders.paid_at <= time_table.stop
    GROUP BY orders.user_id
    HAVING COUNT(*) > 1
  ) AS repeats
)
FROM (
  WITH time_serie AS (
    SELECT generate_series AS start,
           (generate_series + interval '1 month' - interval '1 second') AS stop
    FROM generate_series('2018-01-01 00:00'::timestamp, '2018-06-30', '1 month')
  ),
  first_order_serie AS (
    SELECT start AS first_order_start, stop AS first_order_stop
    FROM time_serie
  )
  SELECT * FROM time_serie, first_order_serie
) AS time_table
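
On a large “orders” table, this query can take a while, since each cell of the cohort scans orders per user and date. An index like the one below usually helps; this is an assumption to verify with EXPLAIN ANALYZE on your own data:

-- Hypothetical index to speed up the per-user, per-date scans
CREATE INDEX index_orders_on_user_id_and_paid_at ON orders (user_id, paid_at);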

5. Export the data

Exporting is a matter of seconds with the great “COPY” command that Postgres provides.

Just wrap the previous query in a COPY statement and download the file (or use the client-side \copy):

COPY (
...
) TO 'output.csv' WITH CSV HEADER DELIMITER ';';
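
Note that COPY resolves its path on the database server. From psql, the client-side \copy variant writes the file to your own machine instead; unlike COPY, it must fit on a single line:

\copy ( ... your query ... ) TO 'output.csv' WITH CSV HEADER DELIMITER ';'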

6. Unflatten the table

Almost done! We just need to rearrange the data, going from a flat table to the output we expect.

Here, we do it with a few lines of Ruby:

require 'csv'
require 'time'

# Load the flat export: one row per (observation month, first-order month) pair
cohort_lines = CSV.
  read('input.csv', col_sep: ';', headers: true).
  map { |line| line.to_h }

months = cohort_lines.map { |row| row['first_order_start'] }.uniq.sort
time_formatter = ->(str) { Time.parse(str).strftime('%B-%Y') }
float_formatter = ->(str) { sprintf('%.1f', str.to_f).gsub('.', ',') }

CSV.open('output.csv', 'wb', col_sep: ';') do |csv|
  # Header row: one column per month of the study
  csv << [nil] + months.map { |str| time_formatter[str] }
  months.each_with_index do |month, i|
    # One row per cohort, dropping the months before the cohort started
    columns = cohort_lines.
      select { |r| r['first_order_start'] == month }.
      sort_by { |r| r['start'] }.
      map { |r| float_formatter[r['repeat_percent']] }.
      drop(i)
    csv << [time_formatter[month]] + columns
  end
end
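
The resulting CSV has one row per cohort, each row holding one value per month from the cohort’s first month to the end of the study, roughly this shape (values elided):

;January-2018;February-2018;March-2018;...
January-2018;...;...;...
February-2018;...;...
March-2018;...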

Tadaaam! 🎉🎉🎉🎉

🕵️ To go further

Once you have this very simple extract working, there is still plenty you can do with it.

Here is what we have done with this system since then:

  • Automate the extract so that it runs every month 🕐
  • Display cohorts of revenue instead of percentages
  • Filter the cohort by country, device, product, and plenty of other dimensions

The only limit then is the imagination of the people you work with!

This way of extracting cohorts might not be the most CPU-efficient technique, but it has the advantage of being simple and easy to customize!

🧠 Conclusion

There are thousands of tools on the market right now to help you better understand your users.

A lot of them are very powerful and really allow non-technical people to analyze user behavior the way they need. But almost none of these tools are free, and most of them are actually very expensive. If you can afford the one that fits your needs, good for you.

But for the others, remember that with a few dozen lines of code, you can still get great results and priceless insights!
