Unpack Hierarchical Data with Recursive CTEs: An Attempt at an Intuitive Explanation

The niche SQL trick with surprisingly common use cases

Charlotte Meng
Dandy Engineering, Product & Data Blog
4 min readJan 13, 2023

--

Photo by Sunder Muthukumaran on Unsplash

One of the coolest things I’ve learned while growing my SQL muscles at Dandy has been the recursive CTE.

Its implementation isn’t intuitive; its use cases not immediately apparent. But, I promise you, it has come in handy for gathering critical pieces of information for our business.

Take for example employees in a company, nested in a hierarchy of managers. Often the only data point you have for each employee is just who they report to. (Maybe your employee data is more sophisticated than this — but keep reading. You’ll soon realize this is a common problem in other raw datasets, that you probably tried solving with a Jinja loop in your existing SQL.)

Excluding the CEOs, you’ll inevitably want to see who is the highest manager on the chain of employees. Maybe your company doesn’t yet have access to tools that can visualize hierarchies for you, and you need a quick and easy way to grab this data.

Well trust me, the recursive CTE is quick and easy.

To give you a conceptual understanding of the recursive CTE (and feel free to skip this part if you just want to know the code so you can plug and chug), it resembles a proof by induction. I.e. in order to prove something that exhibits an obvious pattern, you start by seeding it with initial conditions. Maybe you want to prove that the sequence generated by the recursive formula

the generated sequence

is determined by the explicit rule

(Where n = 0,1,2,3,…)

The first step is to state the first term of the sequence (the “initial conditions”, let’s say). Let x_1 be 1. We then take on the hypothesis that indeed x_n=2^n is true. If that’s true THEN it must also be true that x_(n+1) equals 2^(n+1), and we bear the onus of showing this. It turns out that x_{n+1} = x_n*2 = 2^n*2=2^(n+1), and so we have proved by induction that the recursive rule x_n = 2(x_{n-1}) generates the same sequence as x_n = 2^n.

Understanding how recursion works is critical to understanding how the recursive CTE works. Like the proof by induction, you create a CTE with an “initial conditions” seeding and then declare a recursive condition. In the recursive CTE, the recursive condition is an INNER JOIN.

So let’s go back to our employee hierarchy example:

Employee A reports to Employee B.

Employee B reports to Employee C.

Employee C reports to no one.

We don’t know how far back the chain goes, but we are given the chaining rules.

WITH employees AS (
SELECT * FROM database.schema.employees
)

, unpack_hierarchy AS (
SELECT
employee_id AS topmost_employee_id
, employee_id
, NULL AS reports_to
, 1::INTEGER AS nth_in_sequence
FROM employees
WHERE reports_to IS NULL
--this is the initial condition, we seed the CTE with the first employee in the hierarchy

UNION ALL

SELECT
unpack_hierarchy.topmost_employee_id
, employees.employee_id
, employees.reports_to
, unpack_hierarchy.nth_in_sequence + 1 AS nth_in_sequence
FROM unpack_hierarchy
INNER JOIN employees
ON unpack_hierarchy.employee_id = employees.reports_to
--this inner join clause is the recursive condition
)

To start the recursive CTE, we give the initial condition. This might either be the topmost employee or the bottommost employee, depending on what data you have available to you. Because of the specific chaining rules we’re given here, it’s easiest to identify the topmost employee since they don’t report to anyone.

Then we union the recursion onto the initial condition. The recursive rule is to join the CTE onto itself wherever its employee_id has a matching reports_to in the employees table. It will keep joining onto itself until the INNER JOIN stops returning rows, making sure that the table will only loop for as many times as necessary, despite the system not knowing how long the chain of command will be.

Because every INNER JOIN preserves the topmost_employee_id, we can do fun window functions like

, levels_of_command AS (
SELECT
*
, MAX(nth_in_sequence) OVER (PARTITION BY topmost_employee_id) AS levels_of_command
FROM unpack_hierarchy
)

so we know exactly how far back the chain goes for every employee.

This has proven to be the most elegant solution whenever we needed to unpack hierarchical data at Dandy. Need to track a chain of orders for something that had to be remade several times? Recursive CTE. A bunch of Salesforce IDs are linked to each other sequentially but you need to know the last one? Recursive CTE. Hierarchical information is a necessary way to organize the world, and the recursive CTE is the most natural choice for processing that data. You can also get fairly creative with the recursive CTE, so I encourage you to adopt it into your repertoire of SQL tools. It’ll come in handy one day, I’m sure.

--

--