Understanding Recursive SQL for Hierarchical Data Structures

Published in

Learning SQL

9 min readApr 29, 2024

In database management, hierarchical data structures are common when modeling relationships between entities that possess parent-child associations. One prevalent method to represent such relationships is the tree-like data structures, where each record holds a reference to its parent record within the same table.

To analyze these hierarchical structures efficiently, SQL offers recursive queries, a powerful feature available in modern RDBMS. Recursive queries allow you to iterate over hierarchical data, traversing from parent to child and vice versa, enabling operations such as tree traversal, pathfinding, and aggregating hierarchical information.

In this article, we will explore the usage of recursive queries in PostgreSQL to work with data organized in tree-like structures. We’ll walk through the fundamentals of recursive queries and demonstrate their application in querying hierarchical data, providing a practical example to illustrate one of the possible use cases.

Let’s look at how hierarchical relationships are represented within a database. It involves adding an additional column to a table, typically named something like “parent_id,” which establishes a relationship between records in the same table. This “parent_id” column serves as a foreign key that references the primary key of another record within the same table, indicating the hierarchical relationship between the records. By storing the ID of the parent record on each child record, you can create a hierarchical structure where each record can have one parent and zero or more children.

Picture a personal profile in an application — filled with various details like address, which can be primary or secondary and, in both cases, can contain such components as city, street, etc. In our case, it’s about structuring information with parent-child relationships. Each field in the individual profile — be it “City”, or “Street” — is a node in this tree.

Imagine a single table to hold all the profile fields. Each row in this table describes a single field:

| profile_field_id | field_name           | parent_field_id |
|------------------|----------------------|-----------------|
| 62,434           | Address Line 1       | 70,490          |
| 62,898           | City                 | 70,490          |
| 64,058           | State                | 70,490          |
| 61,274           | Contact Name         | [NULL]          |
| 61,506           | Phone Number         | [NULL]          |
| 61,738           | Relationship         | [NULL]          |
| 61,970           | Address              | [NULL]          |
| 63,826           | Country              | 70,490          |
| 64,759           | Middle Name          | [NULL]          |
| 64,991           | Preferred Name       | [NULL]          |
| 65,223           | Previous Last Name   | [NULL]          |
| 64,292           | Home Phone Number    | [NULL]          |
| 64,524           | Work Phone Number    | [NULL]          |
| 72,679           | First Name           | [NULL]          |
| 72,680           | Last Name            | [NULL]          |
| 72,686           | Date of Birth        | [NULL]          |
| 72,683           | Grade                | [NULL]          |
| 70,490           | Home Address         | 61,970          |

profile_field_id: think of this as a unique ID for each piece of information.
field_name: this is the label or title of the field.
parent_field_id: this indicates the relationship between fields. For example, a "Country" field might have a parent ID linking it to the "Home Address" field.

You have surely noticed that some of the fields have parent_field_id, and they are the ones that form the hierarchy. Now, when we want to fetch a user’s address data from the database, we need to assemble it like a puzzle. Basically, we start with the top-level fields (those without a parent) and then gather their children. As you might have guessed, we should arrange this process recursively until we've built the entire profile. For that, we’ll make use of SQL recursive queries.

Most modern RDBMS support recursive queries, typically using common table expressions (CTEs) with the WITH RECURSIVE clause. Here's a breakdown of the syntax:

WITH RECURSIVE cte_name AS (
    -- Anchor member: The starting point of the recursion
    SELECT columns
    FROM table
    WHERE [condition]
    
    UNION ALL
    
    -- Recursive member: The part of the query that refers back to the CTE itself
    SELECT columns
    FROM table
    JOIN cte_name
     ON [recursive_condition]
    WHERE [condition]
)
-- Main query that utilizes the CTE
SELECT *
FROM cte_name;

In a recursive query, think of UNION ALL as the glue that sticks together different levels of a hierarchy. It’s like adding links to a chain: each link connects to the next, slowly revealing more layers of complexity until the entire hierarchy is disclosed.

When we merge the outcomes from the starting point of our recursion (the anchor member) with the part of the query that refers back to itself (the recursive member), UNION ALL guarantees that every step adds to building the entire hierarchy.

Let’s move on to a practical example start with fetching the necessary data we are going to process recursively. Note that in the initial CTE we will need to use join to the table that stores the values for the fields, without this our hierarchy will be incomplete.

WITH RECURSIVE a as (
     SELECT
      epfv.employee_id,
      pf.profile_field_id,
      pf.name as field_name,
      epfv.field_value,
      pf.parent_field_id
     FROM profile_field pf
     LEFT JOIN employee_profile_field_value epfv
      ON pf.profile_field_id = epfv.profile_field_id
    ),
    
    ...
    ...

| employee_id | profile_field_id | field_name         | field_value      | parent_field_id |
|-------------|------------------|--------------------|------------------|-----------------|
| 11,212      | 62,434           | Address Line 1     | Street           | 70,490          |
| 11,212      | 62,898           | City               | City             | 70,490          |
| 11,212      | 64,058           | State              |                  | 70,490          |
| 11,212      | 61,274           | Contact Name       | Test             | [NULL]          |
| 11,212      | 61,506           | Phone Number       | (029)111-22-33   | [NULL]          |
| 11,212      | 61,738           | Relationship       | Son              | [NULL]          |
| 11,212      | 61,970           | Address            | [NULL]           | [NULL]          |
| 11,212      | 63,826           | Country            | United Kingdom   | 70,490          |
| 11,212      | 64,759           | Middle Name        | Test             | [NULL]          |
| 11,212      | 64,991           | Preferred Name     | Test             | [NULL]          |
| 11,212      | 65,223           | Previous Last Name | Test             | [NULL]          |
| 11,212      | 64,292           | Home Phone Number  | (020)666-55-44   | [NULL]          |
| 11,212      | 64,524           | Work Phone Number  | (029)111-22-33   | [NULL]          |
| 11,212      | 72,679           | First Name         | 110              | [NULL]          |
| 11,212      | 72,680           | Last Name          | 110              | [NULL]          |
| 11,212      | 72,686           | Date of Birth      | 2001-11-11       | [NULL]          |
| 11,212      | 72,683           | Grade              | FY1              | [NULL]          |
| 11,212      | 70,490           | Home Address       | [NULL]           | 61,970          |

You might have noticed the RECURSIVE clause and wondered, where's the recursion in that? Well, the thing is, in SQL, when you use multiple common table expressions (CTEs) in a query, only the first CTE can include the RECURSIVE keyword. Subsequent CTEs in the same query will automatically inherit the recursive behavior from the first one.

That means that you don’t need to specify the RECURSIVE keyword again for any subsequent CTEs. Once the first CTE is declared as recursive, all subsequent CTEs within the same query will be treated as recursive by default.

Now, to the trickiest part. We want to retrieve all the fields and their nested children for a given field, no matter how many levels deep we have to go — using a recursive query, we could accomplish this task efficiently.

In the following code snippet, parent_chain is the name of our recursive CTE. You can notice that from the fact that it actually references itself. The anchor member (upper part) selects the fields of the deepest level (those without a single child and having some parent); these nodes represent the starting point of our recursive exploration. The recursive member (lower part) joins these fields with their parents referencing itself in the join iteratively until all nested levels are retrieved so that we accumulate the hierarchy for the field from the bottom to the top as we go further and union subsequent nodes.

...
...

parent_chain AS (
   SELECT 
       a.employee_id,
       a.parent_field_id,
       jsonb_agg(jsonb_build_object(field_name, field_value)) as json_object
    FROM a
    JOIN profile_field parents 
      ON a.parent_field_id = parents.profile_field_id
    LEFT JOIN profile_field children
      ON a.profile_field_id = children.parent_field_id
    WHERE children.parent_field_id is null
    GROUP BY 1, 2
        
    UNION ALL
        
    SELECT 
       t.employee_id,
       t.parent_field_id,
       jsonb_build_object(t.field_name, pc.json_object) as json_object
    FROM a t
    JOIN parent_chain pc 
      ON t.profile_field_id = pc.parent_field_id
    GROUP BY 1, 2, 3
)
  
...
...

The first part is the anchor part of the recursive query. It selects records from the previous step and creates JSON objects for the deepest hierarchy level. Take a good look; it joins a with the profile_field table twice:

once to get the parent fields, to collect parent field names for each field used in the second part,
and once to get the child fields, to filter out only records that have no children (remember, we are solving the tree from the “leaves” or the bottom-most level of the hierarchy).

Then we create JSON objects like {"Country": "United Kingdom"} using the jsonb_build_object function and aggregate them into a single JSON array with jsonb_agg. This JSON array represents the collection of key-value pairs for a particular node in the hierarchy. We group by the field associated with the parent because we want to gather all the children at the same level:

[{"Address Line 1": "Street"}, {"Country": "United Kingdom"}, {"State": ""}, {"City": "City"}]

In the end, it is your choice in what format you want to get the resulting hierarchical data, and it usually depends a lot on what options you have for integrating it. I chose to JSONify the obtained data just because JSON provides a versatile and flexible format for representing hierarchical data and is widely supported across different programming languages, platforms, and systems. By doing so, you gain the ability to capture the nested relationships in a natural and intuitive way.

The second part is actually where recursive querying is happening. The a CTE is joined with the parent_chain itself as many times as there are levels in the hierarchy. It selects rows based on the parent-child relationship until no more matches are found. With json_build_object, it then constructs new JSON objects containing nested JSON objects obtained in the first part, like so:

{"Home Address": [{"Address Line 1": "Street"}, {"Country": "United Kingdom"}, {"State": ""}, {"City": "City"}]}

Here is the result of executing the parent_chain CTE:

| employee_id | parent_field_id | json_object                                                                                                                   |
|-------------|-----------------|-------------------------------------------------------------------------------------------------------------------------------|
| 11,212      | 70,490          | [{"Address Line 1": "Street"}, {"Country": "United Kingdom"}, {"State": ""}, {"City": "City"}]                                |
| 11,212      | 61,970          | {"Home Address": [{"Address Line 1": "Street"}, {"Country": "United Kingdom"}, {"State": ""}, {"City": "City"}]}              |
| 11,212      | [NULL]          | {"Address": {"Home Address": [{"Address Line 1": "Street"}, {"Country": "United Kingdom"}, {"State": ""}, {"City": "City"}]}} |

What did we get here? The resulting table represents distinct levels of the hierarchical structure. Each record encapsulates a single level containing the values of all nested levels beneath it. In essence, these records serve as building blocks, progressively assembling the complete hierarchy from bottom to top.

What we have left to do is select only the complete hierarchies from the obtained result. And we can do this by filtering only those records that don’t have a parent (parent_field_id IS NULL). Yes, it’s that simple.

...
...

SELECT
   json_object
FROM parent_chain
WHERE parent_field_id is null

| json_object                                                                                                                   |
|-------------------------------------------------------------------------------------------------------------------------------|
| {"Address": {"Home Address": [{"Address Line 1": "Street"}, {"Country": "United Kingdom"}, {"State": ""}, {"City": "City"}]}} |

Today, we touched on recursive queries and their first but not the only application area — building out hierarchies from tree-like data structures. Beyond that, they are utilized for sequence generation, pathfinding, and many other things. And that’s really beneficial because you can perform complex operations directly within the database engine, minimizing data transfer and processing overhead. The whole point is to be able to put the hierarchy with an unlimited number of levels together using only SQL’s power.

Understanding Recursive SQL for Hierarchical Data Structures

Written by Igor Plotnikov