Recursion in SQL? But why? Oh, there are many uses for that. It’s common to store hierarchical data in SQL and recursive queries are a convenient way to extract information from such graphs. Organizational structure, application menu structure, a set of tasks with sub-tasks in the project, links between web pages, breakdown of an equipment module into parts and sub-parts are examples of the hierarchical data. The post will not go into great details of those many use cases rather look at two toy examples to understand the concept - the simplest possible case of recursion on numbers and querying data from the family tree.
Let's think about queries as a function. In a sense that a function takes an input and produces an output. Queries operate on relations or one could say tables. We will denote those as Rn. Here is a picture of a query. It takes three relations R1, R2, R3 and produces an output R. Simple enough.
SELECT <something> FROM R1, R2, R3 WHERE <condition>
Query can take something and produce nothing:
SELECT <something> FROM R1 WHERE 1 = 2
Take nothing and produce something:
Or take nothing and produce nothing
SELECT 1 WHERE 1 = 2
Recursion is achieved by
WITH statement, in SQL jargon called Common Table Expression (CTE). It allows to name the result and reference it within other queries sometime later.
Here is a sample.
WITH R AS (SELECT 1 AS n)
SELECT n + 1 FROM R
(SELECT 1 AS n) now have a name —
R. We refer to that name in
SELECT n + 1 FROM R. Here
R is a single row, single column table containing number 1. The result of the whole expression is number 2.
The recursive version of
WITH statement references to itself while computing output.
For the recursion to work we need to start with something and decide when the recursion should stop. To achieve this, usually recursive
with statement has following form.
Important to note that base query doesn’t involve R, but recursive query references R. From the first look it seems like infinite loop, to compute R we need compute R. But here is a catch. R actually don’t reference itself, it just references previous result and when previous result is empty table, recursion stops. Actually it could help to thing of it as an iteration rather then recursion!
Here’s what is happening: base query executed first, taking whatever it needs to compute the result R0. Second recursive query is executed taking R0 as input, that is R references R0 in the recursive query when first executed. Recursive query produces the result R1 and that is what R will reference to at the next invocation. And so on until recursive query returns empty result. At that point all intermediate results are combined together.
Quite abstract now. Lets take a concrete example, count until 3.
Base query returns number
1 , recursive query takes it under the
countUp name and produces number
2, which is the input for the next recursive call. When recursive query returns empty table (
n >= 3), the results from the calls are stacked together.
Watch out, counting up like that can only go that far. There is a limit for recursion. It defaults to 100, but could be extended with
MAXRECURSION option (MS SQL Server specific). Practically, it could be a bad idea to crank recursion limit up. Graphs might have cycles and limited recursion depth can be a good defense mechanism to stop poorly behaving query.
OPTION (MAXRECURSION 200)
Here’s another example, find ancestors of a person:
Base query finds Frank’s parent — Mary, recursive query takes this result under the
Ancestor name and finds parents of Mary, which are Dave and Eve and this continues until we can’t find any parents anymore.
Now this tree traversal query could be the basis to augment the query with some other information of interest. For example, having a birth year in the table we can calculate how old the parent was when the child was born. Next query do exactly that, together with showing lineages. To do that it traverses the tree from top to bottom.
parentAge is zero in the first row because we don’t know when Alice was born from the data we have.
Take away — recursive query references the result of base query or previous invocation of recursive query. Chain stops when recursive query returns empty table.
In the follow-up post we’ll take an algebraic view on SQL recursion and will look into recursive stored procedures.
[UPDATE] Post updated with comments from kagato87 and GuybrushFourpwood reddit users.
[NOTE] Code samples are for MS-SQL. Other DBMS could have slightly different syntax.