Storing and retrieving tree structures in relational databases using Python (Django)
We all at some point bump into this problem, and there are multiple approaches to solve it when using a relational database. In this article I will be explaining Adjacency list, Nested sets and Materialized paths. I will also dig deeper into what schema is used, what are the pros and cons, django-apps that can be used for the same etc.
- Product Categories in E-commerce
- Organization department structures
- Lessons in a course
- Directory structure
These are few examples where we have to store entities in hierarchy. At any point we would want the whole or part of the tree of the entity, get siblings of the entity, get ascendants or descendants, move the entity around the family tree etc. Let us now call these entities as nodes. Hence, to be able to store the nodes with their tree structure and be able to efficiently retrieve them as needed, we will use Adjacency list, Nested sets and Materialized paths. Refer following node structure which we want to store in our database for this article:
If I had to represent the above structure in a python list, it would look something like:
And schema for the same would be storing a self-referencing foreign key. An additional attribute can be added which can describe the order of the sibling at a particular level.
Writes are pretty straight forward as you will have to assign parent to each node that is added. Lets look at how django-treebeard’s api for adjacency list works when it comes to retrieving whole tree. If you look at the source code of the function get_tree() it looks like:
So when you are getting a tree, in adjacency list implementation, we are recursively querying for each node. Thus a lot of queries will be fired depending on number of trees in your node. Thus making the reads expensive. This process can be optimized further if you use RECURSIVE queries in the database itself. Inserts and updates are pretty straight forward, as every node refers to its parents when added. If no parent is specified, that node is considered as the root node. Thus making the writes simple and straight forward. Lets look at Nested sets implementation now.
The nested set model is to number the nodes according to a tree traversal, which visits each node twice, assigning numbers in the order of visiting, and at both visits. This leaves two numbers for each node, which are stored as two attributes. Lets name these attributes left and right.
Schema can also include an attribute to identify the tree and also depth of the tree. If we use this schema and store the above tree in it, it would look something like this:
Now if at all we want to retrieve the whole tree or a part of the tree, we can make use of these attributes. For eg. I want the sub-tree which starts from N7. The left of N7 = 3 and right of N7 = 8. Thus I can simply command to retrieve all the nodes where left > 3 and right < 8.
Lets dig deeper in to django-treebeard’s implementation of the method get_tree() for Nested sets:
As you can see ‘ range ’ is used on ‘ lft ’ attribute to get the whole tree. Thus Nested set has an advantage over the above Adjacency list where in multiple queries where made to get the tree. We can accomplish this here with a single query. But things get very complicated when you move nodes around the tree or insert a new node in any sub-part, as from that node, till the end of the tree, for every node, left and right are recalculated. This is the reason, insert and updates are expensive with this approach. There is one more django-app which has a better approach to Nested sets known as django-mptt. If using Nested Sets, this plugin is recommended as they have optimized queries well. Also they have hooks if ever the tree gets corrupted.
In a materialized path approach, every node in the tree will have a path attribute, where the full path from the root to the node will be stored.
Now if we also include the depth and number of the child under that parent, the schema along with above diagrams data would look something like this:
I have used django-treebeard’s implementation of materialized path, thus the path has above format. In this approach, the ‘ like ’ is used when we are querying for the whole tree or part of a tree. Let’s look at the source code if get_tree() in django-treebeard’s implementation:
As you can see, django orm’s ‘ startswith ’ is used on the attribute path. Thus if we want all the descendants of ‘ N3 ’ , the query would ask for all the nodes whose path looks like ‘00010001’:
SELECT * FROM “content_nodemp” WHERE (“content_nodemp”.”path”::text LIKE ‘00010001%’ AND “content_nodemp”.”depth” >= 2
If you think LIKE’s are expensive, the path can be indexed for better performance. Django-treebeard says: “If you think that LIKE is too slow, you’re right, but in this case the path field is indexed in the database, and all LIKE clauses that don’t start with a %character will use the index.”
Thus our reads have a good level of optimization too with this approach. The inserts do require updates to certain nodes paths, but not a chain of them. This gets expensive when the whole tree is moved. But I would still consider this approach pretty much balanced.
- Use Adjacency List if your application has a lot of writes. Django-treebeards uses recursion at application level. This can be further optimized if recursive queries are used in database directly to avoid the round-trips.
- Use Nested Sets if your reads are way more than your writes. Would recommend django-mptt to implement this.
- If you have a good combination of reads and writes Materialized path is the way to go.