How We Scaled Rendering and Searching a Tree Structure (100K Nodes)
In this post, we talk about the different techniques we tried to render a tree as it grew from 5K to 100K nodes, and to provide search functionality over it.
Problem — We want to visualize the directory structure of a remote storage (S3 blob storage) and provide a search feature too.
In an S3 bucket, we store files under a unique cluster ID. Each cluster folder can have the same set of files with different content, and users generally know the cluster whose files they want to view. (The files can be service logs, configurations, etc.)
So the structure of the bucket is represented as —
bucket_name / cluster_id / files-*
Hence, here is what the data for a cluster looks like —
We started by designing the pipeline assuming 5K~10K nodes (files) in the system.
We grew with the scale, as we didn’t want to over-engineer the solution beforehand. We will talk about each scale that we hit, and cover these 3 points for each —
Finally, we compare the performance of all the approaches.
Scale ~ 10K
At 10K nodes, the data set was not large. We stored the file paths in MongoDB, indexed by cluster_id. Here is what a document looked like —
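The original document is shown as an image in the post; as a rough sketch (field names here are assumptions, not the actual schema), a document at this stage might look like:

```typescript
// Hypothetical shape of a MongoDB document at this stage: cluster_id is the
// indexed field, and every full path is stored as a plain string in an array.
const clusterDoc = {
  cluster_id: "cluster-1234",
  file_paths: [
    "bucket_name/cluster-1234/configs/app.yaml",
    "bucket_name/cluster-1234/logs/service-a.log",
    "bucket_name/cluster-1234/logs/service-b.log",
  ],
};
```

Note that `bucket_name` is repeated in every string; this redundancy becomes a problem later.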
The backend was simple: we just queried by cluster and fetched all the file paths at the client end.
For search, we relied entirely on the front-end: it read all the data, constructed a tree, and handled the search. Let’s see how we constructed a tree out of these strings.
We used a trie, with each node representing a token (folder) in the complete path and “/” as the separator. Let’s see how it works —
This was the basic trie we built, and the nodes stored some meta-information about their children/parent.
Below we see the data structure definition of the Trie, and Trie Node that we used.
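The original definitions are shown as an image; a minimal TypeScript sketch of such a Trie / TrieNode (field names illustrative, not the exact production code) could look like:

```typescript
// A trie node: one path segment (folder or file) plus some meta-information.
class TrieNode {
  key: string;                      // this path segment, e.g. "logs"
  title: string;                    // label rendered in the UI
  isLeaf: boolean;                  // true for files, false for folders
  children: Map<string, TrieNode>;  // child segment -> node

  constructor(key: string) {
    this.key = key;
    this.title = key;
    this.isLeaf = true;
    this.children = new Map();
  }
}

class Trie {
  root = new TrieNode("");

  // Insert a full path like "bucket_name/c1/logs/a.log", splitting on "/".
  insert(path: string): void {
    let node = this.root;
    for (const segment of path.split("/")) {
      if (!node.children.has(segment)) {
        node.children.set(segment, new TrieNode(segment));
      }
      node.isLeaf = false;          // a node with children is a folder
      node = node.children.get(segment)!;
    }
  }
}
```

Each file-path string from the API response is fed through `insert`, and shared prefixes naturally collapse into shared branches.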
Note that the frontend application is written in React, and we use the Ant Design framework: https://ant.design/
The search was implemented as given in the official docs: https://ant.design/components/tree/
Scale ~ 30K
We didn’t change anything in the storage or the backend API when we hit this scale. However, the browser was becoming unresponsive during loading and search; the trie load and search needed optimization.
What we optimized —
- Loading — We chose not to expand nodes that had more than 5 children by default. This reduced the DOM render time.
- Search — If you look at the search algorithm in the official docs, it is unoptimized. Let’s see what they have implemented and how we can improve on it.
Note: The user can search for both files (leaves) and folders.
Official docs algorithm —
- Create a list of (key, title) tuples for the whole trie.
- Run the query against all the keys.
- Get parents of those matched keys, and expand them.
Complexity of a query is O(trie_size*string_match + matches*trie_size). The second term comes from walking the trie again to find each match’s parents.
We optimized step 3 by storing the parent beforehand, extending the tuple to (key, title, parent).
So the new complexity is O(trie_size*string_match + matches)
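A sketch of the optimized lookup, assuming the trie has already been flattened into (key, title, parent) tuples (names are illustrative):

```typescript
// One entry per trie node, produced once when the trie is built.
interface FlatNode {
  key: string;            // unique key of the node
  title: string;          // display name matched against the query
  parent: string | null;  // parent key, precomputed during flattening
}

// Returns the set of keys to expand so every match is visible.
function search(flatList: FlatNode[], query: string): Set<string> {
  const expandedKeys = new Set<string>();
  for (const node of flatList) {
    // O(string_match) per node; the parent lookup is now O(1) because it was
    // stored up front, instead of an O(trie_size) walk per match.
    if (node.title.includes(query) && node.parent !== null) {
      expandedKeys.add(node.parent);
    }
  }
  return expandedKeys;
}
```

The resulting key set maps directly onto the `expandedKeys` prop of the Ant Design Tree component.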
This optimization drastically reduced the query time, and no further issues were faced at this scale.
Scale ~ 50K
At this scale, our DB connections were throwing errors inside the service while storing the data. We observed that storing the file paths as an array of full strings was redundant and inflated the document size (bucket_name was repeated in every string).
We decided to store the data as a radix tree (we could have used a trie as well, but decided to go one step further). A radix tree is a compressed trie: every chain of single-child nodes is merged into one node, so common path prefixes are stored only once.
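As an illustration of the compression idea (not the production code; the node shape is assumed), merging single-child chains might look like:

```typescript
// Minimal radix-tree node: a (possibly merged) path segment and its children.
interface RNode {
  key: string;
  children: Map<string, RNode>;
}

// Compress a plain trie into a radix tree: while a node has exactly one
// child, fold that child's key into the node's key.
function compress(node: RNode): RNode {
  while (node.children.size === 1) {
    const child = Array.from(node.children.values())[0];
    node = { key: node.key + "/" + child.key, children: child.children };
  }
  // Recurse into the remaining (branching) children.
  const merged = new Map<string, RNode>();
  for (const child of Array.from(node.children.values())) {
    const c = compress(child);
    merged.set(c.key, c);
  }
  return { key: node.key, children: merged };
}
```

For our bucket layout, this collapses `bucket_name/cluster_id` prefixes into single nodes, which is exactly the redundancy that was bloating the documents.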
The API also saw a reduction in response times.
On the frontend, we kept processing the data in the same way, as a radix tree would not be intuitive for a directory structure. We still construct and store the (key, title, parent) tuples for efficient search.
Scale ~ 100K
The frontend was struggling, and the API response times were also getting high. We listed the issues as follows —
- DB is a bottleneck for our use case.
- Loading complete data at once won’t help.
- Search won’t work efficiently on the client-side with the current algorithm.
First, we moved our data to ElasticSearch. We also changed the data structure used to store it.
Before ingestion into ElasticSearch, we split the trie into “sub-tries”. The criterion: if a node has more than 5 immediate children, that node becomes the root of a new sub-trie. Each sub-trie also carries a parent_meta_info string, the keys from the original root down to itself appended together. And each parent of a sub-trie root (the virtual leaf where the split happened) stores the sub-trie root keys as meta-info.
The above steps were implemented with a simple depth-first traversal.
How does it all work?
On the first render, the actual root trie of the cluster is fetched and displayed as a directory (it does contain a lot of sub-trie endings). If a user manually navigates to a sub-trie parent, we show them the sub-trie root keys stored in that node’s meta. If the user clicks on one of them, we fetch the sub-trie and render it, rebuilding the path from the actual root node (with no other paths/children) by splitting the parent_meta_info string.
Now when a user queries, we hit ElasticSearch and get all the sub-tries in which the term is present. The API returns only the `root to match` node paths, and on the client side we build the trie from those paths. We also limit this list to the first 5K, as short queries (e.g. query = “a”) might match many nodes.
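As a sketch, the query body sent to ElasticSearch might look like the following (the index mapping, field names, and the choice of a wildcard query are assumptions, not our actual schema):

```typescript
// Hypothetical ElasticSearch query body for a substring search on node keys.
const searchBody = {
  size: 5000, // cap the hits: short queries like "a" can match many nodes
  query: {
    wildcard: {
      node_key: { value: "*service*" }, // substring match on node keys
    },
  },
};
```

The server runs this against the sub-trie index and, from each hit, returns only the root-to-match path rather than the whole sub-trie.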
We can render the initial trie in under a second. The search now incurs network latency, but still completes in under 2 seconds (~800 ms for curl on the server machine) and outperforms the old client-side search, which used to make the browser unresponsive.
That’s all folks. Feel free to leave a comment below.