Augmenting Red Black Trees

Borzoo Esmailloo · Published in The Startup · May 2, 2020

There’s rarely a need for inventing new data structures. In most situations, using a textbook data structure will be sufficient. We don’t even need to worry about implementing them since common data structures are provided as part of standard class libraries of programming languages.

That being said, there are situations where none of the textbook data structures is quite fit to get the job done. With a little creativity, an existing data structure can be customized to serve our needs: we augment it with extra bits of information and new operations, fitting it to the problem at hand.

In this post, we will describe the general process of augmentation and then go over two examples of it. We use Red Black trees here, but the ideas can be applied to any data structure.

The Process

The general process of augmenting a data structure consists of four steps:

  1. Pick a data structure
  2. Determine the extra information that needs to be added to make it suitable for the problem at hand
  3. Define a strategy for maintaining the extra information after modifications (insert, delete, edit)
  4. Add extra operations

The third step is usually tricky to get right because the goal is to keep the existing operations as efficient as they were before augmentation. For example, if a delete operation takes O(lg n) and maintaining the extra information after deletion takes O(n), then the delete operation for the augmented data structure will take O(n).

For Red Black trees, we can follow a simple rule to avoid this problem altogether: if a node’s extra information can be updated in constant time, using only the contents of the node itself and its children, then the operations can be kept as asymptotically efficient as on a plain Red Black tree. The idea is that a change to a node only affects the nodes above it, and since there are only O(lg n) nodes above any node, the time complexity of the operations stays the same.

On the contrary, if changing a node causes changes in nodes other than its ancestors, a single change can potentially affect n other nodes, increasing the time complexity of the operations to Ω(n).
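
To make the rule concrete, here’s a minimal sketch in Python, using sub-tree size (the augmented field we’ll introduce shortly) as the extra information. The Node class is illustrative, not taken from any particular library:

    class Node:
        def __init__(self, key, left=None, right=None):
            self.key = key
            self.left = left
            self.right = right
            self.size = 1  # augmented field: number of nodes in this sub-tree
            update_size(self)

    def update_size(node):
        # The augmented field is recomputed from the children alone, in O(1).
        # After an insert, delete or rotation, we call this on the changed
        # node and then on each of its O(lg n) ancestors, bottom up.
        left_size = node.left.size if node.left else 0
        right_size = node.right.size if node.right else 0
        node.size = left_size + right_size + 1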

Examples

In this section we will go over two examples of augmenting Red Black trees. One of the main advantages of Red Black trees is their insert and delete efficiency while keeping the elements sorted. This makes Red Black trees an ideal data structure for dynamic sets of data, that is, when insertions and deletions are intermixed with lookups on the underlying data.

Order Statistic Trees

The selection problem is the following:

Given an unordered set of values and a number k, find the kth smallest element of that set

This problem can be solved efficiently for static data sets — data sets that don’t change — by using an array and the Quick Select algorithm. This algorithm has expected running time of O(n), which is optimal.
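
For reference, here’s a minimal Quick Select sketch (recursive and not in-place, unlike the usual textbook version, but it shows the expected O(n) idea of recursing into one side of a random pivot):

    import random

    def quick_select(values, k):
        # Return the kth smallest element (1-indexed) of an unordered list.
        pivot = random.choice(values)
        smaller = [v for v in values if v < pivot]
        equal = [v for v in values if v == pivot]
        if k <= len(smaller):
            return quick_select(smaller, k)
        if k <= len(smaller) + len(equal):
            return pivot
        return quick_select([v for v in values if v > pivot],
                            k - len(smaller) - len(equal))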

But what if we’re dealing with a dynamic data set? If a (dynamic) array is used, insertions and deletions can potentially take O(n), so if we need to perform m operations, the total running time of using Quick Select will be O(mn).

Red Black trees have O(lg n) lookup, insert and delete operations. While searching for a key is efficient, there’s no obvious way to find the kth smallest element. The naive approach would be to perform an in-order walk of the tree and keep track of the number of elements seen. This approach takes O(n), which is no better than Quick Select.

A more efficient solution is to keep track of the size of each node, that is, the number of nodes in the sub-tree rooted at that node. Knowing the size of a node, we can decide in constant time whether to stop the search because we’ve found the kth smallest element, or to continue it in the left or right branch, and consequently solve the problem in O(lg n). The algorithm is straightforward (a sketch follows the list):

For each node, starting from the root:

  • If k is equal to its left sub-tree’s size + 1, then the node holds the kth smallest element
  • If k is smaller than its left sub-tree’s size + 1, find the kth smallest element in the left sub-tree
  • Otherwise, find the (k − (left sub-tree size + 1))th smallest element in the right sub-tree.
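
Here is that search as a Python sketch, assuming each node carries the size field maintained as in the earlier snippet:

    def select(node, k):
        # Return the node holding the kth smallest key (1-indexed),
        # or None if k is out of range.
        while node is not None:
            left_size = node.left.size if node.left else 0
            if k == left_size + 1:
                return node
            if k < left_size + 1:
                node = node.left
            else:
                k -= left_size + 1  # skip the left sub-tree and this node
                node = node.right
        return None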

Since the size of each node depends only on the size of its left and right children, we can maintain this information in constant time for each node, thus the insert and delete operations will remain O(lg n).

Let’s call this operation Select. We can introduce another operation, called Rank, which determines the position of a given node in the sorted order of the set. A tree that supports the Select and Rank operations is called an Order Statistic Tree.
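
Rank can be sketched under the same assumptions: walk down from the root, accumulating the number of keys known to be smaller along the way:

    def rank(root, key):
        # Return the 1-based position of key in the sorted order of the
        # set, or None if the key is not in the tree.
        position = 0
        node = root
        while node is not None:
            left_size = node.left.size if node.left else 0
            if key < node.key:
                node = node.left
            elif key > node.key:
                position += left_size + 1  # the left sub-tree and this node
                node = node.right          # are all smaller than key
            else:
                return position + left_size + 1
        return None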

The total running time of m operations will be O(m lg n), which is a significant improvement over the O(mn) running time of Quick Select for large sets.

Interval Trees

We’re facing the following problem:

Given a set of intervals and an interval k, determine if k overlaps with any interval within the set

For static sets, the solution is simple. We can store the intervals in an array and find the conflicting intervals in O(n) time. Even if we need to perform m queries, we can answer each in O(lg n) by sorting the intervals by their starting points, precomputing the maximum end point over every prefix, and then binary searching, for a total time of O(m lg n). On the other hand, if the set is dynamic, we run into the same problems as in the previous example and end up with O(mn).
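
As a side note, here’s a sketch of that static approach, treating intervals as closed (start, end) pairs; the helper names are illustrative:

    import bisect

    def build_static_index(intervals):
        # Sort by starting point and precompute the maximum end point
        # over every prefix of the sorted order.
        intervals = sorted(intervals)
        starts = [s for s, _ in intervals]
        prefix_max_end = []
        best = float("-inf")
        for _, end in intervals:
            best = max(best, end)
            prefix_max_end.append(best)
        return starts, prefix_max_end

    def static_overlaps(starts, prefix_max_end, lo, hi):
        # An interval (s, e) overlaps [lo, hi] iff s <= hi and e >= lo.
        # The intervals with s <= hi form a prefix of the sorted order,
        # so one of them overlaps iff that prefix's maximum end is >= lo.
        count = bisect.bisect_right(starts, hi)
        return count > 0 and prefix_max_end[count - 1] >= lo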

We know that Red Black trees offer O(lg n) inserts and deletes. If we could also store intervals in them in a way that lets us decide, at each node in constant time, whether a potentially conflicting interval lies in the left or the right branch, we would end up with O(m lg n) time for performing m operations. This is a significant improvement over O(mn) for arrays.

To be able to reliably decide which branch to choose when a node doesn’t overlap with interval k, we need to do two things:

  1. Use the starting point of each interval as its node’s key. By doing this, an in-order walk of the tree returns the intervals sorted by their starting points.
  2. In addition to storing the intervals, keep track of the maximum end point of any interval in the sub-tree rooted at each node. We’ll call it maxEnd (see the sketch after this list).
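
A minimal sketch of such a node, with maxEnd maintained from the children alone (the class and field names are illustrative):

    class IntervalNode:
        def __init__(self, start, end, left=None, right=None):
            self.start = start  # the interval's starting point doubles as the key
            self.end = end
            self.left = left
            self.right = right
            self.max_end = end
            update_max_end(self)

    def update_max_end(node):
        # maxEnd depends only on the node's own end point and its
        # children's maxEnd, so it's maintained in O(1) per node, and only
        # the O(lg n) ancestors of a change ever need recomputing.
        node.max_end = node.end
        if node.left:
            node.max_end = max(node.max_end, node.left.max_end)
        if node.right:
            node.max_end = max(node.max_end, node.right.max_end)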

By making these changes, we can reliably choose which branch of any node can contain an interval conflicting with interval k. We add an operation, Overlaps(k), which takes an interval as input and returns true if there’s an overlap.

The algorithm is simple. For each node, starting at the root:

  1. If its interval overlaps with k, then return true
  2. If k’s starting point is smaller than the maxEnd of its left sub-tree, continue in the left sub-tree
  3. Otherwise continue in the right sub-tree.

We keep going until we either find an overlap or reach the leaves, in which case there’s no overlap.
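
In code, the walk looks like this sketch, which treats intervals as closed (hence the non-strict comparisons):

    def overlaps(node, lo, hi):
        # Return True if any interval in the tree overlaps [lo, hi].
        while node is not None:
            if node.start <= hi and lo <= node.end:
                return True                      # step 1: this node overlaps k
            if node.left is not None and lo <= node.left.max_end:
                node = node.left                 # step 2: the left sub-tree may overlap
            else:
                node = node.right                # step 3: only the right sub-tree can
        return False                             # reached a leaf: no overlap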

To see why, in step 2, we can be certain that any overlap must be located in the left sub-tree, we use a proof by contradiction. Assume that k’s starting point is smaller than the left sub-tree’s maxEnd, yet the right sub-tree contains the only overlapping interval.

Since no interval in the left sub-tree overlaps k, any interval in the left sub-tree whose end point is larger than k’s starting point (and at least one such interval exists, by the definition of maxEnd) must have a starting point larger than k’s end point.

[Figure: the left sub-tree’s intervals are colored in light blue.]

Because there is an overlap with an interval in the right sub-tree, the overlapping interval’s starting point must be smaller than k’s end point. But we just saw that some interval in the left sub-tree has a starting point larger than k’s end point, so we have a node in the left sub-tree with a larger starting point than a node in the right.

[Figure: the right sub-tree’s intervals are colored in dark blue and the left sub-tree’s in light blue.]

That’s a contradiction, because we use the starting points of intervals as keys, so the starting point of every interval in the right sub-tree is larger than that of every interval in the left sub-tree. So, by contradiction, if the starting point of k is smaller than the maxEnd of the left sub-tree and there’s an overlap, it must be located in the left sub-tree.

Since a node’s maximum end point depends only on its own interval and the maxEnd values of its children, we can maintain it in constant time per node and guarantee that the insert and delete operations take no more than O(lg n). As a result, performing m insert, delete and Overlaps(k) operations on an interval tree takes O(m lg n).

Interval trees can be used to detect overlaps between rectangles efficiently. Suppose we have a set of rectangles and we want to see if there’s an overlap between any of them.

One solution is a sweep: sort the rectangles’ left and right edges by their x-coordinates, and go over them one by one.

For every left edge, we add an interval representing the rectangle’s height (its range of y-coordinates) to an interval tree, and once we encounter the rectangle’s right edge, we remove that interval from the tree. You can visualize the progression of this algorithm by imagining a vertical sweep line going from left to right.

[Figure: the sweep line is pictured in red. Two rectangles are currently active, so the interval tree contains two intervals: one representing the height of the outer rectangle and the other the height of the inner rectangle.]

The purpose of the interval tree is to keep track of the active rectangles, that is, the ones currently crossed by the vertical sweep line. Before adding each interval, we call Overlaps(k); if it returns true, two active rectangles share both an x-range and a y-range, so they overlap.
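
Putting it together, here’s a sketch of the sweep. It assumes an interval-tree wrapper exposing insert, delete and overlaps operations built from the pieces above; the wrapper and the (x1, x2, y1, y2) rectangle representation are assumptions of this sketch, not a fixed API:

    def any_rectangles_overlap(rectangles, tree):
        # rectangles: list of (x1, x2, y1, y2) with x1 < x2 and y1 < y2.
        # tree: an empty interval tree supporting insert, delete, overlaps.
        events = []
        for i, (x1, x2, _, _) in enumerate(rectangles):
            events.append((x1, 0, i))  # 0 = left edge: rectangle becomes active
            events.append((x2, 1, i))  # 1 = right edge: rectangle is retired
        events.sort()                  # sweep left to right; inserts first on ties
        for _, kind, i in events:
            _, _, y1, y2 = rectangles[i]
            if kind == 0:
                if tree.overlaps(y1, y2):  # an active rectangle shares its y-range
                    return True
                tree.insert(y1, y2)
            else:
                tree.delete(y1, y2)
        return False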

Conclusion

In this post we outlined the process of augmenting Red Black trees to solve problems that plain Red Black trees cannot solve efficiently. We went over a couple of practical examples and saw that with a dash of creativity, we can solve problems efficiently without inventing new data structures.
