Bulk loading R-trees and how to store higher dimension data
R-trees? More like B-trees on steroids
R-trees are tree data structures used for spatial access methods, i.e., for indexing multi-dimensional information such as geographical coordinates, rectangles or polygons.
An object is characterized as spatial if it has at least one attribute that captures its location in a 2D or 3D space. Moreover a spatial object is likely to have geometric extent in space. For example, we can say that a building is a spatial object, since it has a location and a geometric extent in a 2D or 3D map.
There is no total ordering in the multidimensional space that preserves spatial proximity. As a result, objects in space cannot be physically clustered to disk pages in a way that provides theoretically optimal performance bounds to spatial queries.
Assume that we try to sort a set of two-dimensional points using some one-dimensional sorting key. No matter which key we use, sorted by their column a Peano curve or a Hilbert ordering, we cannot guarantee that any pair of objects, which are close in space, will also be close in the total order.
R-trees are mapping our world and this is a real world application. They store spatial data objects such as shop locations and every other shape that creates a map. For example the geographical coordinates that create a polygon from a city block or your favourite restaurant.
Why this is useful?
Hey Siri! find the closest McDonald’s from me
That’s right. Queries according to the actual location of an object.
The “R” in R-tree stands for rectangle
The key idea of the data structure is to group nearby objects and represent them with a rectangle. The minimum bounding rectangle or MBR for short. This happens recursively.
Since all objects lie within this bounding rectangle, a query that does not intersect the bounding rectangle also cannot intersect any of the contained objects. At the leaf level, each rectangle describes a single spatial object. All the other levels are simply node pointers.
A constraint of the tree is that each node (except from the root) should be at least 40% full so that disk space is utilized properly. Below we can see an example of an R-tree.
There are four things that we have to do in order to keep our R-tree utilized.
- The area covered by the MBRs of the directory node entries should be minimized.
- The overlap between MBRs of the directory entries at the same level should be minimized.
- The margins of the MBRs of the directory node entries should be minimized.
- The average number of entries in an R–tree node should be maximized.
Bulk loading is a way to load data, typically into a database, in ‘large chunks’. We will use a method that is called sort tile recursive (STR).
Every MBR contains the following separated by a tab(‘\t’): object-id, x-low, x-high, y-low, y-high. Below we can se a function that computes the MBR of a list of rectangles. Each rectangle is stored in a string.
The STR technique initially sorts the n rectangles with respect to the x-low of the rectangle. After sorting, we know that there will be L = ⌈n/C⌉ leaf nodes in the tree. Attribute C refers to the capacity of each node. Specifically how many rectangles it can store.
The (sorted) rectangles are then divided into ⌈SquareRoot( L)⌉ groups (vertical stripes) and each group is sorted using as a key the y-low of the rectangle. The output from this sorting is packed into leaf nodes and the tree is built in a bottom-up way with recursion calls.
The recursion is over when the number of slices is 1. That means we reached the root level.
Lets query the tree
The tree is traversed in a depth- first manner, following pointers of entries which intersect the query range. The query algorithm takes three parameters; the query range q, the predicate θ the retrieved objects should satisfy with q, and an R–tree node n.
If we are on a leaf node we search every MBR of this node to check if they satisfy our query. If they do we add the rectangle to the answer set.
If we are not on a leaf node we we search every MBR of this node to check id they satisfy our query. If they do we recursively call the function. We use the same parameters. The only parameter that changes is the node. We are now passing the node that is pointed by our current one.
In simple words we want to check if we have an intersection with the MBR of the query.
We want to be able to do 3 query types. Intersection, for MBRs that have at least one common point, Inside, for MBRs that are inside the query and Containment, for MBRs that contain the MBR of the query.
Every non-leaf node must be checked only for intersection. We can see below with an example that we would miss out a lot of results if we strictly filtered all MBRs on non-leaf nodes.
Consider an inside query and a step at a non-leaf node. The black rectangle is the MBR of the query. The red is an intermediate node being filtered and the green of an actual object. If in the recursion we had a strict rules for the MBRs to be completely inside to be examined the red would not pass the filter so we would loose the green one that is an actual object that fit the criteria.
The criteria for accepting an object depends on its geometry and can be seen below.
The way spatial selections are processed in a spatial database systems depends on whether the queried relation R is indexed. If R is not indexed, we scan it linearly and compare each object with the query range. As discussed, the DBMS applies the two-step processing technique, which tests the object’s MBR against the query, before its exact geometry. If the relation is indexed (e.g., by an R–tree) then we can use it to find fast the objects that qualify the filter step.
References: Advanced Database Technologies Lecture notes by Nikos Mamoulis, University of Hong Kong