The Case for Distance-Bounded Spatial Approximations
Using distance-bounded spatial approximations to exploit modern hardware and accelerate spatial queries
Joint work by Eleni Tzirita Zacharatou, Andreas Kipf, Ibrahim Sabek, Varun Pandey, Harish Doraiswamy, and Volker Markl.
Eleni Tzirita Zacharatou and Varun Pandey are Postdoctoral Researchers at the Database Systems and Information Management (DIMA) group at the Technische Universität Berlin (TU Berlin). Andreas Kipf and Ibrahim Sabek are Postdoctoral Researchers at MIT CSAIL. Harish Doraiswamy is an independent consultant, and Volker Markl is a Professor and Chair of the DIMA group at TU Berlin.
Nowadays, the amount of spatial data collected in cities is increasing rapidly, thanks to the widespread use of mobile applications such as Uber. To analyze this data and make informed data-driven decisions, city officials and planners often rely on visualization frameworks that allow users to visualize data of interest over varying spatial and temporal resolutions. These frameworks generate popular visualizations, such as heatmaps, by executing spatial queries that perform different operations over spatial datasets, such as spatial selections, joins, and aggregations. To enable effective exploratory analyses, visual tools must provide interactive response times. However, the sheer size of the data combined with the complexity of spatial queries prohibit interactivity, which severely limits analyses.
In this post, we discuss why current spatial analytics solutions are unable to support interactive applications. Furthermore, we outline our vision of incorporating distance-bounded approximations in spatial systems to enable fast spatial analytics.
Exact Computational Geometry Operations Are Too Expensive
Spatial operators inherently use computational geometry primitives. Evaluating these primitives over real-world complex geometries is costly.
In particular, spatial analytics requires the ability to do point-polygon and polygon-polygon intersections quickly. Polygon intersection algorithms are quadratic. Real-world polygonal regions have complex shapes, often consisting of hundreds of vertices, and thus computing intersections over them may require millions of operations to evaluate a single result record. Multiply that by the fact that datasets can have hundreds of millions of records, and your query may take hours to complete.
The Filter-and-Refine Paradigm
To improve the performance of spatial queries existing spatial evaluation techniques use a two-step ‘filter-and-refine’ paradigm. The filter step evaluates the spatial predicate using approximations (e.g., bounding boxes) of the geometries, which yields a set of candidate results. The subsequent refine step eliminates false matches by performing exact geometric computations.
For the past decades, the filter step was the main bottleneck due to accessing approximations stored on slow storage. The compromise was to alleviate the bottleneck by sacrificing approximation precision for compactness. To that end, the most widely used spatial object approximation is the Minimum Bounding Rectangle (MBR), which is the smallest axis-aligned rectangle that encloses the complete geometry of an object. Minimum Bounding Rectangles are very compact to store, needing only two spatial points.
With the advent of faster high-capacity storage (such as NVMe SSDs, NVM), the system bottleneck has shifted from the filter to the refinement step. However, spatial query evaluation techniques did not keep up and still use coarse-grained approximations in the filter step, which results in a large number of costly refinements. Our measurements confirm that the refine step currently dominates the overall query execution time.
Figure 2 shows that refinements are detrimental to the overall spatial query performance. In this experiment, we joined two different point datasets (tweets and taxi rides) with three different polygon datasets (boroughs, neighborhoods, and census blocks). We indexed the points using three popular spatial indexes: the kdtree, the strtree, and the quadtree, which we implemented as described in our recent publication [2]. We implemented the join using the Google S2 library [1]. In particular, to perform the join, we first approximated each polygon with an MBR and queried the point index for each polygon MBR (filter step). We then performed the refinement step using the S2Loop.Contains function. As the results clearly illustrate, the refinement step takes more than 70% of the time for all index structures, while there are cases where it takes even more than 90%.
Raster Approximations to the Rescue
The above findings show that we need to reduce the number of costly refinements to improve performance, which we can achieve by improving the precision of the filtering step. Motivated by this insight and the fact that many applications only need approximate results, we propose to completely omit the refinement step and use distance-bounded approximations in the filtering step. Distance-bounded approximations allow controlling the maximum spatial distance between the approximate and the original geometry. As a result, users can achieve the desired level of accuracy, potentially at the expense of higher query execution time.
Interestingly enough, not all geometric approximations can be distance-bounded. Consider, for example, MBRs. The distance between a corner of the rectangle and the closest point in the object boundary can be rather large. In contrast, raster approximations are one class of approximations that can be distance-bounded. Figure 3 shows an example of a uniform raster approximation. As depicted in the figure, uniform raster approximations represent geometric primitives with a set of equi-sized cells.
Another benefit of raster approximations is that we can quickly compute them on the GPU, exploiting the GPU’s native support for rasterization. The rasterization operation converts a geometric primitive (e.g., a polygon) into a set of pixels which essentially form a fine-grained uniform grid approximation of the primitive. Therefore, we can compute the raster approximation of a spatial object by rendering it directly on the GPU. GPUs perform rasterization at interactive speeds, as they employ highly optimized hardware implementations. As a result, leveraging the GPU, we can compute raster approximations and evaluate spatial queries in real-time.
Distance-Bounded Approximate Spatial Query Processing
Our proposal is to incorporate distance-bounded spatial approximations in different spatial system components (i.e., data access, query optimization, and query execution) to enable approximate spatial query processing. Figure 4 shows an overview.
Data Access
Figure 3 essentially shows the logical data representation: a geometry is approximated by a set of cells, potentially along with additional information that denotes the cells intersecting with the geometry’s boundary. Given this representation, a data management system needs efficient indexes to store and retrieve the approximations. Since approximate query processing eliminates the expensive refinement step, the index lookup performance is crucial because it determines the overall query performance. While traditional indexes such as the R-tree are designed to index MBRs and do not apply to raster approximations, raster approximations enable new indexing opportunities. Specifically, in our work, we map the raster cells to a one-dimensional array by enumerating them with a space-filling curve and then use a learned index. By learning the position of the cells in the 1D array, the learned index achieves higher lookup performance compared to traditional spatial indexes such as the R-tree, the quadtree, and the kdtree.
Query Optimization
Existing approaches for spatial query processing are tied to specific geometric types and query classes. In particular, they use spatial operators that encapsulate both the filter and the refinement step. While the filter step relies on MBRs and is thus generic, the refinement step depends on the geometric type and operation. For example, finding the number of restaurant check-ins contained in a region requires point-in-polygon tests. If we change the input of the containment operator from check-in locations to restaurants represented by polygons, we need a different implementation that performs polygon-in-polygon tests instead. As a result, it is impossible to reuse physical operators across query classes and geometric types, which severely limits the set of optimization options.
By abstracting away from the specific object geometries, raster approximations overcome the above limitations. Specifically, they enable the evaluation of spatial queries using standard computer graphics operations: blend, mask, and affine transformations. These operations are independent of the specific geometries, and thus their implementation is reusable across query classes. For instance, both point-in-polygon and polygon-in-polygon intersection tests boil down to applying a combination of the above operations on raster approximations.
Figure 4 (bottom) illustrates two example operations. The blend binary operator merges two rasters into one. The blend function ⊙ defines how to perform the merge. The mask operator filters cells of the raster to retain only those cells that satisfy the condition specified by 𝑀.
Query Execution
As we highlighted above, we can evaluate spatial queries simply by combining a small set of operators that directly manipulate raster representations. Here, we discuss spatial aggregation queries as a representative example. Spatial aggregation queries compute summarized aggregate information about data that falls inside a given boundary. To achieve that, they need to join point data and polygon data.
We developed an algorithm that uses raster approximations to evaluate spatial aggregation queries approximately. Our algorithm, Bounded Raster Join (BRJ) [3], merges (using the blend operator) all the points into a single raster that stores partial aggregates, i.e., each raster cell keeps the aggregate of all points falling in that cell. Then, it joins this raster with the set of polygon rasters (by composing the blend and mask operators) to identify points that intersect with the polygons. Finally, it merges the results (using a combination of transformations and blending) to compute the aggregates. Figure 4 (right) illustrates a small example. The graphics pipeline on the GPU natively supports the above operations, leading to orders of magnitude speedup over typical evaluation strategies on CPUs without requiring any pre-computation.
Summary
Changes in application requirements and hardware necessitate us to rethink the role of geometric approximations in spatial data management. In the past, geometric approximations, such as the MBR, served as a fast and compact filter to identify candidate results. We show that due to the faster high-capacity storage (e.g., NVMe SSDs, NVM) available today, the usefulness of fast filtering diminishes rapidly. While the filtering time is negligible, refining the filter results requires expensive geometric tests that dominate the query execution time and prevent interactivity. To improve performance and take better advantage of modern hardware, we envision a new generation of spatial systems that incorporate distance-bounded spatial approximations at their core. To learn more about distance-bounded spatial approximations, you can read our recent CIDR 2021 paper [4].
Acknowledgements
This work was partially supported by the German Ministry for Education and Research as BIFOLD — Berlin Institute for the Foundations of Learning and Data (ref. 01IS18025A and ref. 01IS18037A).
References
[1] S2 geometry.
[2] Varun Pandey, Alexander van Renen, Andreas Kipf, Ibrahim Sabek, Jialin Ding, and Alfons Kemper. The Case for Learned Spatial Indexes (2020). AIDB @ VLDB.
[3] Eleni Tzirita Zacharatou, Harish Doraiswamy, Anastasia Ailamaki, Claudio T. Silva, and Juliana Freire. GPU Rasterization for Real-Time Spatial Aggregation over Arbitrary Polygons (2018). VLDB.
[4] Eleni Tzirita Zacharatou, Andreas Kipf, Ibrahim Sabek, Varun Pandey, Harish Doraiswamy, and Volker Markl. The Case for Distance-Bounded Spatial Approximations (2021). CIDR.