Spatial Data Management: Spatial data models, spatial data system development and spatial indexing: an overview

Chamin Hewage
LiIDS MAGAZINE.am
Published in
8 min readSep 29, 2021
Navigating through the space and time

Geographical world can conceptually model as entity or field oriented representations. Entity (or the object) based models are used to precisely define and recognize spatial entities/objects as non-divisible building blocks that inherit spatial properties (e.g. house, signpost, river). Field oriented representations are used to model spatial phenomena that vary smoothly over a continuous geographical area (e.g. elevation, temperature). The option to use which conceptual model, i.e. entity or field, influences the choice of the geographical data model. Geographical data models pave the way for structuring spatial objects/phenomena correspond to a spatial location by decomposing spatial objects /phenomena into multiple parts for spatial analysis and communication. Today, vector data models and tessellation of fields data models are widely applied in the geographical data modeling (Burrough2015).

Typically, the vector data model is linked to the entity approach. Hence the vector model can be used to distinctly identify spatial objects by its attributes and geographic location. The Vector data model employs three fundamental geographic data primitives, namely, point, line, and polygon (or volygon in 3D space) to distinctly represent the geographic entity in x,y,z coordinate space. A point has a zero dimension and possesses only the location property. A line is one-dimensional and possesses the length property. In polygon/volygon, the 2D/3D spatial object is represented along with area/volume property. Once the entity data model has point, line, polygon/volygon vector data, the next crucial thing is to represent the spatial relationship between the identified entities through the topology. In more formal terms, topology is the study of properties of geometric objects that remain invariant under different transformations such as stretching or bending (Massey1967) and often described through graph theory (Wilson1990) (Burrough2015) (Chang2006).

In tessellation, the continuous spatial field is subdivided into discrete spatial segments which can be either regular or irregular. The regular (e.g. equal size triangles, squares) and irregular shapes (e.g. triangles of different shapes) permit the engagement of different types of analysis and modeling for continuously varying spatial data. In today’s world, regular tessellation (or regular grid) based spatial data modeling is quite prominent. This sort of regular tessellation or regular grid-based spatial data models is called raster data models. A raster grid consists of rows, columns, and cells. One fundamental quality of a raster cell (also known as a pixel in 2D space or a voxel in 3D space), is that its geographic location is explicitly defined by the row and column positions in the grid. Furthermore, in raster representations, it is assumed that the variation in the continuous grid cells to be mathematically continuous. The cell values are typically expressed as integer values or floating-point values (Burrough2015) (Chang2006).

In a nutshell, it can conclude that most geographic spatial data (both entity and field) can represent either as vector data or raster data in the forms of points, lines, polygons/ volygons, 2D/3D regions/surfaces, and volumes associated to specific locations in the space (i.e. x,y,z coordinate). Albeit depending on data, each application could apply custom data management solutions, all geography oriented data management techniques coined from the same underlying principles/concepts of ‘spatial data management’. Henceforth, for brevity, the remaining part of this section uses the generic term ‘spatial data’ instead of ‘geographic spatial data’. Also, instead of vector/raster data management, the term ‘spatial data management’ is considered.

Today, spatial data is at the core of many computer science research fields such as geographical information systems, computer vision/graphics, VLSI, image processing, computer-aided design(CAD), solid modeling, robotics, etc. (Rigaux2001) (Samet1990). Often, these applications store spatial data in a database environment called spatial database- an environment specially designed to handle spatial data. Spatial databases implement tailored data structures atop data to ensure efficient access to spatial data in the databases. These tailored data structures are known as spatial indexes or spatial index structures or multidimensional access methods.

Spatial indexing is the crux of spatial data management. It guarantees fast retrieval of spatial data by minimizing the disk reads. Furthermore, a well designed spatial index is extensible and aims to harness the storage in an efficient manner while keeping the ratio between spatial index size to actual data at a minimum level (Lu1993). Nevertheless, spatial indexing is different from conventional one-dimensional database indexing and goes beyond the concept of primary key. The primary reason is that spatial data are inherently multidimensional complex non-zero size objects and the spatial indexes are constructed based on associated coordinates for the object in the Euclidean space (Lu1993). Not on separate attributes x,y, or z. If a general composite index is created in a 2D space considering x, and y coordinates, two distinct indexes have to be searched to retrieve the data. Even though composite indexes could solve multidimensional issues such solutions are not viable techniques for spatial data management. It is an overhead that gets further exacerbated when the dimension of spatial data increases (Laurini1992). Therefore, B-tree and linear (or extensible) hashing like one-dimensional indexing structures are typically not suitable for indexing spatial data (Ooi1993). Hence, designing efficient spatial index structures for spatial data is non-trivial and should adhere to a systematic approach. According to Gaede1998 the design of spatial index should be based on the spatial data and the queries that intend to perform over the spatial data.

From the view of Samet2006, the queries of interests can mainly be of two kinds (i) feature query and (ii) location query. A feature query provides a spatial object as the input and retrieves the spatial location as the output. This is done by explicitly representing the spatial object by organizing objects through re-ordering of the space. Thus, the object-based indexing techniques are primarily implemented for feature queries. A location query provides a location and retrieves the spatial objects that occupy the location. Hence location based or space-based (image based/ cell based) spatial indexing techniques are often implemented for location queries. Space-based indexing could further classify as fixed-grid and adaptive-grid indexing. Fixed-grid is constructed by virtually super imposing an equal size grid on top of the spatial region. Space-filling-curve based space ordering techniques such as Morton curve (also known as Z order and N order curve) and Piano-Hilbert curve (also known as Hilbert curve) belong to the family of fixed-grid based index methods. Fixed-grid-based hierarchical index structures are typically constructed by mapping the values obtained from space ording techniques with one-dimensional tree structures such as a B+-tree. Adaptive grid techniques can be based on either hierarchical structures or non-hierarchical structures. Hierarchical structures can be quadtree/octree or their variants such as point quadtree, region quadtree/region octree, or other hierarchical techniques such as k-d tree or its variants (e.g. K-D-B tree. Grid file index structure can be recognized as a non-hierarchical structure.

While space-based and object-based techniques primarily focus on location queries and feature queries separately, Samet2006 further states that these techniques can improve to support both queries. This is achieved through propagating the spatial objects and space occupied by objects up along a hierarchy. For example, object-based techniques use space approximation for objects, termed a bounding box, and propagate these bounding boxes up in the hierarchy. R-tree and its variants such as R+-tree, packed R-tree and R*-tree are based on this concept. Nevertheless, the aforementioned space-based and object-based spatial index structures classification by Samet2006 is not the only classification.

In the literature, spatial index structures are introduced under different taxonomies and they are based on different perspectives. Gaede1998 introduced two classes of multidimensional access methods from the perspective of how spatial data can be organized in the secondary storage. The two classes are namely, (i) point access methods (PAM) and (ii) spatial access methods (SAM). PAM enables efficient search of multidimensional point data by organizing point data into a number of buckets each of which corresponds to a disk page and to a particular subspace of the spatial universe. Such PAM includes multidimensional hashing techniques (e.g. Grid files, extensible cell or EXCELL, and hierarchical structures (e.g. K-D-B tree, LSD-tree), and SFC based point access methods (e.g. Hilbert curve and Z-ordering curve). On the other hand, SAM are designed to manage multidimensional objects with spatial extent by modifying PAM following a technique such as (i) transformation (or object mapping, (ii) overlapping regions (or object bounding) (e.g. R-tree, R*-tree), (iii) clipping (or object duplication) (R+-tree) and multiple-layer approach (e.g. multi-layer grid file). The first three techniques, namely, transformation, overlapping regions, and clipping methods are synonymous with the SAM classification introduced by Kriegel1988. This classification by Kriegel1988 also embraced by Ooi1993 in introducing a taxonomy for spatial index structures. Furthermore, Ooi1993 gives a meticulous analysis of the strengths and weaknesses of different spatial index techniques. Nam2004 classified multidimensional indexing trees as space partitioning methods (e.g. K-D-B tree) and data partitioning methods (e.g. R-tree, R*-tree) and studied their impacts on range queries.

As of 2020, many mainstream databases including Oracle (Oracle Spatial and Graph), PostGIS/PostgreSQL, IBM (e.g. IBM Spatial DB2 Spatial Extender), Ingres (Ingres Geospatial) and MySQL (MySQL Spatial) have spatial indexes built within them to support spatial data management. Furthermore, since 2010, there has been phenomenal growth in incorporating spatial index structures at the core of developing big spatial data systems (BSDS) atop NoSQL databases and distributed file systems. This trend is also visible in the current state-of-art scalable aerial LiDAR data management systems.

References:

  1. Burrough2015 — Burrough, P.A., McDonnell, R., McDonnell, R.A. and Lloyd, C.D., 2015. Principles of geographical information systems. Oxford university press.
  2. Massey1967 — Massey, W.S., 1967. Algebraic topology: an introduction. New York: Springer.
  3. Wilson1990 — Wilson, R.J. and Watkins, J.J., 1990. Graphs: an introductory approach: a first course in discrete mathematics. John Wiley & Sons Inc.
  4. Chang2006 — Chang, K.T., 2006. Introduction to geographic information systems (pp. 117–122). Boston: McGraw-Hill Higher Education.
  5. Rigaux2001 — Rigaux, P., Scholl, M. and Voisard, A., 2001. Spatial databases: with application to GIS. Elsevier.
  6. Samet1990 — Samet, H., 1990. The design and analysis of spatial data structures (Vol. 85, p. 87). Reading, MA: Addison-Wesley.
  7. Lu1993 — Lu, H. and Ooi, B.C., 1993. Spatial indexing: Past and future. IEEE Data Eng. Bull., 16(3), pp.16–21.
  8. Laurini1992 — Laurini, R. and Thompson, D., 1992. Fundamentals of spatial information systems (Vol. 37). Academic press.
  9. Ooi1993 — Ooi, B.C., Sacks-Davis, R. and Han, J., 1993. Indexing in spatial databases. Unpublished/Technical Papers.
  10. Gaede1998 — Gaede, V. and Günther, O., 1998. Multidimensional access methods. ACM Computing Surveys (CSUR), 30(2), pp.170–231.
  11. Samet2006 — Samet, H., 2006. Foundations of multidimensional and metric data structures. Morgan Kaufmann.

--

--

Chamin Hewage
LiIDS MAGAZINE.am

I am a Data Systems' scientist (PhD). I work as a senior Database Engineer at HPE. I aspire to bring state-of-the-art to mainstream through innovation.