Spatial data science for the uninitiated
The art and science of distilling insight from location information
An old friend messaged last week with a great question: “What is spatial data science (as opposed to regular data science) for us uninitiated? And why is it significant?”
What is spatial data science?
Spatial data science is the practice of distilling insight from spatial data using algorithms and analytical techniques.
Spatial data is any data where the relative positions of observations are described by the data, and can be used as a dimension in the analysis. Put another way, spatial data has information about where each individual datum is — and therefore, where the observations are in relation to each other.
The most intuitive example is geospatial data, which carries information about where things happen on Earth. Geospatial data can describe natural or human subjects like topography, political boundaries, urban systems, weather and climate patterns, road networks, distributions of species, consumption habits, shipping and logistics systems, demographics and so on.
The spatial dimensions usually are measurements of latitude (the y-coordinate) and longitude (the x-coordinate), and sometimes altitude (the z-coordinate), which can place a point precisely on, above or below the Earth’s surface. GIS skills and tools (geographic information systems / science) are often used by spatial data scientists to manipulate, analyze and visualize geospatial data.
It’s worth noting that many spatial analysis techniques are actually agnostic to the scale — so we could apply the same algorithms to a map of the Earth as we could to a map of a cell, or a map of the universe. And spatial data science techniques can be applied to more abstract problems with a spatial element — analyzing how closely associated words are, for example, by working out how often they are found together.
Raster and Vector
Spatial data typically falls into two categories: raster and vector. Both are ways to describe space and represent features, but they work quite differently.
Raster Data
A raster is a “grid of regularly sized pixels”. By assigning each cell in the grid a value — or a few values — images can be described numerically, as multidimensional arrays.
For example, take a 3x3 grid that looked like this:
If 1 means black and 0 means white, we could represent it numerically like this:
img = [[ 1, 0, 1 ],
[ 0, 1, 0 ],
[ 1, 0, 1 ]]
The numbers in raster cells can mean lots of things — the altitude of the land or depth of the sea at that specific point, the amount of ice or snow on that point, the number of people living within that pixel, and so on. Further, just about any color in the visible spectrum can be described by a combination of three numbers representing the intensity of Red, Green and Blue (RGB) — satellite images are raster data structures. GeoTiff, jpg, png and bitmap files contain raster data.
Vector Data
Vector data is a bit more abstract. In a vector dataset, features are individual units in the dataset, and each feature typically represents a point, line or polygon. These features are represented mathematically, usually by numbers that signify either the coordinates of the point, or the vertices (corners) of the geometry.
Points, Lines, Polygons
As a quick example, here’s a bare-bones numerical representation of each of these types of features:
point = [ 45.841616, 6.212074 ]line = [[ -0.131838, 51.52241 ],
[ -3.142085, 51.50190 ],
[ -3.175046, 55.96150 ]]polygon = [[ -43.06640, 17.47643 ],
[ -46.40625, 10.83330 ],
[ -37.26562, 11.52308 ],
[ -43.06640, 17.47643 ]]
// ^^ The first and last coordinate are the same
Vector features will often have some metadata included that describes the feature — the name of a road, say, or the population of a state. These extra, non-spatial metadata of a feature are usually called “attributes”, and are often represented in an “attribute table”. Very often spatial data scientists will combine the spatial dimensions (coordinates — for points, or coordinate arrays — for lines and polygons) with non-spatial dimensions in their analysis. GeoJSON and .shp files commonly contain vector data.
Why is this different from regular data science?
The short answer is that it isn’t: spatial data science is a discipline within data science. However, spatial data have some characteristics that make them need special treatment. This is true in the way the data is stored and handled from a programming / database perspective, and how it is analyzed from an algorithmic perspective. That means spatial data scientists have to learn some concepts — mainly from geometry, like representing 3D shapes on flat surfaces — that other data scientists might never have to deal with.
Tools for Spatial Data Science
Spatial data scientists try to make sense of these spatial datasets to better understand the system or phenomenon they’re studying. Some incredible (and often free) software tools make this possible. Most programming languages like Python, R and Javascript have amazing spatial analysis libraries like geopandas and turf.js, and desktop programs like QGIS make visualizing and analyzing spatial data accessible to less technical people. There are also powerful online tools like Mapbox, Carto and Google BigQuery to help with these analysis and visualization challenges. JavaScript libraries like Leaflet and Mapbox GL JS enable web developers to create interactive maps in the browser.
A few examples
Spatial data scientists might have the task of analyzing spatial distribution — to see if points cluster together, are spread out, or are randomly placed — to, say, work out the best place to build a new airport or retail center, or understand patterns of violence or crime.
It could entail analyzing trends through time — by seeing how a certain district’s voting outcomes evolved, or how sentiment toward some issue changed in different parts of a country.
Maybe analysts are analyzing satellite imagery to map unmapped areas to help deliver emergency services more efficiently, or work out how shady or sunny a new potential building site or bike route is. It might mean calculating the most efficient route to get from A to B given current traffic conditions. For as niche as the field is, spatial data science has a breadth of applications across just about every sector and field.
By analyzing spatial data with this specialized statistical toolkit, spatial data scientists are able to better understand spatial relationships, and, possibly, work out why things happen where, and predict where things will happen next.