PostGIS tips: How to get insights from your data

This is Part 3 in a three-part series of Introduction to PostGIS.

I put these docs together in an effort to help my colleagues at the City of Boston get acquainted with PostGIS. You can find Part 1 here and Part 2 here.

This section contains an introduction to spatial indexing along with various examples of spatial joins, buffering, finding the nearest, and clustering.

Spatial Indexing

Spatial indexing is key to making our spatial queries run fast and are the thing that allows us to analyze large geo-datasets. At a high level, a spatial index creates a number of rectangles or bounding boxes for features in a dataset and uses those larger boxes to initially compare spatial relationships instead of examining every feature.

On an indexed table, a spatial query first compares the relationships between bounding boxes (faster and easier to do). Then uses the result of the bounding box query to figure out what features it needs to look at more closely.

You should only have to worry about adding a spatial index to a spatial table if you are the one adding it to the database. For tables that are in the production databases, the indexing should already be taken care of.

To create a spatial index on a table, run the CREATE INDEX command on your table making sure to include USING GIST (generic index structure) on your geometry column.

CREATE INDEX public_bathrooms_index
ON sandbox.public_bathrooms_testing USING GIST (geom)

Good Reads

Spatial Joins

Spatial joins are one of the most powerful things you can do with GIS. While normal table joins operate on a common attribute (e.g. a unique id), spatial joins operate on the spatial relationships between datasets. You don’t need a common identifier between tables, you just need to know where each one sits on the Earth.

Relying on spatial relationships means you can ask questions like “what city council district is this music venue inside?” or “how many bathrooms are in each City Council district?”.

Functions for spatial joining

Any function that evaluates the relationship between two layers and returns a true/false value can be used in a spatial join (a good list of those can be found here), but the main ones you’ll usually see in use are ST_Contains, ST_Within, and ST_Intersects.

For all of these spatial relationships to work correctly, you have to make sure the datasets you’re joining are in the same projection system.

ST_Contains

ST_Contains takes the geometry columns for two tables: ST_Contains(spatialLayer1.geom, spatialLayer2.geom), and returns TRUE if all parts of Layer 2 fall inside Layer 1.

For example, you can get the city council district each public bathroom is inside by joining the two layers on the public bathroom being inside the council district.

select
bathrooms.name as bathroom_name,
council.district,
council.councilor
from sandbox.public_bathrooms as bathrooms
join sandbox.city_council_testing as council
on ST_Contains(council.geom, bathrooms.geom2249)
Output from using ST_Contains to find the City Council district each public bathroom is inside of.

ST_Within, ST_DWithin

ST_Within is the exact same as ST_Contains, but the inputs are reversed. Instead of asking if Layer 2 inside Layer 1, you’re asking is Layer 1 inside Layer 2.

The real fun starts with ST_DWithin which allows you to find features that are within a specified distance of each other. The function takes geometry columns for two spatial layers in addition to the distance value you’d to use to compare the layers.

ST_DWithin uses the units of the projection system your data is in as the units for the distance value passed to the function. It’s good to have your data in NAD83 MA State Plane (EPSG: 2249) when using this function because:

  1. its measurements are more accurate for Boston
  2. its units are US Feet

You can find out the units of your projection system on epsg.io under the “Attributes” section.

Using ST_DWithin, we can do things like find all the public bathrooms that are within 1000ft of the City Hall Plaza music venue.

SELECT Distinct 
music.venue,
bathrooms."Name"
FROM sandbox.music_venues as music
JOIN sandbox.public_bathrooms_testing as bathrooms
ON ST_DWithin(bathrooms.geom2249, music.geom2249, 1000)
WHERE music.venue = 'City Hall Plaza'
Bathrooms within 1000ft of City Hall Plaza music venue location.

ST_Intersects

ST_Intersects returns true if any part of the geometries you are comparing overlap with each other — they don’t need to be inside of each other like the functions above. ST_Intersects takes two valid geometry columns or geometry values.

For example, ST_Intersects allows us to find all the streets that cross through city council district 7.

SELECT st_name AS street_name
FROM sandbox.boston_segments AS streets
JOIN sandbox.city_council_testing AS council
ON ST_Intersects(streets.geom, council.geom2249)
WHERE council.district = '7'

The output is a list of street names that intersect with that district. When we map this, we see that unlike in the examples above, one geometry does not have to be contained by the other, they simply have to overlap each other at some point.

Streets that intersect city council district 7 as a list and a map. The blue lines are the city’s streets, the red polygon is city council district 7, and the yellow lines are the streets that intersect the district.

Good Reads

Buffering

If you’re trying to find things within a certain buffer or distance of another layer, the ST_DWithin function may fulfill your needs, but if you need to create a buffer around a layer and hold on to the new geometry, ST_Buffer is the function to use.

ST_Buffer takes a geometry column and the amount you want to buffer the geometry by. Like ST_DWithin, the units for the amount of the buffer are in the coordinate system of your data. When using this function it is best to have your data in 2249 for the same reasons listed above.

select distinct
bathrooms."Name",
st_buffer(st_transform(bathrooms.geom4326, 2249), 500)
as buffer_geom
from analytics_internal_data.public_restrooms as bathrooms
500 foot buffer layer around each public bathroom.

When using the buffer function on its own, like we do above, a new polygon will be created for each feature in the dataset. Depending on how close features are in your dataset, these new polygons may overlap each other (as seen above).

If you’re using the buffer area to do something like exclude/include anything within 500 feet of a public bathroom in the city, it might make sense to merge these disparate polygons together. This will give you a more simple layer that simply marks the areas that are X distance away from a public bathroom.

The ST_Union function will dissolve all of the buffers into one large multipolygon. The geometries of any buffers that overlap each other will be merged together.

SELECT Distinct
st_union(st_buffer(st_transform(bathrooms.geom4326, 2249), 500)) as buffer_geom
from analytics_internal_data.public_restrooms as bathrooms
500ft dissolved buffer around public bathrooms.

Finding the nearest

The <-> operator in PostGIS give you the ability to order a table by each features distance from point, making answering questions like “what are the 10 nearest music venues to City Hall?” fairly straight forward.

SELECT "venue", "address", "promoter"
FROM sandbox.music_venues
ORDER BY geom <-> st_setsrid(st_makepoint(-71.057973,42.360391),4326)LIMIT 10;
The 10 nearest music venues to City Hall.

We can also use the operator inside a more complex queries that answer questions like “what is the closest public bathroom to each music venue?”

Here, we are using a sub-query that finds the closest bathroom to each music venue and joining it back to our music venue table.

SELECT
music.venue AS music_venue,
music.address AS venue_address,
bathroom_query."Name" AS bathroom_name,
bathroom_query."Address" AS bathroom_address
FROM sandbox.music_venues AS music
JOIN LATERAL (
SELECT
bathroom."Name",
bathroom."Address"
FROM sandbox.public_bathrooms_testing AS bathroom
ORDER BY ST_Transform(music.geom, 2249) <-> ST_Transform(bathroom.geom, 2249)
LIMIT 1
) AS bathroom_query ON TRUE
The closest public bathroom to each music venue.

Good Reads

Clustering

PostGIS has powerful clustering functions you can use when/if you’re doing advanced analysis on your data. The main two are ST_ClusterDBSCAN and ST_ClusterKMeans.

Using QGIS, you can run a select statement that determines the cluster_id for each feature in your dataset, then color the layer by that cluster id to see the visual output.

SELECT
venue,
geom2249,
gid,
ST_ClusterKMeans(geom2249, 5) OVER(order by gid) AS cluster_id
FROM sandbox.music_venues_in_boston
ORDER BY cluster_id
Output of KMeans clustering on music venue data.

Good Reads