Upgrading spatial analysis of Origin-Destination data using modern vis frameworks, part 1 of 2
Spatially arranged Beziers and OD-matrices in ggplot2
giCentre re-designs for OD flow visualization
Analysing spatial patterns in origin-destination (OD) data is challenging. Inspecting power-law type distributions is a good place to start and network science contributes a useful repertoire of summary statistics. But often analysts working with OD data — migration geographers, transport planners — wish to undertake spatial analysis tasks that cannot be easily captured using summary measures alone.
This is of course where visual analysis approaches become necessary.
But representing spatial structure in OD flow data is not an easy undertaking. Problems of visual clutter (unintelligible hairball) and salience bias (giving undue emphasis to OD pairs that are relatively incidental) are all too familiar.
InfoVis researchers have worked on overcoming these difficulties. I am being selective here, but it is worth drawing particular attention to techniques published by the giCentre: Jo Wood and colleagues’ spatially arranged OD matrices, with a nice application in this paper by Aidan Slingsby.
Democratising visualization design
Clearly not all those doing applied data analysis have the time or inclination to learn to program such graphics from scratch.
Modern visualization frameworks such as Tableau, ggplot2 and, more recently, Vega-Lite have done a great deal to make visual data analysis accessible. Rob Radburn has created spatially arranged OD matrices in Tableau to great effect.
Below I present examples of these layouts created in ggplot2 using OD travel-to-work data in London. Part 2 will present some actual data analysis and give more detail. I also hope to persuade you that writing high-level specifications of these graphics supports understanding of the encodings themselves.
Note: This post assumes some knowledge of ggplot2. I’ve also included a link to a reproducible GitHub repo at the bottom of this page.
Most flow visualizations produced by a GIS (and many in ggplot2) involve drawing lines between OD pairs.
In ggplot2 this can be achieved with geom_path(), passing an array of coordinates representing OD pairs to aes(), the argument that defines the mapping of data to visuals.
The map on the left is clearly unintelligible, but by parameterising aes() in the same way as one would for any other graphic (alpha), we can expose a little of the structure we’d expect in an OD map of borough-to-borough commutes by car (top) and foot+bike (bottom).
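As a rough illustration of the approach just described, the sketch below reshapes a small OD table so that geom_path() can draw one line per OD pair, then maps the commute count to alpha. The data, the column names (od_id, o_x, o_y, d_x, d_y, count) and the alpha range are all made up for this example; they are not taken from the repository.

```r
library(ggplot2)

# Hypothetical OD data: one row per borough pair, with centroid coordinates
# for the origin (o_*) and destination (d_*) and a commute count.
od <- data.frame(
  od_id = c("A-B", "A-C", "B-C"),
  o_x = c(0, 0, 2), o_y = c(0, 0, 1),
  d_x = c(2, 1, 1), d_y = c(1, 3, 3),
  count = c(1200, 300, 50)
)

# geom_path() connects rows in data order within each group, so reshape to
# two rows per OD pair: the origin point first, then the destination point.
od_lines <- rbind(
  data.frame(od_id = od$od_id, x = od$o_x, y = od$o_y, count = od$count),
  data.frame(od_id = od$od_id, x = od$d_x, y = od$d_y, count = od$count)
)

# Mapping count to alpha lets frequent flows stand out while rare flows fade.
flow_map <- ggplot(od_lines) +
  geom_path(aes(x = x, y = y, group = od_id, alpha = count)) +
  scale_alpha(range = c(0.1, 1)) +
  coord_equal() +
  theme_void()
```

Printing flow_map draws the lines. With real data the same specification applies unchanged; only the reshaping step depends on how the OD table is stored.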
The ggforce extension provides functions for generating various curves. Using geom_bezier() (which has an API equivalent to geom_path()) we can use the same approach published in this paper and represent OD pairs as asymmetric Bezier curves, such that the line is straight at the origin borough and curved at the destination.
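One way to get this asymmetry, sketched here under my own assumptions rather than as the method from the paper, is a quadratic Bezier whose single control point sits most of the way towards the destination and is offset perpendicular to the straight OD line. The helper bezier_points() and its along and offset parameters are hypothetical names introduced for illustration.

```r
library(ggplot2)
library(ggforce)

# Hypothetical OD data, as one row per borough pair.
od <- data.frame(
  od_id = c("A-B", "A-C", "B-C"),
  o_x = c(0, 0, 2), o_y = c(0, 0, 1),
  d_x = c(2, 1, 1), d_y = c(1, 3, 3),
  count = c(1200, 300, 50)
)

# Hypothetical helper: returns three rows per OD pair (origin, control point,
# destination). The control point lies `along` of the way from origin to
# destination, shifted sideways by `offset` times the OD vector rotated by
# 90 degrees, so the curve bends mostly near the destination.
bezier_points <- function(od, along = 0.8, offset = 0.2) {
  vx <- od$d_x - od$o_x
  vy <- od$d_y - od$o_y
  ctrl_x <- od$o_x + along * vx - offset * vy
  ctrl_y <- od$o_y + along * vy + offset * vx
  pts <- rbind(
    data.frame(od_id = od$od_id, step = 1, x = od$o_x, y = od$o_y, count = od$count),
    data.frame(od_id = od$od_id, step = 2, x = ctrl_x, y = ctrl_y, count = od$count),
    data.frame(od_id = od$od_id, step = 3, x = od$d_x, y = od$d_y, count = od$count)
  )
  pts[order(pts$od_id, pts$step), ]
}

od_bezier <- bezier_points(od)

# geom_bezier() treats three points per group as a quadratic Bezier.
bezier_map <- ggplot(od_bezier) +
  geom_bezier(aes(x = x, y = y, group = od_id, alpha = count)) +
  scale_alpha(range = c(0.1, 1)) +
  coord_equal() +
  theme_void()
```

Because the offset always rotates the OD vector the same way, A-to-B and B-to-A curve on opposite sides of the straight line, so reciprocal flows do not overplot.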
The addition of the asymmetric Beziers exposes the fact that there are more commutes by bike and foot into central London boroughs (Westminster and City of London) than in the reverse direction. The same asymmetry appears between some outer London boroughs for car commutes: Harrow-to-Brent in the north west and Bexley-to-Greenwich in the south east. There are also reciprocal commutes between some outer boroughs: Hillingdon-to-Ealing and Ealing-to-Hillingdon in west London. More data analysis to follow in part 2.
But there are problems with this representation. By giving greater emphasis to more frequent flows we likely miss some interesting structure; longer lines are more visually salient than shorter lines; and our flow lines imply unnecessary geographic precision.
Spatially-arranged OD-matrices (Wood et al. 2010) work by using a semi-spatial grid-like layout that presents each borough as a grid square at its approximate geographic location. The challenge with the grid layout is to preserve adjacencies as much as possible, while retaining a recognisable geometry of London. The maps below use the After the Flood layout, but check out small multiples with gaps for a generalisable technique for automating these arrangements.
Each large grid square represents a destination borough (a workplace). Embedded within this is another spatially arranged grid map of origins with counts of commutes to the destinations (workplaces) encoded with colour darkness. Exactly the same information is presented as in a regular OD matrix. The only difference is that cells are spatially arranged. Notice that the destination boroughs (workplaces) are white and outlined. Here I’ve applied a local colour scaling: the darkest colour represents the maximum flows to the referenced destination cell.
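The local colour scaling described above amounts to dividing each count by the largest count flowing into the same destination. A minimal sketch in base R, using made-up counts and assumed column names (orgn, dest, count):

```r
# Hypothetical OD counts between three boroughs (within-borough flows omitted).
od <- data.frame(
  orgn  = c("B", "C", "A", "C", "A", "B"),
  dest  = c("A", "A", "B", "B", "C", "C"),
  count = c(100, 50, 20, 10, 30, 60)
)

# ave() applies max() within each destination and returns a vector aligned
# with the rows of od, so every row sees its own destination's maximum.
od$dest_max <- ave(od$count, od$dest, FUN = max)

# Local scaling: 1 marks the largest flow into each destination.
od$count_local <- od$count / od$dest_max
```

Mapping count_local rather than count to fill gives every destination cell its own colour range. This makes the spatial pattern within each destination readable, at the cost of making colours incomparable between destinations.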
Again, proper data analysis to follow in part 2, but this layout overcomes the salience bias problem and allows rich inferences into spatial patterns. We can quickly identify where there are smooth and abrupt and/or directional differences in the commuting profiles of boroughs. Westminster draws bike (+foot) commuters from a wider set of boroughs than does the City of London. We see commuters drawn from east and west of Greenwich, less so from boroughs north and south. That outer boroughs contain more localised flows than do inner boroughs is probably correct — but there are likely some edge effects here that I’m not accounting for.
If you understand the layout, reproducing it in ggplot2 is straightforward. And if you don’t, forcing yourself to specify it in ggplot2 is instructive.
In the After the Flood layout, each borough has a 2d grid location. Pass this, along with the count of the number of commutes, to geom_tile() in the usual way: geom_tile(aes(x=gridX_orgn, y=gridY_orgn, fill=count)). Then, facet the tiles on the destination (workplace) cell using
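A minimal sketch of the tile-and-facet specification follows, assuming facet_grid() for the faceting step. The toy 2x2 layout, the counts, and the column names gridX_dest and gridY_dest are my assumptions; only gridX_orgn, gridY_orgn and count appear in the text above. It also assumes grid y values increase northwards, which is why the facet rows are reversed.

```r
library(ggplot2)

# Hypothetical 2x2 grid layout of four boroughs.
grid <- data.frame(
  borough = c("NW", "NE", "SW", "SE"),
  gridX = c(1, 2, 1, 2),
  gridY = c(2, 2, 1, 1)
)

# Every origin paired with every destination (a Cartesian product), carrying
# the grid position of both ends.
od <- merge(
  setNames(grid, c("orgn", "gridX_orgn", "gridY_orgn")),
  setNames(grid, c("dest", "gridX_dest", "gridY_dest")),
  by = NULL
)
od$count <- seq_len(nrow(od)) * 10  # made-up commute counts

# Facet rows are drawn top to bottom in level order, so reverse the levels
# to keep northern destinations at the top.
od$gridY_dest <- factor(od$gridY_dest,
                        levels = rev(sort(unique(od$gridY_dest))))

od_matrix <- ggplot(od) +
  geom_tile(aes(x = gridX_orgn, y = gridY_orgn, fill = count),
            colour = "white") +
  facet_grid(gridY_dest ~ gridX_dest) +
  scale_fill_gradient(low = "white", high = "darkblue") +
  coord_equal() +
  theme_void()
```

Each facet is one destination, placed at that destination's grid position, and the tiles inside it are the origins. That is the whole encoding: a grid of grids.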
Spatially-arranged OD-choropleths and OD-bezier
Rather than embedding a spatially-ordered matrix of origins into each grid cell representing a destination (workplace), we could embed a choropleth map of origins. This might help with communicating what is a reasonably novel encoding to the uninitiated. Additionally, when making inferences about a behaviour that is likely sensitive to distance — cycle commuting — there are advantages to preserving geography as much as possible (even though the precision remains a little misleading).
Below are spatially ordered choropleths and, because we can, spatially-ordered Beziers.
Again, the ggplot2 specification is relatively straightforward. I’m using the simple features package (sf) for working with geometry data and geom_sf(), which at the time of writing is only available through the development version of ggplot2.
geom_sf() is parameterised in almost exactly the same way as geom_sf() and of course
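To show how a spatially arranged OD-choropleth might be specified, the sketch below swaps the tiles for polygons drawn with geom_sf() and keeps the faceting on destination. The square "boroughs", the counts, the square() helper and the column names gridX_dest and gridY_dest are all hypothetical; real borough boundaries would be read from a file with sf instead.

```r
library(ggplot2)
library(sf)

# Hypothetical helper: a unit square polygon with its lower-left corner at (x, y).
square <- function(x, y) {
  st_polygon(list(rbind(c(x, y), c(x + 1, y), c(x + 1, y + 1),
                        c(x, y + 1), c(x, y))))
}

# Four made-up boroughs as simple features (no coordinate reference system).
boroughs <- st_sf(
  orgn = c("NW", "NE", "SW", "SE"),
  geometry = st_sfc(square(0, 1), square(1, 1), square(0, 0), square(1, 0))
)

# Grid position of each destination, crossed with every origin.
dest_grid <- data.frame(
  dest = c("NW", "NE", "SW", "SE"),
  gridX_dest = c(1, 2, 1, 2),
  gridY_dest = c(2, 2, 1, 1)
)
od <- merge(data.frame(orgn = dest_grid$dest), dest_grid, by = NULL)
od$count <- seq_len(nrow(od)) * 10  # made-up commute counts
od$gridY_dest <- factor(od$gridY_dest, levels = c(2, 1))  # north at the top

# Attach the origin geometry to every OD row: one polygon per origin per facet.
od_sf <- merge(boroughs, od, by = "orgn")

od_choropleth <- ggplot(od_sf) +
  geom_sf(aes(fill = count), colour = "white") +
  facet_grid(gridY_dest ~ gridX_dest) +
  scale_fill_gradient(low = "white", high = "darkblue") +
  theme_void()
```

The only changes from a tile-based version are the geometry column and the geom; the fill mapping and the faceting stay the same.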
It would be great to see these techniques used more widely by those working in spatial analysis domains. I’m currently working on a small data analysis and will soon share some findings in a second post. In the meantime, I’ve created a GitHub repository with a reproducible example.