Let’s Make a Map in R*

This is a short introduction to creating choropleth maps in R. We’ll show you how to download Ordnance Survey boundary layers, merge them with 2011 Census data and visualise the results in classed and unclassed choropleth maps.

The full R code used in the tutorial is available from here.

Setup

First we need to install a number of R packages …

install.packages("sf", "tidyverse", "classInt", "viridis")

and then load them into our R session:

library(sf) ; library(tidyverse) ; library(classInt) ; library(viridis)

Reading spatial data

An excellent resource of geospatial data for Great Britain is the Office for National Statistics’ Open Geography Portal. Here you can download vector boundary data for a variety of administrative and statistical geographies. Most of the vector boundary files are available under the OS OpenData and the Open Government Licensing agreements but you need to acknowledge ONS and OS copyright and source in any outputs you create. More information about using the Portal is available from their FAQs.

We are going to download digital vector boundaries for electoral wards in Greater Manchester. The Portal provides vector boundary layers in three different resolutions (full, generalised and super generalised) and two formats (ESRI shapefile and GeoJSON). We are going to use the 2017 ward vector boundary layer with a super generalised resolution in GeoJSON format. A super generalised resolution is ideal for data visualisation — less detail means shorter loading times — and GeoJSON is an open format that comes in a single file.

You can identify the URL for the GeoJSON file by selecting the ‘APIs’ button and copying the GeoJSON path. Next we use the st_read function from the sf package to download the GeoJSON vector boundary layer and assign it to an object called ‘sf_gb’.

sf_gb <- st_read("https://opendata.arcgis.com/datasets/07194e4507ae491488471c84b23a90f2_3.geojson", quiet = TRUE)

The GeoJSON is loaded as an sf object, a data frame with features stored in rows and attributes in columns. The feature geometries of the object are stored in a list-column at the end. We can have a look at the attributes using the glimpse() function from the tidyverse package.

glimpse(sf_gb)

We note that the sf object has 8,669 features (rows) and 10 attributes (columns). In other words, there are 8,669 ward boundaries and 10 variables describing those features such as name (wd17nm) and total area (st_area_shape).

To inspect the geometry of the sf object we use the st_geometry function.

st_geometry(sf_gb)

The R console prints out the number of features, the geometry type, dimensions, bounding box, coordinate reference system (CRS), projection and the first five feature geometries (i.e. polygon coordinates). We can confirm that there are 8,669 features (wards) and that the vector boundary layer is projected in the World Geodetic System 1984 (WGS84), with longitude and latitude units of decimal degrees.

We can visualise the vector boundaries using a combination of the base R plot() function and st_geometry().

plot(st_geometry(sf_gb))
Wards in Great Britain

Filtering spatial data

Currently, we have all the wards for Great Britain but we are only interested in those for Greater Manchester. Unfortunately, there is no local authority field in the attribute table so we need to use a lookup table. Helpfully, the Open Geography Portal provides a range of lookup tables that allow users to match different geographies. Since we are using electoral wards for 2017 we need the Ward to Local Authority District (December 2017) lookup.

We can use the read_csv function to load the lookup table, filter() the results by the ‘LAD17NM’ variable for local authorities in Greater Manchester, and pull() out the corresponding ward codes (WD17CD).

lookup <- read_csv("https://opendata.arcgis.com/datasets/046394602a6b415e9fe4039083ef300e_0.csv") %>%
filter(LAD17NM %in% c("Bolton","Bury","Manchester","Oldham","Rochdale","Salford","Stockport","Tameside","Trafford","Wigan")) %>%
pull(WD17CD)

Now we need to filter the sf object by the Greater Manchester ward codes. Since sf objects work seamlessly with the tidyverse we can use the same filter() function that we used for the ward-local authority lookup data frame.

sf_gm <- sf_gb %>%
filter(wd17cd %in% lookup)

Let’s check the results. We have all 215 electoral wards for Greater Manchester.

plot(st_geometry(sf_gm))
Wards in Greater Manchester

We don’t need all of the attributes in the ward vector layer though. So we’ll use the select() function to choose and rename the ward code and name variables.

sf_gm <- sf_gm %>%
select(area_code = wd17cd, area_name = wd17nm)

Joining spatial data

Now we need some Census data. Let’s use the Nomis Table Finder to find 2011 Census data available at ward level. We’ll look at table QS502EW, which provides information on qualifications gained by residents aged 16 and over. There are various ways to pull data off of Nomis but we’ll just download the entire dataset for all wards. You can do this by selecting ‘Download (.csv)’ in the lefthand panel and choosing ‘wards 2011’ from the dropdown menu. A comma separated file called ‘bulk.csv’ will be automatically downloaded. We’ll then move the CSV file into a folder called ‘data’ and read it into our R session.

df <- read_csv("data/bulk.csv")

Next we need to identify the variables of interest, rename them and create a new variable using the mutate() function which represents the percentage of the usual resident population that have gained no qualifications.

df_census <- df %>%
select(area_code = `geography code`,
n = `Qualification: No qualifications; measures: Value`,
total = `Qualification: All usual residents aged 16 and over; measures: Value`) %>%
mutate(percent = (n/total)*100)

Now we are ready to join the Census data to the ward vector layer. This is achieved using the left_join() function and by supplying the sf object, data frame and the variable to join by.

sf_gm_census <- left_join(sf_gm, df_census, by = "area_code")

When we glimpse() the resulting object we observe six variables. The ward code, ward name and variables from the sf object and the three census variables from the data frame.

Creating a choropleth map

Now we are ready to create a choropleth map. A choropleth map uses colour to show variation in the values of a variable across administrative areas. There are two types of choropleth map: classed or unclassed. In classed choropleth maps values are grouped into intervals using a variety of classification methods like equal intervals, quantiles and natural breaks. These classes, typically between 3 and 7, are then mapped to a few discrete colours. In an unclassed choropleth map a continuous colour scale is used with gradations of colour indicating low to high values of a particular variable.

To create a classed choropleth map we need to assign groupings to the ‘percent’ variable. There are several ways to do this but the simplest is to assign an equal number of observations to each group. We can do this using the cut_number() function from the ggplot2 package. All we need to do is to supply the variable name and the number of class intervals.

Now let’s plot. There is a lot going on in this code, so I have provided comments in the code.

Equal intervals is a very blunt classification method. It’s not sensitive to outliers in a distribution and will often create empty classes. Natural breaks (Fisher-Jenks algorithm) is a more sophisticated classification method because it creates distinct groups of similar values by minimising the sum of variance in created classes. We’ll use the classIntervals() function from the classInt package. We just need to supply the ‘percent’ variable that we want to group into classes, the number of classes, and the classification method. We’ll choose 5 class intervals and the natural breaks or ‘jenks’ method for grouping values.

classes <- classIntervals(sf_gm_census$percent, n = 5, style = "jenks")

You can check the class intervals by running:

classes$brks

Next we’ll create a new column in our sf object using the base R cut() function to cut up our percent variable into distinct groups.

sf_gm_census <- sf_gm_census %>%
mutate(percent_class = cut(percent, classes$brks, include.lowest = T))

Now we are ready to plot again. The code is unchanged from the previous plot except that the fill aesthetic is mapped to the newly created ‘percent_class’ variable.

Finally, we’ll create an unclassed choropleth map using a continuous viridis colour palette. The code is largely unchanged except that the fill aesthetic is mapped to the ‘percent’ variable and we are using the scale_fill_viridis() function from the viridis package.

The decision whether to create a classed or unclassed choropleth map is a tradeoff between subjectivity and legibility. There is no ‘correct’ classification method to use in a classed choropleth map and it can can be difficult to judge fine colour differences in an unclassed map.

Further resources

Bivand, R. S., Pebesma, E. J., & Rubio, V. G. (2013). Applied spatial data analysis with R. Springer. 2nd ed.

Brunsdon, Chris, and Lex Comber. (2015). An Introduction to R for Spatial Analysis and Mapping. London: Sage.

Cairo, A. (2016). The Truthful Art: Data, Charts and Maps for Communication, New Riders, pp. 280–292.

Lovelace, R., Nowosad, J., & Muenchow, J. (in progress). Geocomputation with R. Available online via https://geocompr.robinlovelace.net/

*This is a nod to Mike Bostock’s Let’s Make a Map