This is a short introduction to creating choropleth maps in R. We’ll show you how to download vector boundary layers from the ONS Open Geography Portal, merge them with 2011 Census data and visualise the results in classed and unclassed choropleth maps.

The full R code used in the tutorial is available from here.

Setup

First we need to install the necessary packages:

install.packages(c("sf", "tidyverse", "classInt", "viridis"))

and then load them into our R session:

library(sf) ; library(tidyverse) ; library(classInt) ; library(viridis)

Reading spatial data

We are going to download digital vector boundaries for electoral wards in Greater Manchester from the ONS Open Geography Portal. The Portal provides vector boundary layers in three different resolutions (full, generalised and super generalised) and two formats (ESRI shapefile and GeoJSON). We are going to use the 2017 ward vector boundary layer with a super generalised resolution in GeoJSON format. A super generalised resolution is ideal for data visualisation because less detail means shorter loading times, and GeoJSON is an open format that comes in a single file.

You can identify the URL for the GeoJSON file by selecting the ‘APIs’ button and copying the GeoJSON path. Next we use the st_read function from the sf package to download the GeoJSON vector boundary layer and assign it to an object called ‘sf_gb’.

sf_gb <- st_read("https://opendata.arcgis.com/datasets/07194e4507ae491488471c84b23a90f2_3.geojson", quiet = TRUE)

The GeoJSON is loaded as an sf object, a data frame with features stored in rows and attributes in columns. The feature geometries of the object are stored in a list-column at the end. We can have a look at the attributes using the glimpse() function from the tidyverse package.

glimpse(sf_gb)

We note that the sf object has 8,669 features (rows) and 10 attributes (columns). In other words, there are 8,669 ward boundaries and 10 variables describing those features such as name (wd17nm) and total area (st_area_shape).

To inspect the geometry of the sf object we use the st_geometry function.

st_geometry(sf_gb)

The R console prints out the number of features, the geometry type, dimension, bounding box and coordinate reference system (CRS), followed by the first five feature geometries (i.e. polygon coordinates). We can confirm that there are 8,669 features (wards) and that the vector boundary layer uses the World Geodetic System 1984 (WGS84) coordinate reference system, with longitude and latitude recorded in decimal degrees.
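If you only want to check the coordinate reference system, the st_crs() function from the sf package returns it directly:

st_crs(sf_gb)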

We can visualise the vector boundaries using a combination of the base R plot() function and st_geometry().

plot(st_geometry(sf_gb))
Wards in Great Britain

Filtering spatial data

We only want the wards that fall within Greater Manchester, so we’ll use a lookup table that matches wards to their local authorities. We can use the read_csv() function to load the lookup table, filter() the results by the ‘LAD17NM’ variable for the ten local authorities in Greater Manchester, and pull() out the corresponding ward codes (WD17CD).

lookup <- read_csv("https://opendata.arcgis.com/datasets/046394602a6b415e9fe4039083ef300e_0.csv") %>%
  filter(LAD17NM %in% c("Bolton", "Bury", "Manchester", "Oldham", "Rochdale",
                        "Salford", "Stockport", "Tameside", "Trafford", "Wigan")) %>%
  pull(WD17CD)

Now we need to filter the sf object by the Greater Manchester ward codes. Since sf objects work seamlessly with the tidyverse we can use the same filter() function that we used for the ward-local authority lookup data frame.

sf_gm <- sf_gb %>%
  filter(wd17cd %in% lookup)

Let’s check the results. We have all 215 electoral wards for Greater Manchester.
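As a quick sanity check we can also count the features directly, for example:

nrow(sf_gm)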

plot(st_geometry(sf_gm))
Wards in Greater Manchester

We don’t need all of the attributes in the ward vector layer though. So we’ll use the select() function to choose and rename the ward code and name variables.

sf_gm <- sf_gm %>%
  select(area_code = wd17cd, area_name = wd17nm)

Joining spatial data

Next we load the 2011 Census data on qualifications, which we have previously saved as a local CSV file:

df <- read_csv("data/bulk.csv")

Next we need to identify the variables of interest, rename them and create a new variable using the mutate() function which represents the percentage of the usual resident population that have gained no qualifications.

df_census <- df %>%
  select(area_code = `geography code`,
         n = `Qualification: No qualifications; measures: Value`,
         total = `Qualification: All usual residents aged 16 and over; measures: Value`) %>%
  mutate(percent = (n/total)*100)

Now we are ready to join the Census data to the ward vector layer. This is achieved using the left_join() function and by supplying the sf object, data frame and the variable to join by.

sf_gm_census <- left_join(sf_gm, df_census, by = "area_code")
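For example, we can inspect the joined object with:

glimpse(sf_gm_census)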

When we glimpse() the resulting object we observe six variables: the ward code, ward name and geometry from the sf object, and the three Census variables (n, total and percent) from the data frame.

Creating a choropleth map

To create a classed choropleth map we need to assign the values of the ‘percent’ variable to groups. There are several ways to do this but the simplest is to divide the range of values into class intervals of equal width. We can do this using the cut_interval() function from the ggplot2 package. All we need to do is supply the variable and the number of class intervals.
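For example, assuming five classes and an illustrative column name of ‘percent_equal’:

sf_gm_census <- sf_gm_census %>%
  mutate(percent_equal = cut_interval(percent, n = 5))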

Now let’s plot. There is a lot going on in this code, so I have provided comments in the code.
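Something like the following sketch, using ggplot2’s geom_sf(); the ‘percent_equal’ column created above, the ColorBrewer palette and the labels are illustrative choices:

ggplot(sf_gm_census) +
  # draw the ward polygons, filling each by its class interval
  geom_sf(aes(fill = percent_equal), colour = "white") +
  # apply a sequential ColorBrewer palette to the classes
  scale_fill_brewer(palette = "Blues", name = "% with no qualifications") +
  # strip away axes, gridlines and background
  theme_void() +
  # add a title and data source
  labs(title = "Residents with no qualifications", caption = "Source: 2011 Census")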

Equal intervals is a very blunt classification method: it takes no account of how the values are distributed, so skewed data and outliers will often leave classes empty. Natural breaks (the Fisher-Jenks algorithm) is a more sophisticated classification method because it creates distinct groups of similar values by minimising the variance within each class. We’ll use the classIntervals() function from the classInt package. We just need to supply the ‘percent’ variable that we want to group into classes, the number of classes, and the classification method. We’ll choose 5 class intervals and the natural breaks or ‘jenks’ method for grouping values.

classes <- classIntervals(sf_gm_census$percent, n = 5, style = "jenks")

You can check the class intervals by running:

classes$brks

Next we’ll create a new column in our sf object using the base R cut() function to cut up our percent variable into distinct groups.

sf_gm_census <- sf_gm_census %>%
  mutate(percent_class = cut(percent, classes$brks, include.lowest = TRUE))

Now we are ready to plot again. The code is unchanged from the previous plot except that the fill aesthetic is mapped to the newly created ‘percent_class’ variable.
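Again as a sketch, with the fill aesthetic now mapped to the ‘percent_class’ variable:

ggplot(sf_gm_census) +
  # fill each ward by its natural breaks class
  geom_sf(aes(fill = percent_class), colour = "white") +
  scale_fill_brewer(palette = "Blues", name = "% with no qualifications") +
  theme_void()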

Finally, we’ll create an unclassed choropleth map using a continuous viridis colour palette. The code is largely unchanged except that the fill aesthetic is mapped to the ‘percent’ variable and we are using the scale_fill_viridis() function from the viridis package.
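A sketch of the unclassed version, with fill mapped to the continuous ‘percent’ variable and the viridis palette applied:

ggplot(sf_gm_census) +
  # fill each ward by its raw percentage value
  geom_sf(aes(fill = percent), colour = "white") +
  # continuous viridis colour scale
  scale_fill_viridis(name = "% with no qualifications") +
  theme_void()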

The decision whether to create a classed or unclassed choropleth map is a tradeoff between subjectivity and legibility: there is no ‘correct’ classification method for a classed choropleth map, while fine colour differences in an unclassed map can be difficult to judge.

Further resources

Brunsdon, C. and Comber, L. (2015). An Introduction to R for Spatial Analysis and Mapping. London: Sage.

Cairo, A. (2016). The Truthful Art: Data, Charts and Maps for Communication. New Riders, pp. 280–292.

Lovelace, R., Nowosad, J. and Muenchow, J. (in progress). Geocomputation with R. Available online at https://geocompr.robinlovelace.net/

*This is a nod to Mike Bostock’s Let’s Make a Map
