Interactive Mapping in Python with UK Census Data

Patrick O'Hara
11 min read · Feb 18, 2022


Introduction

In this article I will describe the process I followed to create this dashboard displaying maps of London showing UK Census Data:

Interactive Dashboard showing UK Census Data for London

I am new to GIS and mapping, so this was a voyage of discovery — I will describe what I learned, and some of the issues I encountered along the way, with their resolutions.

I assume familiarity with Python. The most common Python data visualization libraries are compared in the article The Top 6 Python Data Visualization Libraries: How to choose, which summarises them as follows:

  • Matplotlib — foundation and comprehensive library, highly customizable, not very easy to learn and use
  • Seaborn — easy to learn and use, good for exploratory data analysis, relatively customizable, can integrate with Matplotlib if need more customizations
  • Plotly (Dash) — easy to use and customizable, create interactive graphs and web applications
  • Bokeh — create interactive graphs and web applications
  • Folium — good for nice looking map graphs
  • Plotnine — grammar of graphics, based on the R ggplot2 library

The steps I followed, which will be described in detail below, were:

  1. Download UK Census Data, and Census Ward and Local Authority geographic information (as Shapefiles).
  2. Plot a simple fixed map using GeoPandas and Matplotlib.
  3. Change to use Plotly with GeoJSON geographic data.
  4. Add interactive functionality using Dash.

My code is available in my GitHub repository. When I show code fragments below I will refer to the file in the repository where it appears.

UK Census Data

Data

The UK Office for National Statistics is responsible for the Census. The 2021 Census results are still being prepared; the schedule is discussed here. The 2011 Census data is described and can be downloaded here. For this exercise, I downloaded Bulk Data, which is described as:

Bulk data products enable users, particularly managers of information systems, to load large volumes of tables into their own systems in automated ways.

The data is compiled by various areas:

Datasets available in standard bulk format generally contain data only for output areas, lower and middle layer super output areas, wards, local authorities, unitary authorities, counties, regions, and countries.

I downloaded the Detailed Characteristics on demography and families for merged wards, local authorities and regions. This is a 79MB Zip file, containing data in a structured CSV format, with an Excel catalogue. I extracted the files into a subdirectory of my project data directory. Let’s look at the code to read the data.

Tables

The different tables are described in the Excel catalogue Cell Numbered DC Tables 3.3.xlsx. The Index sheet of this file lists the tables, and I read it as follows. (This code appears in the file census_read_data.py in the repository.)

The index DataFrame looks like this:

   Table Number  Table Title
0  DC1104EW      DC1104EW: Residence type by sex by age
1  DC1106EW      DC1106EW: Schoolchildren and full-time student…
2  DC1107EW      DC1107EW: Marital and civil partnership status…
3  DC1108EW      DC1108EW: Living arrangements by sex by age
4  DC1109EW      DC1109EW: Household composition by age by sex

There are 25 tables of different statistics; each has its own sheet in the catalogue. Looking at the sheet for the first table DC1104EW, we see:

The table has categories for Residence Type, Sex and Age, with values for each combination of category values, including All categories. The values in the table are the index of the column in the data file for the statistic. The categories vary for each table, with one or two categories in columns, and one or two in the rows. I could not find a way to automatically cope with these varied column and row headings using pandas. Instead, I read the sheet into a DataFrame and then process the headings in my own code, part of which is shown below.

Calling read_table for DC1104EW gives this DataFrame:

   Age  Residence type        Sex      Dataset
0  All  All                   All      0001
1  All  All                   Males    0002
2  All  All                   Females  0003
3  All  Lives in a household  All      0004
4  All  Lives in a household  Males    0005

Each row in the DataFrame identifies the Dataset for a combination of the category values. The first row, with the value All for the three categories identifies Dataset 0001, which corresponds to the column DC1104EW0001 in the data file DC1104EWDATA.CSV.
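The naming convention just described can be captured in two small helpers. This is a minimal sketch: the function names are illustrative and are not taken from the repository.

```python
def data_file_name(table_name: str) -> str:
    """Return the name of the CSV data file for a census table."""
    return f"{table_name}DATA.CSV"


def data_column_name(table_name: str, dataset: str) -> str:
    """Return the data-file column for a table and a Dataset identifier."""
    return f"{table_name}{dataset}"


print(data_file_name("DC1104EW"))            # DC1104EWDATA.CSV
print(data_column_name("DC1104EW", "0001"))  # DC1104EW0001
```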

Data Files

Each table has many CSV files; the file we load for DC1104EW is DC1104EWDATA.CSV. As explained above, this data has columns for the different combinations of the category values. The rows have the counts for different geographical areas. We read the data as follows:
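A minimal sketch of reading such a file with pandas is shown here. It assumes the CSV has a header row containing GeographyCode followed by the dataset columns; the function name read_data is illustrative, and the in-memory sample reuses two rows from the table in this article so that the example runs without the download.

```python
import io

import pandas as pd


def read_data(source) -> pd.DataFrame:
    """Read a census data CSV; `source` may be a path or a file-like object."""
    # Keep GeographyCode as text; the remaining columns are integer counts.
    return pd.read_csv(source, dtype={"GeographyCode": str})


# Two rows and two dataset columns copied from the table in this article.
sample = io.StringIO(
    "GeographyCode,DC1104EW0001,DC1104EW0002\n"
    "K04000001,56075912,27573376\n"
    "E92000001,53012456,26069148\n"
)
df = read_data(sample)
print(df.head())
```

For the real data, the same function would be called with the path to DC1104EWDATA.CSV in the extracted download directory.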

The head of the DataFrame for table DC1104EW is:

   GeographyCode  DC1104EW0001  DC1104EW0002  ...  DC1104EW0198
0  K04000001          56075912      27573376  ...        156146
1  E92000001          53012456      26069148  ...        146915
2  W92000004           3063456       1504228  ...          9231
3  E12000001           2596886       1269703  ...          7608
4  E12000002           7052177       3464685  ...         20369

Geography

The census data is summarised by Merged Ward and by Local Authority District. Merged Wards refer to Electoral Wards, where a few small wards are merged to protect privacy. Merged Wards are assigned to Local Authorities, which are themselves assigned to Regions. The geography data is published on the Open Geography Portal in a number of formats:

  • Shapefile — A geospatial vector data format for geographic information system (GIS) software. It is developed and regulated by Esri.
  • GeoJSON — An open standard format designed for representing simple geographical features, along with their non-spatial attributes, based on JSON.
  • KML — Keyhole Markup Language is an XML format developed for use with Google Earth.

The Shapefile format is much more compact than GeoJSON and is supported by GeoPandas (see below), so this is what I chose to download. (Plotly requires GeoJSON, which I created from the Shapefiles later.)

The geography data for Wards and Local Authority Districts (LADs) that I used is on the Open Geography Portal under the menu options Boundaries | Census Boundaries | Census Merged Wards and Boundaries | Administrative Boundaries | Local Authority Districts. The files I downloaded were:

  1. Census_Merged_Wards_(December_2011)_Boundaries — Download Shapefile format.
  2. Local_Authority_Districts_(December_2011)_Boundaries_EW_BFC — Download Shapefile format.
  3. Ward_to_Census_Merged_Ward_to_Local_Authority_District_(December_2011)_Lookup_in_England_and_Wales — Download CSV format

The Shapefiles I downloaded are high resolution, so large: about 120MB and 40MB respectively. The portal has lower resolution versions that are a tenth of the size if you prefer to use those. (Alternatively, you could use a site like mapshaper.org to compress the files to your preferred resolution.)

These Shapefiles use the Coordinate Reference System OSGB36 / British National Grid (EPSG:27700). To map the data we need to convert it to EPSG:4326 (WGS84 longitude and latitude), as we will see in the code below. (I am afraid that it took me a long, frustrating time, during which no maps were displayed by Plotly, to find this out!)

I used the lookup CSV file to create a geography lookup DataFrame, with rows for Merged Wards and Local Authority Districts, and columns GeographyCode and Name:
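One way to build such a lookup with pandas is sketched here. The source column names (CMWD11CD, CMWD11NM, LAD11CD, LAD11NM) are assumed from the portal's naming convention and should be checked against the downloaded CSV; the function name and the merged ward codes in the sample are placeholders, not real values.

```python
import pandas as pd


def build_geography_lookup(lookup: pd.DataFrame) -> pd.DataFrame:
    """Stack merged-ward and LAD codes into one GeographyCode/Name table."""
    # Column names are assumptions; adjust them to match the lookup CSV.
    wards = lookup[["CMWD11CD", "CMWD11NM"]].rename(
        columns={"CMWD11CD": "GeographyCode", "CMWD11NM": "Name"}
    )
    lads = lookup[["LAD11CD", "LAD11NM"]].rename(
        columns={"LAD11CD": "GeographyCode", "LAD11NM": "Name"}
    )
    # Several wards share a merged ward and a LAD, so drop the repeats.
    return pd.concat([wards, lads]).drop_duplicates().reset_index(drop=True)


# Placeholder rows: three wards, two merged wards, one LAD.
sample = pd.DataFrame(
    {
        "WD11CD": ["W-A", "W-B", "W-C"],
        "CMWD11CD": ["MW-1", "MW-1", "MW-2"],
        "CMWD11NM": ["Merged ward one", "Merged ward one", "Merged ward two"],
        "LAD11CD": ["E09000009"] * 3,
        "LAD11NM": ["Ealing"] * 3,
    }
)
print(build_geography_lookup(sample))
```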

GeoPandas and Matplotlib

Having downloaded the data, we are ready to produce our first map! While I downloaded data for the whole of England and Wales, I will restrict the mapping to London for simplicity.

GeoPandas is an open source project to make working with geospatial data in Python easier. GeoPandas extends the datatypes used by pandas to allow spatial operations on geometric types. Geometric operations are performed by shapely. GeoPandas further depends on Fiona for file access and Matplotlib for plotting.

My first attempt to install GeoPandas on Windows failed. Because it depends on packages that are implemented in C/C++, installing it on Windows requires special steps, which are described in the Appendix below. (Apparently the install is straightforward on Linux and Mac.)

First, I loaded the Shapefile. (Code is in census_read_geopandas.py.)

GeoPandas loads the Shapefile, and then we convert the co-ordinates as discussed above, and filter the rows by the LAD code for London LADs.
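The load, convert and filter steps might be sketched as follows. This is not the repository code: the function names are illustrative, the lad11cd column name is assumed from the GeoJSON property used later in this article, and the filter relies on London borough codes (including the City of London) starting with E09.

```python
import geopandas as gpd


def to_london_wgs84(
    lads: gpd.GeoDataFrame, code_column: str = "lad11cd"
) -> gpd.GeoDataFrame:
    """Reproject boundaries to WGS84 and keep only the London districts."""
    # EPSG:4326 is WGS84 longitude/latitude, which Plotly expects.
    lads = lads.to_crs(epsg=4326)
    # Assumption: London LAD codes all start with "E09".
    return lads[lads[code_column].str.startswith("E09")]


def read_london_lads(shapefile_path: str) -> gpd.GeoDataFrame:
    """Load the LAD Shapefile and return the London districts in WGS84."""
    return to_london_wgs84(gpd.read_file(shapefile_path))
```

Calling read_london_lads with the path to the downloaded LAD Shapefile would return a GeoDataFrame ready for plotting.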

Here is simple code to map the LADs. (Code is in census_geopandas_script.py.)

This displays a window with default formatting:

We can load the data to plot on the map as follows:
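A sketch of joining the census counts to the boundaries and plotting them is shown here. The function names are illustrative, and the lad11cd column name is assumed from the GeoJSON property used later in this article.

```python
import pandas as pd


def join_census_data(
    lads: pd.DataFrame, data: pd.DataFrame, code_column: str = "lad11cd"
) -> pd.DataFrame:
    """Attach census counts to each boundary row by geography code."""
    # With a GeoDataFrame on the left, the merged result is a GeoDataFrame.
    return lads.merge(data, left_on=code_column, right_on="GeographyCode")


def plot_choropleth(joined, datacol: str = "DC1104EW0001"):
    """Draw a choropleth of `datacol`; `joined` must be a GeoDataFrame."""
    # GeoDataFrame.plot colours each geometry by the named column.
    return joined.plot(column=datacol, legend=True)
```

In a script, plot_choropleth(join_census_data(london_lads, data)) followed by matplotlib.pyplot.show() would open the map window, where london_lads and data stand for the boundaries and the data file loaded earlier.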

This displays a window with a choropleth map, colour coded according to the first dataset for the table DC1104EW, i.e. the column DC1104EW0001:

GeoJSON and Plotly

To add interactive features I used Plotly and Dash. Plotly requires geographical data in GeoJSON format. This format is much larger than Shapefile, and the conversion from Shapefile is time-consuming: about 45 seconds for the Ward file in my environment.

I decided to cache the GeoJSON locally. (Code in file census_read_geojson.py.)

There are two cached JSON files: the London LAD Boundaries, and the complete LAD Boundaries. The JSON files are simply converted from GeoPandas.
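The caching idea can be sketched with the standard library alone. The function name and the build callback are hypothetical; the callback stands for whatever performs the slow Shapefile conversion.

```python
import json
from pathlib import Path
from typing import Callable


def load_cached_geojson(cache_path: Path, build: Callable[[], dict]) -> dict:
    """Return GeoJSON from `cache_path`, building and saving it on first use."""
    if cache_path.exists():
        with cache_path.open(encoding="utf-8") as f:
            return json.load(f)
    geojson = build()  # the slow conversion only runs when no cache exists
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("w", encoding="utf-8") as f:
        json.dump(geojson, f)
    return geojson
```

With GeoPandas, the callback could be lambda: json.loads(london_lads.to_json()), since GeoDataFrame.to_json() returns a GeoJSON string.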

The first Plotly map uses the GeoJSON files. (Code in census_plotly_script.py.)

Observations on the code:

  • The px.choropleth function call maps the london_lads GeoJSON, colouring the map according to the datacol (DC1104EW0001) column of the london_lad_df, matching the GeoJSON feature property key (properties.lad11cd) with the locationcol (GeographyCode) column.
  • The update_geos function call sets the bounds of the map from the displayed locations, and hides the underlying map.
  • The update_layout function call reduces the margin around the map, and specifies the width and height.

This script creates a browser window:

The code specifies the width and height for two reasons: the default size is smaller, and the map needs an appropriate aspect ratio. However, the map is still surrounded by a large amount of white space. I improved this by manually specifying the bounds for the map:

Observations on the code:

  • The package turfpy.measurement provides a function bbox to compute the bounding box for a GeoJSON feature geometry.
  • The function compute_bbox computes the bounding box for each feature and reduces the list of bounding boxes to compute the combined bounding box.
  • Then update_geos uses the box to specify the center and longitude and latitude ranges for the map.
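To show the arithmetic behind these steps, the sketch below computes the boxes in plain Python rather than calling turfpy, so it is a stand-in for the approach and not the code described above. Boxes use the GeoJSON order of west, south, east, north; every function name except compute_bbox is illustrative.

```python
from functools import reduce


def iter_positions(coords):
    """Yield every [lon, lat] position in a nested GeoJSON coordinate array."""
    if isinstance(coords[0], (int, float)):
        yield coords
    else:
        for part in coords:
            yield from iter_positions(part)


def feature_bbox(feature):
    """Return [min_lon, min_lat, max_lon, max_lat] for one feature."""
    positions = list(iter_positions(feature["geometry"]["coordinates"]))
    lons = [p[0] for p in positions]
    lats = [p[1] for p in positions]
    return [min(lons), min(lats), max(lons), max(lats)]


def combine(a, b):
    """Return the smallest box that contains boxes a and b."""
    return [min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3])]


def compute_bbox(geojson):
    """Reduce the per-feature boxes to one combined bounding box."""
    return reduce(combine, (feature_bbox(f) for f in geojson["features"]))


def geo_settings(bbox):
    """Keyword arguments for fig.update_geos() that frame the bounding box."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return {
        "center": {"lon": (min_lon + max_lon) / 2, "lat": (min_lat + max_lat) / 2},
        "lonaxis_range": [min_lon, max_lon],
        "lataxis_range": [min_lat, max_lat],
    }
```

The result would be applied with fig.update_geos(**geo_settings(compute_bbox(london_lads))).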

Dash

Dash is built on top of Plotly and

abstracts away all of the technologies and protocols that are required to build a full-stack web app with interactive data visualization

I used Dash to allow selection of the table, dataset and granularity for the map.

In my first attempt I used the Dash Core Components to add the selection controls. While functional, the appearance was not great, so I switched to using Dash Bootstrap Components, which provide the consistent Bootstrap look and feel to the controls without needing CSS expertise.

The first version just allowed selection of the map granularity, adding these controls:

The Granularity radio items change the granularity of the map between Local Authority, which we have seen so far, and Ward, for a more detailed map. The dropdown selection optionally specifies the Local Authority for a Ward map.

The code is in the file census_dash_script_simple.py.

Observations on the code:

  • This script reads the Ward data, in addition to the LAD data, using crd.read_london_ward_geojson().
  • The Dash functionality starts after the comment # Dash.
  • The initial figure displayed by Dash is a chart, which I did not want to see, so the function blank_fig() creates a blank figure to use as the initial display. (Credit to this Stack Overflow answer.)
  • The call to dash.Dash() creates the Dash application, using the standard Bootstrap stylesheet.
  • The assignment to map_controls creates the selection controls.
  • The assignment to app.layout creates a container for the page with heading, map_controls and placeholder for the map initially showing the blank figure.
  • The @app.callback annotation defines a callback function that is called when either of the selection controls is updated, returning the updated figure.
  • The callback builds the map, using Plotly as before, using the data appropriate to the controls. One minor change is the hover_data=[namecol, 'LAD11NM'], which adds the LAD name to the hover display.
  • app.run_server() starts the Dash server. In my environment it is accessed on http://127.0.0.1:8050/ and displays this page:

If I change the granularity to Ward and select the Local Authority Ealing I get this map; the image shows the hover text with the LAD name:

Next I added controls to select the table and dataset. This required many more inputs and outputs on the callback. Ideally, I would like to have multiple callbacks chained together as described in Dash Basic Callbacks. However, callbacks must be stateless, so it is not possible for the table selection to update global state with the table data to be displayed on the dataset selection. So, I ended up with one large callback that does everything.

The code is in census_dash_script_full.py. I will not reproduce it here, but summarise the changes:

  • The assignment to table_controls creates the table selection control.
  • The assignment to category_controls creates four category selection controls. Initially they are empty and their style is set to display: none so they are invisible.
  • The app.layout includes the new controls before the map_controls from above.
  • The app.callback defines additional outputs for the category control label, values and style, and additional inputs for the table and category values.

The callback proceeds as follows:

  • If the table_name is not set then raise the exception PreventUpdate so the figure is not updated.
  • Call crd.get_table_column_names_and_values() to get the category names and values for the table.
  • Set the outputs for the category names and values. Also, for categories that have a value already selected build a query string to filter the table by the value.
  • If all the table categories have selected values then query the table to get the dataset name, load the data file, and display the map for the appropriate column.
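The query step at the end of this list can be sketched with pandas. The function names are hypothetical and the repository's logic may differ; the sample table reuses the DC1104EW rows shown earlier in this article.

```python
import pandas as pd


def build_query(selections: dict) -> str:
    """Build a DataFrame.query() string from the selected category values."""
    # Backticks allow column names with spaces, such as "Residence type".
    return " and ".join(
        f"`{col}` == {value!r}"
        for col, value in selections.items()
        if value is not None
    )


def find_dataset(table: pd.DataFrame, selections: dict):
    """Return the Dataset id once every category has a value, else None."""
    categories = [c for c in table.columns if c != "Dataset"]
    if any(selections.get(c) is None for c in categories):
        return None  # some category is still unselected
    rows = table.query(build_query({c: selections[c] for c in categories}))
    return rows["Dataset"].iloc[0] if len(rows) == 1 else None


table = pd.DataFrame(
    {
        "Age": ["All"] * 5,
        "Residence type": ["All", "All", "All",
                           "Lives in a household", "Lives in a household"],
        "Sex": ["All", "Males", "Females", "All", "Males"],
        "Dataset": ["0001", "0002", "0003", "0004", "0005"],
    }
)
chosen = {"Age": "All", "Residence type": "Lives in a household", "Sex": "Males"}
print(find_dataset(table, chosen))  # 0005
```

The returned id is then appended to the table name to give the column to map, for example DC1104EW0005.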

A sample display is shown below:

The next display shows Ward level data for all LADs in London:

Conclusion

This article describes my journey building a solution for a non-trivial mapping requirement using GeoPandas and Matplotlib, and then Plotly and Dash. Along the way I learnt a lot, and solved some specific problems:

  • Installing GeoPandas on Windows
  • Ensuring Shapefiles use the appropriate Coordinate Reference System (CRS)
  • Sizing a Plotly map by computing the bounding box
  • Creating a Dash dashboard that is initially blank
  • Creating a Dash dashboard with hidden controls

I hope you find it useful!

Appendix — Python Packages

I used Python 3.9 on Windows 11 and normally use pip to install packages. However, geopandas depends on packages that are implemented in C/C++, so special procedures are required to install it on Windows. (Apparently the install is straightforward on Linux and Mac.)

The article Using geopandas on Windows by Geoff Boeing is often cited as the definitive explanation of how to install geopandas on Windows. The comments on the article have many suggestions on how best to proceed. The essential advice is to use pipwin, which installs unofficial Python package binaries for Windows provided by Christoph Gohlke here. These are the steps I followed:

pip install wheel
pip install pipwin
pipwin install numpy
pipwin install pandas
pipwin install gdal
# Add the new GDAL path to the Windows PATH environment variable,
# e.g. C:\Users\<username>...
# \AppData\Local\Programs\Python\Python39\Lib\site-packages\osgeo
# It seems this must be done before installing Fiona.
pipwin install fiona
pipwin install pyproj
pipwin install rtree
pipwin install shapely
pipwin install six
pip install geopandas
