TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Working with Geospatial Data at the Postcode Level

9 min read · Jul 6, 2023

In some countries, postcodes are points or routes, not areas. For example, the last three characters of a Canadian postcode identify the local delivery unit, which may correspond to the houses on one side of a street or to a rural route. Similarly, a UK postcode such as “YO8 9UR” can cover an area as small as a single building in London. In a US 5+4 zipcode, the last four digits identify a postal delivery route (that is, a set of addresses), not an area. Contrary to common belief, US 5-digit zipcodes are not areas either; they are simply collections of 5+4 postal routes, typically served from a single post office.

France, as befits the originator of the metric system, is very logical. A French postcode corresponds to an area: in Paris, the last two digits give the arrondissement, so 75008 is the 8th arrondissement of Paris and is truly an area. Mail delivery routes are probably suboptimal, though.

Because people and stores have addresses, which have associated postcodes, most consumer data is reported at a postcode level. In order to carry out computations such as areal coverage, market share, etc., it is necessary to determine the areal extent of a postcode. This is easy in France, but difficult in any country where postcodes are postal routes rather than areas.

UK postcodes are Royal Mail delivery addresses, not areas. This is true of Canada and the United States also. Photo by Monty Allen on Unsplash

Because their postcodes are mail delivery addresses, there are infinitely many polygons that can be drawn to partition the UK/Canada/US into valid postcode “regions”. This is why UK demographic data is published by the Office for National Statistics (ONS) for administrative areas (such as counties), not postcodes. The US census publishes data at the “ZIP code tabulation area” (ZCTA) level, and US voting data is published at the county level. When working with UK/Canada/US data, you’ll often have a similar mixture of addresses (which are points) and spatial data collected over an area. How do you associate these with one another?
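Conceptually, the association is a point-in-polygon test: an address point belongs to whichever area polygon contains it. Here is a minimal, dependency-free sketch using the classic ray-casting algorithm on made-up coordinates (in practice you would use a spatial library rather than rolling your own):

```python
def point_in_polygon(lon, lat, ring):
    """Return True if (lon, lat) lies inside the polygon ring [(x, y), ...]."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # does a horizontal ray from the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# a made-up 1-degree-square "ward" polygon and two address points
ward = [(-1.0, 51.0), (0.0, 51.0), (0.0, 52.0), (-1.0, 52.0)]
print(point_in_polygon(-0.5, 51.5, ward))  # True: inside the ward
print(point_in_polygon(0.5, 51.5, ward))   # False: outside it
```

This is exactly what ST_CONTAINS / ST_INTERSECTS does at scale, which is why you need polygons for the areas, not just postcode centroids.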

To illustrate, I’ll tie together UK postcode data and census data in this article.

Download link

If you are in a hurry, you can download the results of my analysis from https://github.com/lakshmanok/lakblogs/tree/main/uk_postcode — there are a couple of CSV files there, and they contain the data you may need.

ukpopulation.csv.gz has the following columns:

postcode,latitude,longitude,area_code,area_name,all_persons,females,males

ukpostcodes.csv.gz has one extra column — the polygon for each postcode in WKT format:

postcode,latitude,longitude,area_code,area_name,all_persons,females,males,geometry_wkt

Please note that use of the data or code is at your own risk — it is distributed on an “AS IS” BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

In this article, I’ll step through how I created the dataset in that GitHub repo. You can follow along with me using the notebook uk_postcodes.ipynb.

Raw data

We start from three sources of raw data released under the UK Open Government License:

  1. Free Map Tools has a downloadable file with the centroid lat and lon of each postcode. This is not enough for spatial analysis, since you cannot do things like ST_INTERSECTS with just a point. But it’s a good start.
  2. The ONS has published census data for areas such as “County Durham”. These are not postcodes, though. They are typically at the ward, county, or region level.
  3. The UK statistics office has helpfully identified the areas associated with each postcode. Every postcode lives in several areas defined for different purposes and at different resolutions. This includes, but is not limited to, the parish, ward, county, and region that the postcode is in. For completeness, here are all the associations available (depending on your spatial dataset, you may need the other columns):
pcd,pcd2,pcds,dointr,doterm,oscty,ced,oslaua,osward,parish,usertype,oseast1m,osnrth1m,osgrdind,oshlthau,nhser,ctry,rgn,streg,pcon,eer,teclec,ttwa,pct,itl,statsward,oa01,casward,park,lsoa01,msoa01,ur01ind,oac01,oa11,lsoa11,msoa11,wz11,sicbl,bua11,buasd11,ru11ind,oac11,lat,long,lep1,lep2,pfa,imd,calncv,icb,oa21,lsoa21,msoa21

My notebook downloads the data files using wget:

mkdir -p indata
cd indata
if [ ! -f census2021.xlsx ]; then
  wget -O census2021.xlsx https://www.ons.gov.uk/file?uri=/peoplepopulationandcommunity/populationandmigration/populationestimates/datasets/populationandhouseholdestimatesenglandandwalescensus2021/census2021/census2021firstresultsenglandwales1.xlsx
fi

Reading data

Reading CSV directly into Pandas is straightforward:

import pandas as pd
pd.read_csv(POSTCODES_CSV)

This gives me the centroid lat-lon of every postcode:

Centroid lat-lon for every postcode

There are many packages that allow you to read Excel files into Pandas, but I decided to use DuckDB because I’ll be using it later in the notebook to join the three datasets using SQL:

import duckdb
conn = duckdb.connect()

ukpop = conn.execute(f"""
install spatial;
load spatial;

SELECT
*
FROM
st_read('{POPULATION_XLS}', layer='P01')
""").df()

This Excel file has 7 rows of header info that I can drop. I also rename the columns to meaningful variables:

ukpop = ukpop.drop(range(0, 7), axis=0).rename(columns={
    'Field1': 'area_code',
    'Field2': 'area_name',
    'Field3': 'all_persons',
    'Field4': 'females',
    'Field5': 'males'
})

That was the sheet named P01. Note that the P04 sheet has population density information, but it is not useful to us because population is not distributed evenly over an area code. Instead, we’ll derive the population of each postcode.

I write this out to a CSV file so that I can easily read it from DuckDB.

ukpop.to_csv("temp/ukpop.csv", index=False)

Similarly, I extract the necessary columns from the UK statistics office file and write it to a CSV file:

onspd = pd.read_csv(ONSPD_CSV)
onspd = onspd[['pcd', 'oslaua', 'osward', 'parish']]
onspd.to_csv("temp/onspd.csv", index=False)

Associating data

Now, we can use DuckDB to join the three prepared datasets to get the population density at every postcode. Why DuckDB? Even though I can do the join in Pandas, I find SQL to be much more readable. Besides, this gave me an excuse to use the new hot thing.

I join the datasets by first reading them into DuckDB using read_csv_auto. Then, I look up the ward, parish, and county that each postcode is in, and find the area (parish, ward, or county) at which the population data is reported:


/* pcd,oslaua,osward,parish */
WITH onspd AS (
  SELECT *
  FROM read_csv_auto('temp/onspd.csv', header=True)
),

/* area_code,area_name,all_persons,females,males */
ukpop AS (
  SELECT *
  FROM read_csv_auto('temp/ukpop.csv', header=True)
),

/* id,postcode,latitude,longitude */
postcodes AS (
  SELECT *
  FROM read_csv_auto('indata/ukpostcodes.csv', header=True)
),

/* postcode, area_code */
postcode_to_areacode AS (
  SELECT
    pcd AS postcode,
    ANY_VALUE(area_code) AS area_code
  FROM onspd
  JOIN ukpop
    ON (area_code = oslaua OR area_code = osward OR area_code = parish)
  GROUP BY pcd
)

SELECT
  postcode, latitude, longitude,  /* from postcodes */
  area_code, area_name,           /* from ukpop */
  all_persons, females, males     /* from ukpop, but has to be spatially corrected */
FROM postcode_to_areacode
JOIN postcodes USING (postcode)
JOIN ukpop USING (area_code)

Note that the spatial quantities are scalars that correspond to the whole area and not the postcode. They have to be split among the postcodes.
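If you prefer to stay in Pandas, the OR-join above can be emulated by melting the three candidate area columns into one and then merging. A sketch on invented rows (the column names follow the article; the values are made up):

```python
import pandas as pd

# tiny stand-ins for the three datasets (invented rows)
onspd = pd.DataFrame({'pcd': ['AL1 1AA', 'B1 1AA'],
                      'oslaua': ['E07000240', 'E08000025'],
                      'osward': ['W1', 'W2'],
                      'parish': ['P1', 'P2']})
ukpop = pd.DataFrame({'area_code': ['E07000240', 'E08000025'],
                      'area_name': ['St Albans', 'Birmingham'],
                      'all_persons': [147373, 1144900]})

# melt the three candidate area columns into one, mirroring the SQL
#   ON (area_code = oslaua OR area_code = osward OR area_code = parish)
melted = onspd.melt(id_vars='pcd', value_vars=['oslaua', 'osward', 'parish'],
                    value_name='area_code')
joined = (melted.merge(ukpop, on='area_code')
                .groupby('pcd', as_index=False).first())  # ANY_VALUE equivalent
print(joined[['pcd', 'area_code', 'area_name']])
```

The groupby().first() plays the role of ANY_VALUE in the SQL: if a postcode happens to match on more than one area column, we keep just one match.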

Split area quantities among the postcodes

The all_persons, females, and males above correspond to the whole area, not to the specific postcode. We could apportion them based on the area of each postcode, but there are infinitely many polygons that fit the postcodes, and as we will see later, the areal extent of postcodes near parks and lakes is a bit iffy. So we’ll do something simple that gives us a single, unique answer: split the scalar value evenly among all the postcodes in the area. This is not as strange as it sounds. In higher-density neighborhoods, there are more postcodes, so dividing equally among the postcodes is akin to distributing the scalar quantity proportionally to the population density.

npostcodes = ukpop.groupby('area_code')['postcode'].count()
for col in ['females', 'males', 'all_persons']:
    ukpop[col] = ukpop.apply(
        lambda row: row[col] / npostcodes[row['area_code']], axis=1)
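The row-wise apply() can also be vectorized with groupby().transform('count'), which should give the same result and is much faster on a large frame. A sketch on a tiny invented frame (as in the article, the area total is repeated on every postcode row before the split):

```python
import pandas as pd

# synthetic stand-in: three postcodes across two areas (invented numbers);
# each row initially carries its area's total population
ukpop = pd.DataFrame({'postcode': ['AB1 1AA', 'AB1 1AB', 'CD2 2AA'],
                      'area_code': ['A1', 'A1', 'A2'],
                      'all_persons': [100.0, 100.0, 50.0]})

# vectorized equivalent of the apply() loop: divide each row by the
# number of postcodes in its area
n = ukpop.groupby('area_code')['postcode'].transform('count')
ukpop['all_persons'] = ukpop['all_persons'] / n

print(ukpop['all_persons'].tolist())  # [50.0, 50.0, 50.0]
```

Summing the split values back up per area recovers the original area totals, which is a handy sanity check on real data.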

At this point, we have the quantity for each postcode — this is the association that we needed:

So, write it out:

ukpop.to_csv("ukpopulation.csv", index=False)

Areal extent of postcodes

For many analyses, we’ll want the postcodes to be areas, not points. Even though there are infinitely many polygons that divide the UK such that there is exactly one postcode centroid in each polygon, there does exist a notion of the “best” polygon. That is the Voronoi partition, which splits the plane such that every point belongs to the region of the postcode centroid closest to it:

Voronoi analysis of 20 points, from Wikipedia

To compute this, we can use scipy:

import numpy as np
from scipy.spatial import Voronoi, voronoi_plot_2d

points = df[['latitude', 'longitude']].to_numpy()
vor = Voronoi(points)

I’m assuming here that the area is small enough that there isn’t much of a difference between the geodesic distance and Euclidean distance computed from the latitude and longitude. UK postcodes are small enough that this is the case.
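You can check this assumption numerically: at typical postcode spacing (~0.01 degrees), a flat-earth distance agrees with the haversine great-circle distance to well under a metre. A quick sketch (the two coordinates are arbitrary points near London, not from the dataset):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two lat/lon points."""
    R = 6371.0
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(math.radians(lat1)) * math.cos(math.radians(lat2))
           * math.sin(dlmb / 2) ** 2)
    return 2 * R * math.asin(math.sqrt(a))

# two points 0.01 degrees apart near London, roughly the spacing
# of neighbouring postcode centroids
lat1, lon1, lat2, lon2 = 51.50, -0.12, 51.51, -0.11
d = haversine_km(lat1, lon1, lat2, lon2)

# flat-earth (equirectangular) approximation of the same distance
R = 6371.0
x = math.radians(lon2 - lon1) * math.cos(math.radians((lat1 + lat2) / 2))
y = math.radians(lat2 - lat1)
d_flat = R * math.sqrt(x * x + y * y)

print(d, d_flat)  # both about 1.3 km; they differ by far less than a metre
```

At country scale the two diverge, so this shortcut only works because each Voronoi cell is tiny.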

The result is organized such that, for every point, there is a region consisting of a set of vertices. We can create a WKT polygon string for each point using:

def make_polygon(point_no):
    region_no = vor.point_region[point_no]
    region = vor.regions[region_no]
    # skip empty and open regions (open regions contain the point at
    # infinity, marked by vertex index -1)
    if len(region) >= 3 and -1 not in region:
        # close the ring
        closed_region = region.copy()
        closed_region.append(closed_region[0])
        # create a WKT polygon from the vertices (lon lat order)
        return "POLYGON ((" + ','.join(
            f"{vor.vertices[v][1]} {vor.vertices[v][0]}" for v in closed_region
        ) + "))"
    else:
        return None

Here’s an example result:

POLYGON ((-0.32491691953979235 51.7393550489536,-0.32527234008402217 51.73948967705648,-0.32515738641624575 51.73987124225542,-0.3241646650618929 51.74087626616231,-0.3215663358407994 51.742660660928614,-0.32145633473723817 51.742228570262824,-0.32491691953979235 51.7393550489536))
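You can sanity-check a generated WKT string without any GIS library: parse out the coordinate pairs, verify the ring is closed, and compute its planar (shoelace) area. A sketch using a rounded version of the polygon above:

```python
# rounded version of the WKT polygon shown above
wkt = ("POLYGON ((-0.3249 51.7394,-0.3253 51.7395,-0.3252 51.7399,"
       "-0.3242 51.7409,-0.3216 51.7427,-0.3215 51.7422,-0.3249 51.7394))")

# strip "POLYGON ((" and "))", then split into (lon, lat) tuples
coords = [tuple(map(float, pair.split()))
          for pair in wkt[len("POLYGON (("):-2].split(',')]

assert coords[0] == coords[-1], "ring must be closed"

# shoelace formula for the enclosed planar area (in square degrees)
area = 0.5 * abs(sum(x1 * y2 - x2 * y1
                     for (x1, y1), (x2, y2) in zip(coords, coords[1:])))
print(area > 0)  # True for any non-degenerate ring
```

Checks like these catch unclosed rings or degenerate polygons before they reach a spatial database, where they would fail (or silently be dropped by SAFE. functions) much later.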

We can create a GeoDataFrame and plot a subset of postcodes:

import geopandas as gpd
from shapely import wkt
df['geometry'] = gpd.GeoSeries.from_wkt(df['geometry_wkt'])
gdf = gpd.GeoDataFrame(df, geometry='geometry')

gdf[gdf['area_name'] == 'St Albans'].plot()
Postcode spatial extent for St. Albans. Image generated by author.

Here’s Birmingham:

gdf[gdf['area_name'] == 'Birmingham'].plot()
Postcode spatial extent for Birmingham. Image generated by author.

Unpopulated areas

Note the horn at the top and the large area of blue in the middle. What’s going on? Let’s look at Birmingham in Google Maps:

Birmingham, screenshot of Google Maps by author.

Notice the park areas? The Royal Mail doesn’t have to deliver to anyone there. So there are no postcodes there. Therefore, the nearby postcodes get “extended” into those areas. This will cause problems in spatial calculations as those postcodes will appear to be much larger than they are.

To fix this, I’ll take a rather heuristic approach. I’ll grid the UK into cells of 0.01° x 0.01° (approximately 1 sq km) resolution and find grid cells that have no postcodes in them:

GRIDRES = 0.01
min_lat, max_lat = np.round(min(df['latitude']), 2) - GRIDRES, max(df['latitude']) + GRIDRES
min_lon, max_lon = np.round(min(df['longitude']), 2) - GRIDRES, max(df['longitude']) + GRIDRES
print(min_lat, max_lat, min_lon, max_lon)

npostcodes = np.zeros([int(1 + (max_lat - min_lat)/GRIDRES),
                       int(1 + (max_lon - min_lon)/GRIDRES)])
for point in points:
    latno = int((point[0] - min_lat)/GRIDRES)
    lonno = int((point[1] - min_lon)/GRIDRES)
    npostcodes[latno, lonno] += 1
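As an aside, the counting loop can be collapsed into a single np.histogram2d call, which bins all the centroids at once. A sketch on synthetic points (the grid bounds here are invented; in the notebook they come from the data):

```python
import numpy as np

# bin postcode centroids into a 0.01-degree grid in one vectorized call
GRIDRES = 0.01
min_lat, max_lat = 51.00, 51.05
min_lon, max_lon = -0.05, 0.00

rng = np.random.default_rng(0)
points = np.column_stack([rng.uniform(min_lat, max_lat, 100),
                          rng.uniform(min_lon, max_lon, 100)])

nlat = round((max_lat - min_lat) / GRIDRES)
nlon = round((max_lon - min_lon) / GRIDRES)
counts, _, _ = np.histogram2d(points[:, 0], points[:, 1],
                              bins=[nlat, nlon],
                              range=[[min_lat, max_lat], [min_lon, max_lon]])

# cells with zero counts are where the fake "UNPOP" postcodes would go
empty_cells = np.argwhere(counts == 0)
print(int(counts.sum()))  # 100: every point falls in exactly one cell
```

On the full UK dataset this replaces a Python loop over ~1.7 million centroids with one NumPy call.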

unpop = []
for latno in range(len(npostcodes)):
    for lonno in range(len(npostcodes[latno])):
        if npostcodes[latno][lonno] == 0:
            # no postcodes here; make up a fake postcode for this grid cell
            unpop.append({
                'postcode': f'UNPOP {latno}x{lonno}',
                'latitude': min_lat + latno * GRIDRES,
                'longitude': min_lon + lonno * GRIDRES,
                'all_persons': 0
            })

We create a fake postcode at each such unpopulated grid cell and assign it a population of zero. Add these fake postcodes to the actual postcodes, and repeat the Voronoi analysis:

df2 = pd.concat([df, pd.DataFrame.from_records(unpop)])
points = df2[['latitude', 'longitude']].to_numpy()
vor = Voronoi(points)
df2['geometry_wkt'] = [make_polygon(x) for x in range(len(vor.point_region))]
df2['geometry'] = gpd.GeoSeries.from_wkt(df2['geometry_wkt'])
gdf = gpd.GeoDataFrame(df2, geometry='geometry')

Now, when we plot Birmingham, we get something much nicer:

Postcode polygons of Birmingham. Image generated by author

It is this dataframe that I saved as the second csv file:

gdf.to_csv("ukpostcodes.csv", index=False)

[Optional] Loading into BigQuery

We can load the CSV file into BigQuery and do some spatial analysis with it, but it is better to have BigQuery parse the last string column as a geometry first and have the data clustered by postcode:

CREATE OR REPLACE TABLE uk_public_data.postcode_popgeo2
CLUSTER BY postcode
AS

SELECT
* EXCEPT(geometry_wkt),
SAFE.ST_GEOGFROMTEXT(geometry_wkt, make_valid=>TRUE) AS geometry,
FROM uk_public_data.postcode_popgeo

Now, we can easily query it. For example, we can use ST_AREA for the postcodes:

SELECT 
COUNT(*) AS num_postcodes,
SUM(ST_AREA(geometry))/1e6 AS total_area_sqkm, /* ST_AREA returns m^2 */
SUM(all_persons) AS population,
area_name
FROM uk_public_data.postcode_popgeo2
GROUP BY area_name
ORDER BY population DESC

Summary

Spatial analysis often requires areal extent, not just point locations. In countries where postcodes are points/routes, there are infinitely many ways to generate a polygonal spatial extent for the postcodes. A reasonable approach is to use Voronoi regions to create polygons that contain those postcodes. However, if you do so, you will get unnaturally large polygons near lakes or parks where the post office does not deliver mail. To fix this, also grid the country and create artificial postcodes at unpopulated grid cells. In this article, I demonstrated how to do this for the UK. The associated notebook can be adapted to other places.

Next steps

  1. Check out the complete code at https://github.com/lakshmanok/lakblogs/blob/main/uk_postcode/uk_postcodes.ipynb
  2. Download the data from https://github.com/lakshmanok/lakblogs/tree/main/uk_postcode

Written by Lak Lakshmanan