Making Census Data Open, Accessible and Useful for Real People

Ahmad Barclay
Oct 12, 2020
Dot density mapping of housing types in London, house = blue, apartment = magenta (Credit: Ahmad Barclay/ONS)

This year and next are seeing 10-year national population and housing censuses taking place in scores of countries around the world (India, Brazil, Nigeria, USA, Germany, Turkey, etc.), and with these, we’ll be seeing huge volumes of data to try to understand and communicate.

Back in April this year, I joined the data visualisation team at the Office for National Statistics (ONS) in the UK as the first person working full-time to explore the kinds of visual outputs we will be developing for the 2021 Census here, contributing to a broad aim of “making census data open, accessible and useful for real people.” This has been a kind of deep-dive effort to get to grips with a bewildering array of possible formats and technologies available to us.

A couple of months after starting this effort, I had the opportunity — along with my ONS colleague Henry Lau — to host a Data Visualization Society Slack discussion to open up our own internal conversations with members of the DVS community.

This article shares some of the insights from this discussion, and also from the exploratory research and prototyping that we’ve been doing within the ONS. My aim here is to write the article that I wish I could have read all those months ago when I started work in my present role, to offer some starting points for people interested in visually communicating census data, and to identify where things may be headed in this space over the next couple of years.

Slack poll results showing that the majority of respondents have made use of census data in their work.

Why should we care about census data?

Navigating through the hierarchy of UK census geographies (Credit: Ahmad Barclay/ONS)

The level of granularity of the census data released by different countries varies a lot, but in the case of the UK, the smallest areas amount to roughly 100 households, or 200–300 individuals, and each of these areas may have upwards of 50 tables of data, covering age, gender, ethnicity, social class, employment, health, housing and a range of associated indicators.

In some countries under-counting of communities in the census may lead directly to a reduction in central government funding (Credit: Mona Chalabi)

Why do we gather all this data? Well, one primary driver for collecting such granular data is to allocate government resources in an evidence-led manner: for example, to make sure that there will be enough primary school places in an area that’s seen a recent baby boom, or that health services are properly funded where the population is getting older.

What does it mean to have this data? Aside from allocating government resources, census data helps us to build up a picture not only of how a country is changing over time, but also of the huge variation within a country, from city to city, rural to urban, or even from neighbourhood to neighbourhood, and what this tells us about our societies, for good or bad… And, without doubt, there is an important role for data visualisation in comprehending and communicating such a complex picture.

The art of the possible: formats

In terms of formats, we saw five broad typologies highlighted in the discussion, each of which I’ll describe below with examples:

  1. Exploratory platforms
  2. Micro-outputs
  3. Guided experiences
  4. Personalised/localised outputs
  5. Games/gamified outputs

1. Exploratory platforms

Left: DataShine UK census mapping (Credit: Oliver O’Brien & James Cheshire) | Right: Data USA area profile (Credit: Datawheel/Deloitte)

One of the biggest platforms for me was DataShine. You could map pretty much any question that was asked in the UK’s 2011 Census. — Henry Lau

It should be noted that the more effective exploratory tools and platforms tend to be designed around the kinds of insights that users are likely to gain from the data (e.g., trends, correlations or comparisons), rather than offering an entirely agnostic “chart builder”-type interface.

2. Micro-outputs

Left: “A day in the life of Americans” (Credit: Matt Hong) | Right: “200 Years of American Immigration” (Credit: Pedro Cruz, et al.)

Matt Hong’s data comic on “A day in the life of Americans” stands out to me. It’s not just because of the unconventional format, but it’s also how it humanised the statistics in the time use survey. — Alexandra Khoo

These micro-outputs are also valuable as a social media and mobile-friendly entry point into more exploratory formats. However, one related format that does seem to be in decline, at least on the web, is the large-scale static infographic, which is no longer accessible to the majority of web users, who browse primarily on mobile devices.

3. Guided experiences

Left: “Rich School, Poor School” (Credit: ABC Australia) | Middle: College mobility in the US (Credit: Raj Chetty/NY Times) | Right: “The Evolution of the American Census” (Credit: Alec Barret/The Pudding)

The ABC in Australia are the masters of turning our census data into scrollytelling pieces. — Darragh Murray

Scrollytelling has been pretty ubiquitous in dataviz for a while now, and we should probably be expecting new successor formats to emerge. But — regardless of the specific format — guided, narrative-driven dataviz is effective for its potential to bridge the divide between the freedom of “exploratory” dataviz, and the instant accessibility of “explanatory” formats like traditional infographics.

4. Personalised/localised outputs

Left: “How Many Households Are Like Yours” (Credit: NY Times) | Right: SALCO “semi-automated local content” (Credit: BBC News Labs)

This category includes “robo-journalism” — or “semi-automated journalism” — where data- and logic-driven article templates are used to produce hundreds or thousands of localised variants of a given story. There are also some interesting examples in this category of projects that take a big population dataset (like census) and allow users to find out more about people/households that are like them.
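To make the template idea concrete, here is a minimal sketch of how a single data- and logic-driven template might yield one sentence per area. The area names, the population figures and the wording are all invented for illustration, and production systems such as SALCO are far more elaborate than this.

```python
from string import Template

# Hypothetical figures for two made-up areas (not real census values).
AREAS = [
    {"name": "Area A", "pop_2011": 1000, "pop_2021": 1100},
    {"name": "Area B", "pop_2011": 2000, "pop_2021": 1900},
]

# One story template shared by every area.
STORY = Template("The population of $name $direction by $pct% between 2011 and 2021.")


def localise(area: dict) -> str:
    """Fill the template for one area, choosing the wording from the data."""
    change = area["pop_2021"] - area["pop_2011"]
    if change == 0:
        # The logic part of the template: a different sentence when nothing changed.
        return f"The population of {area['name']} was unchanged between 2011 and 2021."
    pct = round(abs(change) * 100 / area["pop_2011"], 1)
    direction = "rose" if change > 0 else "fell"
    return STORY.substitute(name=area["name"], direction=direction, pct=pct)


if __name__ == "__main__":
    for area in AREAS:
        print(localise(area))
```

The same function could be run over thousands of areas, which is what makes the approach attractive for local news.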

5. Games/gamified outputs

An interactive called “Spotlight” helped people to understand how Australia’s 2011 census data related to them as individuals (Credit: Australian Bureau of Statistics)

This is another broad category, covering data-driven games and quizzes, and also other kinds of outputs that have game-like qualities. These kinds of outputs may not be part of a core informational offering, but can often prove to be a valuable way of engaging a broader public/audience that might not see the census as relevant to them.

For example, my ONS colleagues had an experience where the release of a “how well do you know your area?” style quiz overloaded the website’s API due to unexpectedly heavy usage.

The art of the possible: technology

  1. Front-end libraries
  2. Back-end infrastructure
  3. Automation

1. Front-end

I use Svelte with d3 and it’s a blast […] Svelte has a ton of things built in that I would have needed to import a bunch of libraries for. — Sandy Guberti-Ng

WebGL offers the possibility to render hundreds of thousands of data points in the browser, and even to animate them in 2D or 3D with the benefit of hardware acceleration. This offers a potential step-change for big dataviz compared to SVG or Canvas rendering, and is already baked in to a few visualisation-focused libraries, for example Mapbox GL and Vis.gl (which powers Deck.gl and Kepler.gl), as well as the more all-purpose Three.js.

Left: High-performance rendering with Mapbox GL (Credit: Morphocode) | Middle: Mapping individual daily commutes in New Zealand using Three.js (Credit: Karim Douieb) | Right: Dot density mapping calculated on-the-fly in Mapbox GL (Credit: Ahmad Barclay/ONS)

In the case of pre-compiled JavaScript frameworks such as Svelte, the logic is that they do away with the “virtual DOM” abstraction layer used by frameworks like React and Vue, and compile your app down to only the raw “vanilla” JavaScript code it needs to run. This means that, in most scenarios, they can offer faster performance and a smaller file size. So you get the benefits of a component-based framework (importing all the bits you need from dataviz libraries like D3.js, plus a bunch of built-in features), but at the same time it rarely feels like an overkill approach, even for smaller projects.

2. Back-end

First, in the UK Census, we’re looking at a raw dataset of over 50 million “rows” of data, one for each individual resident, each containing scores of observations (on age, gender, employment, social class, ethnicity, health, housing, etc.), all of which have the potential to be cross-tabulated. Furthermore, this data always needs to be aggregated in a way that doesn’t disclose data that could identify individual people or households, so it can’t simply be delivered raw.
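As a rough sketch of what aggregating without disclosing can mean in practice, the example below counts person-level rows per cell and hides any cell with a very small count. The rows, the threshold of 3 and the suppression rule itself are assumptions made for this example. They are not a description of the disclosure control methods ONS actually applies.

```python
from collections import Counter

# Hypothetical person-level rows: (area code, age band). Invented for illustration.
ROWS = [
    ("A1", "0-15"), ("A1", "0-15"), ("A1", "0-15"),
    ("A1", "16-64"),
    ("A2", "16-64"), ("A2", "16-64"), ("A2", "16-64"),
    ("A2", "65+"),
]

MIN_COUNT = 3  # assumed threshold, chosen only for this example


def aggregate(rows, min_count=MIN_COUNT):
    """Count people per (area, age band) cell, replacing small counts with None."""
    counts = Counter(rows)
    return {cell: (n if n >= min_count else None) for cell, n in counts.items()}


if __name__ == "__main__":
    for cell, count in aggregate(ROWS).items():
        print(cell, "suppressed" if count is None else count)
```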

Within ONS, there are at least three parallel data architecture solutions being explored for the 2021 Census, with each one playing some kind of balancing act between speed and flexibility (e.g., allowing highly customised queries). I should note that these platforms aim to deliver data for a lot of different purposes, with dataviz apps only being one of them.

At one end of the spectrum, we are investigating pipelines (from simple batch scripts to cloud-based solutions) to pre-package chunks of data into flat files (JSON, CSV, etc.) that would allow us to return common datasets instantly without the need to query an API. At the other end of the spectrum, a custom-built “flexible table builder” would allow people to access pretty much any cut of data they’re interested in, with built-in controls to prevent individual households from being identified. This offers maximum flexibility, with some inevitable degree of tradeoff on speed.
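The flat-file end of that spectrum can be sketched as a very small batch script: take an already aggregated table and write one JSON file per area, so that a front end can fetch a file directly instead of querying an API. The function name, the file layout and the field names here are hypothetical and do not reflect any actual ONS pipeline.

```python
import json
from pathlib import Path


def prepackage(table: dict, out_dir: Path) -> list:
    """Write one small JSON file per area code and return the paths written."""
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for code, values in sorted(table.items()):
        path = out_dir / f"{code}.json"
        path.write_text(json.dumps({"code": code, "values": values}), encoding="utf-8")
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Hypothetical, already aggregated figures for two made-up areas.
    table = {"A1": {"population": 300}, "A2": {"population": 250}}
    for written in prepackage(table, Path("packaged")):
        print(written)
```

Because the files are static, they can be served from any web server or CDN, which is where the speed advantage comes from.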

There are also efforts being invested in solutions somewhere in the middle of this spectrum, for example an API that can flexibly aggregate small chunks of data from a graph database, while attempting to square the performance circle by caching results, so that any common query (like names/codes for geographic areas) can be returned almost instantaneously.
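The caching idea can be illustrated with a few lines of Python, using the standard library's `functools.lru_cache` as a stand-in for whatever cache a real API would use. The lookup table, the area codes and the call counter are all hypothetical. The dictionary simply plays the role of a slow database query.

```python
from functools import lru_cache

# Stand-in for a slow database; the codes and names are invented.
AREA_NAMES = {"A1": "Area One", "A2": "Area Two"}

# Counts how often the slow path actually runs.
CALLS = {"count": 0}


def slow_lookup(code: str) -> str:
    """Pretend to query the database for an area name."""
    CALLS["count"] += 1
    return AREA_NAMES[code]


@lru_cache(maxsize=1024)
def area_name(code: str) -> str:
    """Return an area name, hitting the slow path only on the first request."""
    return slow_lookup(code)


if __name__ == "__main__":
    area_name("A1")
    area_name("A1")  # served from the cache
    print(CALLS["count"], area_name.cache_info())
```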

3. Datasets for automation

Datasets for area profiles on Census Reporter are pre-generated to reduce the need for multiple API calls (Credit: Census Reporter)

Another topic that we’re exploring is how to deliver data formats suitable for semi-automated journalism. A typical use case here, to generate data-driven “area profiles” for hundreds or thousands of similar geographies, requires us to produce a data file for each of those places, which might include a series of indicators on population age, density and ethnicity, along with housing, health and labour market conditions. It should also show how these indicators have changed over time, and how they rank alongside all the other similar geographies and/or compare to regional- and national-level indicators.

Producing these datasets will almost certainly require “pipelines” (as referred to above) to pre-package data, since aggregating this data on-the-fly would mean requesting data from many different tables at once for an essentially static output, which simply doesn’t make sense at scale. The Knight Foundation-funded Census Reporter project adopts this kind of approach to formatting datasets for the tens of thousands of areas covered by its area profiles.
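A minimal sketch of the pre-computation behind such an area profile is shown below, for a single indicator. It attaches to each area its value, its rank among similar areas and its gap to a national figure. The indicator, the area codes and all numbers are invented, and change over time is left out to keep the example short.

```python
# Hypothetical median ages for three made-up areas.
MEDIAN_AGE = {"A1": 41.0, "A2": 35.5, "A3": 38.0}
NATIONAL_MEDIAN_AGE = 40.0  # assumed national figure, for illustration only


def build_profiles(values: dict, national: float) -> dict:
    """Return one profile per area: value, rank (1 = highest) and gap to national."""
    ordered = sorted(values, key=values.get, reverse=True)
    return {
        code: {
            "median_age": values[code],
            "rank": ordered.index(code) + 1,
            "of": len(values),
            "vs_national": round(values[code] - national, 1),
        }
        for code in values
    }


if __name__ == "__main__":
    for code, profile in build_profiles(MEDIAN_AGE, NATIONAL_MEDIAN_AGE).items():
        print(code, profile)
```

Each resulting profile is self-contained, so it can be written out as its own file and used to render a page without any further queries.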

The art of the useful

Detroit census response rate map (Credit: City of Detroit)

While most of DVS would consider this to be a very simplistic visualization, the overlay that includes internet access SIGNIFICANTLY repositioned the narrative on the digital divide in the City of Detroit. — Mary Aviles

We ran a poll on the Slack discussion. It was admittedly not the most representative of samples, but the results were interesting. Almost nobody favoured exploratory platforms, which have long been a staple of “big data” visualisation, and static “micro-outputs” won equally little support. Instead, the favourites were “guided experiences” and “personalised/localised outputs”.

Slack poll that shows respondents prefer guided experiences and personalised/localised outputs.

I think this response speaks to a growing realisation that there is a relatively low level of data literacy in the public at large, and that it is important to offer accessible stories and insights that can connect with the specific issues and places that matter to individual people.

It is also worth noting here that our early user testing suggests that the expert users capable of engaging with a more open-ended data exploration experience often prefer to download the raw data and visualise it themselves using whichever one of the vast array of dataviz tools on the market best suits their specific needs. Thus, it’s quite possible that custom-built exploratory dataviz platforms and chart builders may be serving an ever-diminishing audience.

However, at the other end of the spectrum, I tend to disagree with the apparent writing off of made-for-mobile/social media dataviz, which I would see as playing an important role in bringing insights to a wider general public audience. In this age of “fake news”, I would see clear and relatable visuals from trusted sources as an essential means to disseminate accurate information, as we have seen during the COVID-19 pandemic.

The future?

In relation to my own work, this is only the first step in a pretty long journey. In the UK, the next census is planned to take place in March 2021, with the data to be released sometime in mid-2022, and I think it’s fair to say that we can expect to see important innovations in this space between now and then.

Acknowledgements

And finally, of course, a big thank you to the many DVS members who contributed to the Slack discussion, including Mary Aviles, Alexandra Khoo, Amanda Makulec, Ben Oldenburg, Andrew Whitby, Nicole Edmonds, Darragh Murray, Phil Hawkins, Alok, Karl Tryggvason, Sandy Guberti-Ng, Guillaume Meigniez, Jan Zak, Steffen Kühne, Alan Wilson and Darren Lim.

Ahmad Barclay is an architect and UX designer presently working on census data visualisation with the Office for National Statistics (ONS) in the UK. He was previously a founding partner with the award-winning social impact dataviz studio Visualizing Impact and co-initiated the historical mapping and open data project Palestine Open Maps.

Nightingale

The Journal of the Data Visualization Society